Transkribus / TranskribusSwtGui

Note: the repo has been moved to https://gitlab.com/readcoop/Transkribus/TranskribusSwtGui
GNU General Public License v3.0

ATranscriptionWidget & Virtual Keyboard: warn on input of surrogate chars #277

Open kahlep opened 5 years ago

kahlep commented 5 years ago

CITlab HTR(+) training drops every line that contains Unicode surrogates (see Transkribus/TranskribusAppServerModules#59). For each dropped line, a JobError is stored and shown in the job overview.

The user should be warned about this restriction when such a character is entered in the transcription widget (copy-paste?) and possibly via the virtual keyboard (if it allows surrogate chars to be mapped). The check can be done with Character.isSurrogate(char ch).
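A minimal sketch of such a check, assuming the widget can hand over the entered or pasted text as a String. The class and method names (SurrogateCheck, findSurrogateIndices, containsSurrogate) are hypothetical and not part of TranskribusSwtGui; only Character.isSurrogate(char) comes from the JDK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Transkribus code base.
final class SurrogateCheck {
    private SurrogateCheck() {}

    /** Returns the UTF-16 indices in text that hold a surrogate code unit. */
    static List<Integer> findSurrogateIndices(String text) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            // True for both high and low surrogates (U+D800..U+DFFF).
            if (Character.isSurrogate(text.charAt(i))) {
                hits.add(i);
            }
        }
        return hits;
    }

    /** True if the text contains at least one surrogate code unit. */
    static boolean containsSurrogate(String text) {
        return !findSurrogateIndices(text).isEmpty();
    }
}
```

The returned indices could be used to highlight the offending characters in the widget before the warning is shown. Note that every character outside the Basic Multilingual Plane is stored as a surrogate pair, so it yields two consecutive indices.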

kahlep commented 5 years ago

This issue also affects other Unicode categories besides surrogates, e.g. the "unassigned" category.

The CITlabTokenizer 1.0 used the categorization in Java, e.g. Character.isSurrogate(char ch), which implements the Unicode 6.2 specification. CITlabTokenizer 1.1.0 relies on an internal lookup from a text file to support Unicode 12.1, where character assignments were added but also changed in some cases.
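For illustration, a JDK-only approximation that also covers the "unassigned" category could look like the sketch below. It uses Character.getType(char), whose result depends on the Unicode version shipped with the running JDK, so it may disagree with the tokenizer's own lookup table for individual characters. The class and method names are hypothetical.

```java
// Hypothetical helper; approximates the check with the JDK's own Unicode tables.
final class CharCategoryCheck {
    private CharCategoryCheck() {}

    /** True if the JDK classifies the char as a surrogate or as unassigned. */
    static boolean isProblematic(char ch) {
        int type = Character.getType(ch);
        return type == Character.SURROGATE || type == Character.UNASSIGNED;
    }

    /** Returns the index of the first problematic char, or -1 if there is none. */
    static int firstProblematicIndex(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (isProblematic(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }
}
```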

The tokenizer 1.1.0 is now included as a dependency with TranskribusCore and can be used for the checks described initially: de.uros.citlab.tokenizer.categorizer.CategorizerWordMergeGroups::getCategory(char c) will throw a RuntimeException on illegal chars.
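A rough sketch of how the GUI might collect the rejected characters by relying on that exception. To stay self-contained it does not call the tokenizer directly: CharCategorizer is a hypothetical stand-in interface, and its return type is an assumption, since only the method name, the char parameter, and the RuntimeException are stated above. In real code the call would be delegated to CategorizerWordMergeGroups.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the tokenizer's categorizer; the return type is an assumption.
interface CharCategorizer {
    Object getCategory(char c);
}

// Hypothetical helper, not part of the Transkribus code base.
final class IllegalCharCollector {
    private IllegalCharCollector() {}

    /** Returns the indices of all chars for which the categorizer throws. */
    static List<Integer> findRejected(String text, CharCategorizer categorizer) {
        List<Integer> rejected = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            try {
                categorizer.getCategory(text.charAt(i));
            } catch (RuntimeException e) {
                // The categorizer signals an illegal char by throwing.
                rejected.add(i);
            }
        }
        return rejected;
    }
}
```

Using the tokenizer itself as the oracle, rather than a separate JDK-based check, would keep the warning in the GUI consistent with what the training job actually drops.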