Open kahlep opened 5 years ago
This issue also affects other Unicode categories besides surrogates, e.g. the "unassigned" category.
CITlabTokenizer 1.0 used Java's built-in categorization, e.g. `Character.isSurrogate(char ch)`, which implements the Unicode 6.2 specification.
CITlabTokenizer 1.1.0 relies on an internal lookup from a text file to support Unicode 12.1, in which character assignments were added and, in some cases, changed.
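To illustrate why the Unicode version matters, here is a minimal sketch of an "unassigned" check based on Java's own tables. It is not the tokenizer's lookup; the class and method names are made up for this example, and the result for a given code point depends on the Unicode version shipped with the running JVM.

```java
final class UnassignedCheck {

    /**
     * True if the running JVM's Unicode tables list the code point as
     * unassigned (general category Cn). A newer Unicode version may
     * assign a code point that an older one reports as unassigned.
     */
    static boolean isUnassigned(int codePoint) {
        return Character.getType(codePoint) == Character.UNASSIGNED;
    }

    public static void main(String[] args) {
        // U+0041 LATIN CAPITAL LETTER A is assigned in every Unicode version.
        System.out.println(isUnassigned(0x0041));
        // U+0378 is a reserved gap in the Greek block.
        System.out.println(isUnassigned(0x0378));
    }
}
```

A code point that was added after the Unicode version known to the JVM would be reported as unassigned here while a lookup based on a newer version accepts it, which is the kind of mismatch described above.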
Tokenizer 1.1.0 is now included as a dependency in TranskribusCore and can be used for the checks described initially: `de.uros.citlab.tokenizer.categorizer.CategorizerWordMergeGroups::getCategory(char c)` throws a RuntimeException on illegal chars.
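A rough sketch of how such an exception-based check could be wrapped to locate the offending character. The `CharCategorizer` interface below is a hypothetical stand-in for the tokenizer's categorizer, since only the `getCategory(char c)` signature and the RuntimeException are known from this thread; the stub merely mimics that behaviour for surrogates.

```java
final class IllegalCharCheck {

    /** Hypothetical stand-in for a categorizer exposing getCategory(char). */
    interface CharCategorizer {
        Object getCategory(char c);
    }

    /** Returns the index of the first char the categorizer rejects, or -1 if none. */
    static int firstIllegalIndex(String text, CharCategorizer categorizer) {
        for (int i = 0; i < text.length(); i++) {
            try {
                categorizer.getCategory(text.charAt(i));
            } catch (RuntimeException e) {
                // The categorizer signals illegal chars by throwing.
                return i;
            }
        }
        return -1;
    }

    /** Stub for illustration only: rejects surrogate code units, accepts the rest. */
    static CharCategorizer surrogateRejectingStub() {
        return c -> {
            if (Character.isSurrogate(c)) {
                throw new IllegalArgumentException("illegal char: " + (int) c);
            }
            return "ok";
        };
    }

    public static void main(String[] args) {
        CharCategorizer stub = surrogateRejectingStub();
        System.out.println(firstIllegalIndex("abc", stub));            // prints -1
        System.out.println(firstIllegalIndex("ab\uD83D\uDE00", stub)); // prints 2
    }
}
```

In real use, the stub would be replaced by a call into the tokenizer's categorizer; how that object is obtained is not covered here.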
CITlab HTR(+) training will drop all lines that contain Unicode surrogates (see Transkribus/TranskribusAppServerModules#59). For each dropped line, a JobError is stored and shown in the job overview.
Users should be warned about this restriction when such a character is entered in the transcription widget (copy-paste?) and possibly via the virtual keyboard (if it allows mapping surrogate chars).
The check can be done with `Character.isSurrogate(char ch)`.
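A minimal sketch of such a check on a transcription line, assuming the text is available as a Java `String`. The class and method names are illustrative only and not part of Transkribus.

```java
import java.util.ArrayList;
import java.util.List;

final class SurrogateCheck {

    /** Returns the indices of all UTF-16 surrogate code units in the line. */
    static List<Integer> surrogateIndices(String line) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < line.length(); i++) {
            if (Character.isSurrogate(line.charAt(i))) {
                hits.add(i);
            }
        }
        return hits;
    }

    /** True if the line would be affected by the restriction described above. */
    static boolean containsSurrogate(String line) {
        return !surrogateIndices(line).isEmpty();
    }

    public static void main(String[] args) {
        String plain = "Gr\u00FC\u00DFe";          // BMP characters only
        String withEmoji = "ok \uD83D\uDE00";      // U+1F600 as a surrogate pair
        System.out.println(containsSurrogate(plain));    // prints false
        System.out.println(surrogateIndices(withEmoji)); // prints [3, 4]
    }
}
```

Returning the indices rather than a plain boolean would let the UI highlight the offending characters when warning the user.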