ATranscriptionWidget & Virtual Keyboard: warn on input of surrogate chars

Transkribus / TranskribusSwtGui

Note: the repo has been moved to https://gitlab.com/readcoop/Transkribus/TranskribusSwtGui

GNU General Public License v3.0


This issue also affects other Unicode categories besides surrogates, e.g. the "unassigned" category.

CITlabTokenizer 1.0 used Java's built-in categorization, e.g. Character.isSurrogate(char ch), which implements the Unicode 6.2 specification. CITlabTokenizer 1.1.0 instead relies on an internal lookup loaded from a text file in order to support Unicode 12.1, in which character assignments were added and, in some cases, changed.
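For illustration, a JDK-only check along the lines of the 1.0 approach might look like the sketch below. The class and method names are made up for this example and are not part of Transkribus or the tokenizer. Which Unicode version `Character` reflects depends on the Java release in use, so the result can differ from the Unicode 12.1 lookup in tokenizer 1.1.0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Transkribus or CITlabTokenizer.
final class SuspiciousCharCheck {

    /** Returns the indices of chars the JDK classifies as surrogate or unassigned. */
    static List<Integer> findSuspicious(String input) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            // Character.getType returns the general category known to this JDK.
            boolean unassigned = Character.getType(ch) == Character.UNASSIGNED;
            if (Character.isSurrogate(ch) || unassigned) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // An emoji outside the BMP is stored as two surrogate chars.
        System.out.println(findSuspicious("a\uD83D\uDE00b")); // prints [1, 2]
    }
}
```

A widget could call such a method on pasted or typed text and show a warning when the returned list is non-empty.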

Tokenizer 1.1.0 is now included as a dependency of TranskribusCore and can be used for the checks described initially: de.uros.citlab.tokenizer.categorizer.CategorizerWordMergeGroups::getCategory(char c) throws a RuntimeException on illegal chars.
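One way the GUI could use this behaviour is to probe each char and collect those that make the categorizer throw. The sketch below is only an assumption about how that might be wired up: the categorizer is passed in as a plain function so that nothing is assumed about how CategorizerWordMergeGroups is constructed, the `String` return type is a placeholder, and `IllegalCharScanner` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of Transkribus or CITlabTokenizer.
final class IllegalCharScanner {

    /**
     * Collects every char for which the supplied categorizer throws a RuntimeException.
     * In Transkribus the function could delegate to getCategory(char c); here it is
     * an injected parameter, and its String return type is an assumption.
     */
    static List<Character> illegalChars(String input, Function<Character, String> categorizer) {
        List<Character> illegal = new ArrayList<>();
        for (char ch : input.toCharArray()) {
            try {
                categorizer.apply(ch);
            } catch (RuntimeException e) {
                illegal.add(ch); // the categorizer rejected this char
            }
        }
        return illegal;
    }

    public static void main(String[] args) {
        // Stand-in categorizer for demonstration only: it rejects surrogates.
        Function<Character, String> standIn = c -> {
            if (Character.isSurrogate(c)) {
                throw new IllegalArgumentException("surrogate");
            }
            return "ok";
        };
        System.out.println(illegalChars("a\uD83D\uDE00b", standIn).size()); // prints 2
    }
}
```

Catching the exception per char keeps the scan going after the first illegal char, so a warning dialog could list all offending chars at once.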

ATranscriptionWidget & Virtual Keyboard: warn on input of surrogate chars #277