Closed nljubesi closed 2 years ago
This was reported by @mrspock434.
And this seems to be an obeliks issue, so assigning @msinkec here as well. Miha, \x1f seems to break obeliks on the conllu generation step.
Is this character supposed to be an n with tilde? https://www.fileformat.info/info/unicode/char/f1/index.htm
The problematic character code was written incorrectly in the title, but it is correct in the message. \x1f is an ASCII control character (unit separator). It probably occurred as noise, but it should still not be able to bring down the CoNLL-U generation process of obeliks.
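To make the distinction between the two code points concrete, here is a small sketch using only Python's standard `unicodedata` module. The helper name `describe` is made up for illustration and is not part of obeliks.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return code point, Unicode category, and name (if any) of one character."""
    # Control characters have no Unicode name, so supply a fallback.
    name = unicodedata.name(ch, "<no name>")
    return f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}"

print(describe("\x1f"))  # U+001F Cc <no name>
print(describe("\xf1"))  # U+00F1 Ll LATIN SMALL LETTER N WITH TILDE
```

So \x1f falls in category Cc (control), while \xf1 is the lowercase letter n with tilde.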
Should these control characters be excluded from the result or kept in? They currently get marked as a punctuation character.
@simonkrek ? We are discussing the \x1f control character that brings down obeliks during conllu generation. It was reported by UM people, who got it from some PDF conversion.
I think these should rather be removed. They are highly infrequent and, in a reasonable setup, should be dealt with via pre-processing. However, they should not be able to break our tools.
I agree. Non-printable control characters such as \x1f should be removed by Obeliks.
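For anyone who needs a workaround before upgrading, a pre-processing step along these lines might do. This is only a sketch and not necessarily how Obeliks itself handles it. The function name `strip_control_chars` and the choice to keep tab, newline and carriage return are my assumptions.

```python
import unicodedata

# Assumption: these whitespace controls carry layout and should survive.
KEEP = {"\t", "\n", "\r"}

def strip_control_chars(text: str) -> str:
    """Drop characters in Unicode category Cc, except those listed in KEEP."""
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) != "Cc"
    )

cleaned = strip_control_chars("a:\x1fb\n")
print(repr(cleaned))  # 'a:b\n'
```

Running the input through this before tokenisation removes \x1f and similar noise while leaving line structure intact.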
Should be fixed in Obeliks 1.1.4 https://github.com/clarinsi/obeliks/releases/tag/1.1.4
It is. Closing.
Describe the bug
The obeliks tokenizer breaks down on the following sequence: :\x1f. The first character can be any character. The breakage occurs during the preparation of conllu output.
To Reproduce
Additional context
reldi-tokeniser does not break. obeliks in CLI (if conllu is not produced) does not break either.
Could there be a solution that would catch such cases and generate conllu output regardless of that?