clarinsi / classla

CLASSLA Fork of the Official Stanford NLP Python Library for Many Human Languages
https://www.clarin.si/info/k-centre/

\x1f breaks obeliks + conllu generation #30

Closed nljubesi closed 2 years ago

nljubesi commented 2 years ago

Describe the bug

The obeliks tokenizer breaks down on the following sequence: :\x1f. The first character can be any character. The breakage occurs during the preparation of conllu output.

To Reproduce

import classla
nlp=classla.Pipeline('sl')
print(nlp('a\x1f').to_conll())

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nikola/miniconda3/lib/python3.7/site-packages/classla/pipeline/core.py", line 167, in __call__
    doc = self.process(doc)
  File "/Users/nikola/miniconda3/lib/python3.7/site-packages/classla/pipeline/core.py", line 161, in process
    doc = self.processors[processor_name].process(doc)
  File "/Users/nikola/miniconda3/lib/python3.7/site-packages/classla/pipeline/tokenize_processor.py", line 86, in process
    raw_text, document, metadocument = self._tokenizer.tokenize(document)
  File "/Users/nikola/miniconda3/lib/python3.7/site-packages/classla/utils/obeliks.py", line 38, in tokenize
    for doc in obeliks.run(raw_text, object_output=True):
  File "/Users/nikola/miniconda3/lib/python3.7/site-packages/obeliks/tokenizer.py", line 348, in run
    out = process_text(text, os, None, conllu, pass_newdoc_id, object_output=object_output)
  File "/Users/nikola/miniconda3/lib/python3.7/site-packages/obeliks/tokenizer.py", line 276, in process_text
    out = process_conllu(line, np, os, object_output=object_output)
  File "/Users/nikola/miniconda3/lib/python3.7/site-packages/obeliks/tokenizer.py", line 136, in process_conllu
    attribs = parse_attribs(match.group(2))
  File "/Users/nikola/miniconda3/lib/python3.7/site-packages/obeliks/tokenizer.py", line 246, in parse_attribs
    key, val = token.split('=', 1)
ValueError: not enough values to unpack (expected 2, got 1)
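
The last frame fails on tuple unpacking: `token.split('=', 1)` returns a single-element list when the token contains no `=`. The sketch below reproduces only that failure mode in isolation and shows one tolerant alternative. Both functions and their list-of-strings input are hypothetical illustrations; they are not the actual obeliks `parse_attribs` code, and the tolerant variant is not necessarily how obeliks resolved the issue.

```python
def parse_attribs_strict(tokens):
    """Hypothetical stand-in that mirrors the failing unpacking pattern."""
    attribs = {}
    for token in tokens:
        # Raises ValueError when the token has no '=' (split returns one item).
        key, val = token.split('=', 1)
        attribs[key] = val
    return attribs


def parse_attribs_tolerant(tokens):
    """Hypothetical variant that skips tokens without a key=value shape."""
    attribs = {}
    for token in tokens:
        # partition always returns three parts; sep is '' if '=' is absent.
        key, sep, val = token.partition('=')
        if sep:
            attribs[key] = val
    return attribs
```

Skipping malformed tokens keeps the output well-formed, at the cost of silently dropping whatever produced them.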

Additional context

reldi-tokeniser does not break. obeliks in CLI (if conllu is not produced) does not break either.

Could there be a solution that would catch such cases and generate conllu output regardless of that?

nljubesi commented 2 years ago

This was reported by @mrspock434.

nljubesi commented 2 years ago

And this seems to be an obeliks issue, so assigning @msinkec here as well. Miha, \x1f seems to break obeliks on the conllu generation step.

msinkec commented 2 years ago

Is this character supposed to be an n with tilde? https://www.fileformat.info/info/unicode/char/f1/index.htm

nljubesi commented 2 years ago

The problematic character code was written wrongly in the title, but is correct in the message. \x1f is an ASCII control character (unit separator). It probably occurred as noise, but it should still not be able to bring down the CONLL-U generation process of obeliks.

msinkec commented 2 years ago

Should these control characters be excluded from the result or kept in? They currently get marked as a punctuation character.

nljubesi commented 2 years ago

@simonkrek? We are discussing the \x1f control character that brings down obeliks during conllu generation. Reported by UM people, they got it from some PDF conversion.

I think these should rather be removed. They are highly infrequent and should be dealt with, in a reasonable setup, via pre-processing. However, they should not be able to break our tools.
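
For pipelines pinned to an affected version, a pre-processing step along these lines could strip such characters before the text reaches the tokenizer. This is a minimal sketch, not part of classla or obeliks: the helper name `strip_control_chars` and the choice to keep newline, carriage return and tab are assumptions.

```python
import unicodedata

# Assumption: these control characters carry layout and should survive.
KEEP = {'\n', '\r', '\t'}


def strip_control_chars(text):
    """Drop characters in Unicode category Cc (control), except those in KEEP."""
    return ''.join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) != 'Cc'
    )
```

With this in place, the reproduction above would call `nlp(strip_control_chars('a\x1f'))` instead of `nlp('a\x1f')`. Removing the character outright can glue neighbouring tokens together; replacing it with a space is an alternative if that matters for the data.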

simonkrek commented 2 years ago

I agree. Non-printable control characters such as \x1f should be removed by Obeliks.

msinkec commented 2 years ago

Should be fixed in Obeliks 1.1.4 https://github.com/clarinsi/obeliks/releases/tag/1.1.4

nljubesi commented 2 years ago

It is. Closing.