Closed gudgyo closed 3 years ago
Thanks for the report! The trailing whitespace issue is definitely a bug that we can fix easily.
The Bulgarian error looks like a separate issue within stanza, though? You could try with plain stanza to see if you get that error with the same text?
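A minimal sketch of that check, assuming stanza is installed and the Bulgarian models were downloaded beforehand (for example with stanza.download("bg")). The helper name run_plain_stanza is hypothetical; stanza.Pipeline is the library's usual entry point. The import is deferred so the function can be defined even where stanza is not installed.

```python
def run_plain_stanza(text, lang="bg", use_gpu=True):
    """Run text through stanza directly, bypassing the spacy-stanza wrapper."""
    import stanza  # deferred import: only needed when the check is actually run

    # Assumes the models for `lang` are already available locally.
    nlp = stanza.Pipeline(lang=lang, use_gpu=use_gpu)
    return nlp(text)
```

If the same text fails here too, the error most likely originates in stanza rather than in the wrapper.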
Yes, the Bulgarian error remains with plain stanza, and seems to be language-specific. Note: the trailing whitespace issue occurs when the string ends with at least two whitespace characters.
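Until a fix is released, one possible workaround is to strip trailing whitespace before the text reaches the tokenizer. This is an assumption based on the failing inputs in this report, not something confirmed by the maintainers, and strip_trailing_whitespace is a hypothetical helper name.

```python
def strip_trailing_whitespace(text):
    """Return text without trailing spaces, tabs, or newlines."""
    return text.rstrip()


# Hypothetical usage with a loaded pipeline: doc = nlp(strip_trailing_whitespace(raw_text))
print(repr(strip_trailing_whitespace("example\n")))
print(repr(strip_trailing_whitespace("example  ")))
```

Note that a whitespace-only string becomes empty after stripping, so that case may need separate handling, and character offsets into the original text no longer cover the removed tail.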
Versions: spacy-stanza 0.2.4, stanza 1.1.1
Description: The following string throws an error in the tokenizer: "?\n"
How to reproduce the error:
Update: Any given character followed by a newline '\n' and no other character produces the same error, e.g.:
nlp("example\n") -> error
nlp("example2\n ") -> error
nlp("example\nend") -> runs

Update 2: A character followed by two spaces also produces the same error, although for some reason special characters followed by two spaces run, e.g.:
nlp("example  ") -> error
nlp("example2  ") -> error
nlp("\n  ") -> runs
nlp("\t  ") -> runs
Error:
Update 3: This particular string produces an error (language: Bulgarian, GPU: True, all processors used): "Думи и срички: Горско училище ......................9 Буквен етап • "
Error:
However, after deleting a single dot from the string, we get the following warning instead of the error:
With the tokenized output: "Думи и срички : Горско училище ...... . . . . . . . . . . ...9 Буквен етап •"
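The inputs listed in the updates above can be re-checked in one pass with a small harness. This is only a sketch: it assumes nlp is any callable pipeline (for example one built with spacy-stanza or plain stanza). The names check_inputs and fake_nlp are hypothetical, and fake_nlp is a stand-in that lets the sketch run without models; it does not reproduce the library's behavior.

```python
def check_inputs(nlp, texts):
    """Call nlp on each text and record whether it ran or raised."""
    results = {}
    for text in texts:
        try:
            nlp(text)
            results[text] = "runs"
        except Exception as exc:  # broad on purpose: we only classify outcomes
            results[text] = f"error: {type(exc).__name__}"
    return results


def fake_nlp(text):
    # Stand-in for a real pipeline (an assumption for illustration only):
    # it raises on empty input and otherwise splits on whitespace.
    if not text:
        raise ValueError("empty input")
    return text.split()


for text, outcome in check_inputs(fake_nlp, ["example\n", ""]).items():
    print(repr(text), "->", outcome)
```

Passing the real pipeline in place of fake_nlp, together with the strings from this report, gives a quick table of which inputs still fail after an upgrade.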