aboSamoor / polyglot

Multilingual text (NLP) processing toolkit
http://polyglot-nlp.com
Other
2.31k stars, 337 forks

Tokenization malfunction after strange character #41

Closed remibolcom closed 8 years ago

remibolcom commented 8 years ago

Text("Edge cases " + chr(917631) + " can be annoying.").words
WordList(['Edge', 'cases', '\U000e007f', 'c', 'an', 'b', 'e', 'a', 'nnoying.'])

alantian commented 8 years ago

For some of its i18n functionality, Polyglot uses PyICU (https://github.com/ovalhub/pyicu), which is merely a Python wrapper around ICU (http://site.icu-project.org/). Some debugging shows that the error in the provided example occurs inside ICU, which is beyond Polyglot's reach, so there is currently little we can do.
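As a possible workaround on the caller's side (not something proposed in this thread), the offending character could be removed before the text reaches the tokenizer. The sketch below uses only the standard library. chr(917631) is U+E007F, which lies in the Unicode Tags block (U+E0000 to U+E007F). The helper names `describe` and `strip_tag_characters` are hypothetical, and whether dropping tag characters is acceptable depends on the input, since some emoji flag sequences rely on them.

```python
import unicodedata


def describe(ch):
    """Return (code point, general category, name) for a single character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.name(ch, "<unnamed>"),
    )


def strip_tag_characters(text):
    """Drop characters from the Unicode Tags block (U+E0000..U+E007F).

    Assumption: the input does not need these characters. This is a
    caller-side workaround sketch, not a fix for the ICU behaviour.
    """
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)


sample = "Edge cases " + chr(917631) + " can be annoying."
print(describe(chr(917631)))

cleaned = strip_tag_characters(sample)
print(repr(cleaned))
```

The cleaned string could then be passed to Text(...) as in the original report. Note that the two spaces surrounding the removed character are left in place.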