aboSamoor / polyglot

Multilingual text (NLP) processing toolkit
http://polyglot-nlp.com

Emojis break word tokenizer #183

Open zafercavdar opened 5 years ago

zafercavdar commented 5 years ago

I was using Polyglot to tokenize Swedish documents that contain hashtags. I noticed that, for every hashtag that comes after an emoji, the tokenizer splits off the first letter as a separate word. For example, for the following document,

from polyglot.text import Text

doc = "#hållbarhet #miljövänligt 😊  #ecofashion #levaobo #miljösmart #influencers"
words = Text(doc, hint_language_code='sv').words
print(words)

it prints

WordList(['#', 'hållbarhet', '#', 'miljövänligt', '😊', '#', 'e', 'cofashion', '#', 'l', 'evaobo', '#', 'm', 'iljösmart', '#', 'i', 'nfluencers'])

While #hållbarhet (before the emoji) is tokenized as ["#", "hållbarhet"], #ecofashion (after the emoji) turns into ["#", "e", "cofashion"].
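
To make the expected and observed behaviour easier to compare, here is a minimal standard-library sketch. It does not call Polyglot; the regex segmenter and the helper names (`SEGMENT`, `utf16_len`, `tokens`) are stand-ins I made up for illustration. It also encodes one unconfirmed hypothesis: that token boundaries are computed in UTF-16 code units (where 😊 takes two units) but then applied as Python string indices (where 😊 is one character), which would shift every slice after the emoji by one.

```python
import re

# Stand-in segmenter (NOT Polyglot's tokenizer): word runs, whitespace runs,
# and single punctuation/symbol characters such as '#' or an emoji.
SEGMENT = re.compile(r"\w+|\s+|[^\w\s]")


def utf16_len(s):
    """Return the number of UTF-16 code units needed to encode s."""
    return len(s.encode("utf-16-le")) // 2


def tokens(text, use_utf16_offsets):
    """Slice text at segment boundaries and drop whitespace-only pieces.

    With use_utf16_offsets=True the boundaries are measured in UTF-16 code
    units but applied as Python (code point) indices. This mismatch is the
    assumed cause, not something confirmed in this issue.
    """
    out = []
    for match in SEGMENT.finditer(text):
        start, end = match.start(), match.end()
        if use_utf16_offsets:
            start = utf16_len(text[:start])
            end = utf16_len(text[:end])
        piece = text[start:end].strip()
        if piece:
            out.append(piece)
    return out


doc = "#hållbarhet #miljövänligt 😊  #ecofashion #levaobo #miljösmart #influencers"

expected = tokens(doc, use_utf16_offsets=False)  # what I would expect to get
observed = tokens(doc, use_utf16_offsets=True)   # simulated off-by-one slicing

print(expected)
print(observed)
```

Under these assumptions, `expected` keeps every hashtag word intact, while `observed` contains the same split tokens as the WordList above. If the hypothesis holds, any character outside the Basic Multilingual Plane, not only emojis, should trigger the same shift.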