GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish
http://sadedegel.ai
MIT License

Fix Tokenizer for Correct Tokenization of Texts with Emojis #272

Closed: irmakyucel closed this issue 3 years ago

irmakyucel commented 3 years ago

During the error analysis of the Product Review Sentiment Model, I saw that reviews containing emojis were tokenized incorrectly. One such example of this incorrect tokenization is 👍basarili ve kaliteli bir urun . > [👍b, asarili, v, e, k, aliteli, b, ir, u, run, .]. There are various cases similar to this one.

I will be exploring this issue to understand the underlying cause and try to fix it accordingly.
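For reference, a minimal reproduction using the Doc API (shown later in this thread) might look like the session below; the broken token list is the one reported above, not a fresh run.

In [1]: from sadedegel import Doc

In [2]: Doc("👍basarili ve kaliteli bir urun .").tokens
Out[2]: [👍b, asarili, v, e, k, aliteli, b, ir, u, run, .]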

husnusensoy commented 3 years ago

For this type of bug we have implemented exception rule handling. A simple solution might be to create a user.ini file in the ~/.sadedegel directory and set emoji=true in the [tokenizer] section. Once that is done, this is the output:

In [1]: from sadedegel.about import __version__

In [2]: __version__
Out[2]: '0.20.1'

In [3]: from sadedegel import Doc

In [4]: Doc("👍basarili ve kaliteli bir urun .").tokens
Out[4]: [👍, basarili, ve, kaliteli, bir, urun, .]
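For reference, the user.ini exception rule described above can also be written with a few lines of Python. This is only a sketch: the ~/.sadedegel/user.ini path, the [tokenizer] section, and the emoji key are taken from the comment above, and the snippet overwrites any existing user.ini rather than merging with it.

import configparser
from pathlib import Path

# Build the [tokenizer] section with emoji=true, as described above.
config = configparser.ConfigParser()
config["tokenizer"] = {"emoji": "true"}

# Write it to ~/.sadedegel/user.ini, creating the directory if needed.
# Note: this replaces any existing user.ini instead of merging with it.
path = Path.home() / ".sadedegel" / "user.ini"
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w") as f:
    config.write(f)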