GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish
http://sadedegel.ai
MIT License

Standardized tokenizer #214

Closed by askarbozcan 3 years ago

askarbozcan commented 3 years ago

As of now we have two tokenizers, namely BERTTokenizer and SimpleTokenizer, with the default being the BERT tokenizer (a WordPiece tokenizer in actuality).

However, both have issues:

For this purpose I propose:

  1. Creating a (relatively) small hand-tokenized dataset.
  2. Measuring SimpleTokenizer's performance on this dataset (see the evaluation sketch after this list).
  3. Improving SimpleTokenizer to cover almost all cases from the dataset.
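A minimal sketch of what the measurement in step 2 could look like, assuming the gold standard is stored as hand-tokenized token lists per sentence. The function name, data, and metric choice here are illustrative and not part of sadedegel:

```python
from collections import Counter

def token_f1(gold_tokens, predicted_tokens):
    """Token-level precision/recall/F1 between a hand-tokenized gold
    sentence and a tokenizer's output (multiset overlap of tokens)."""
    gold, pred = Counter(gold_tokens), Counter(predicted_tokens)
    overlap = sum((gold & pred).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(gold.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical usage with the addendum example below:
gold = ["Ben", "İstanbul", "'", "a", "gittim", ".", "."]
pred = ["Ben", "İstanbul'a", "gittim", ".."]  # e.g. a naive whitespace split
print(token_f1(gold, pred))
```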

Alternative:

  1. Test out ICUTokenizer (https://pypi.org/project/icu-tokenizer/)

The main reason is that many algorithms, whether spelling correction or FastText, expect a word as input and will work quite poorly with BERT (WordPiece) tokens.

Addendum: an example of the expected tokenization.

"Ben İstanbul'a gittim.." => ["Ben", "İstanbul"," ' ", "a", "gittim" , ".", "."]

husnusensoy commented 3 years ago

You're right, and we are already there. One of the purposes of tscorpus has actually been tokenizer evaluation since 0.16. Please ensure that you have checked tokenizer.md.

We have already performed icu tests (what they call boundary detection, in general for both words and sentences) in a feature branch.

The challenge is that icu is a Python C/C++ binding, which makes things a bit complicated for wheel (PyPI) users (a solution exists for conda).
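For reference, word boundary detection with the PyICU binding mentioned above looks roughly like this. This is a sketch assuming PyICU is installed, not sadedegel's integration code:

```python
import icu  # PyICU, the C/C++ binding mentioned above

def icu_word_tokenize(text, locale="tr"):
    """Collect word tokens from ICU's word boundary iterator,
    skipping whitespace-only segments."""
    bi = icu.BreakIterator.createWordInstance(icu.Locale(locale))
    bi.setText(text)
    tokens, start = [], bi.first()
    for end in bi:
        piece = text[start:end]
        if piece.strip():
            tokens.append(piece)
        start = end
    return tokens

print(icu_word_tokenize("Ben İstanbul'a gittim.."))
```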

We WILL/SHOULD make that addition, but we need to ensure that it does not break anything.

0.18 is almost there. I believe icu will be the major change for 0.19 (maybe we will even deprecate the simple tokenizer).

Thanx buddy

husnusensoy commented 3 years ago

We are done with this also. icu is on board :)