GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish
http://sadedegel.ai
MIT License

Standardized tokenizer #214

Closed by askarbozcan 3 years ago

askarbozcan commented 3 years ago

As of now we have two tokenizers, namely BERTTokenizer and SimpleTokenizer, with the default being the BERT tokenizer (a WordPiece tokenizer in actuality).

However, both have issues:

For this purpose I propose:

  1. Creating a (relatively) small hand-tokenized dataset.
  2. Measuring SimpleTokenizer's performance on this dataset (see the evaluation sketch after this list).
  3. Improving SimpleTokenizer to cover almost all cases from the dataset.
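A minimal sketch of what the measurement in step 2 could look like, assuming the gold standard is stored as hand-tokenized token lists per sentence. The function name, data, and metric choice here are illustrative and not part of sadedegel:

```python
from collections import Counter

def token_f1(gold_tokens, predicted_tokens):
    """Token-level precision/recall/F1 between a hand-tokenized gold
    sentence and a tokenizer's output (multiset overlap of tokens)."""
    gold, pred = Counter(gold_tokens), Counter(predicted_tokens)
    overlap = sum((gold & pred).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(gold.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical usage with the addendum example below:
gold = ["Ben", "İstanbul", "'", "a", "gittim", ".", "."]
pred = ["Ben", "İstanbul'a", "gittim", ".."]  # e.g. a naive whitespace split
print(token_f1(gold, pred))
```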

Alternative:

  1. Test out ICUTokenizer (https://pypi.org/project/icu-tokenizer/)

The main reason is that many algorithms, whether spelling correction or FastText, expect a word as input and will work quite poorly with BERT (WordPiece) tokens.

Addendum: an example of the expected tokenization.

"Ben İstanbul'a gittim.." => ["Ben", "İstanbul"," ' ", "a", "gittim" , ".", "."]

husnusensoy commented 3 years ago

You're right, and we are already there. One of the purposes of tscorpus has actually been tokenizer evaluation since 0.16. Please ensure that you have checked tokenizer.md.

We have already performed icu tests (what they call boundary detection, in general for both words and sentences) in a feature branch.

The challenge is that icu is a Python C/C++ binding, which makes things a bit complicated for wheel (PyPI) users (a solution exists for conda).
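For reference, word boundary detection with the PyICU binding mentioned above looks roughly like this. This is a sketch assuming PyICU is installed, not sadedegel's integration code:

```python
import icu  # PyICU, the C/C++ binding mentioned above

def icu_word_tokenize(text, locale="tr"):
    """Collect word tokens from ICU's word boundary iterator,
    skipping whitespace-only segments."""
    bi = icu.BreakIterator.createWordInstance(icu.Locale(locale))
    bi.setText(text)
    tokens, start = [], bi.first()
    for end in bi:
        piece = text[start:end]
        if piece.strip():
            tokens.append(piece)
        start = end
    return tokens

print(icu_word_tokenize("Ben İstanbul'a gittim.."))
```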

We WILL/SHOULD make that addition, but we need to ensure that it does not break anything.

0.18 is almost there. I believe icu will be the major change for 0.19 (maybe we will even deprecate the simple tokenizer).

Thanx buddy

husnusensoy commented 3 years ago

We are done with this also. icu is on board :)