You're right, and we are already there. One of the purposes of tscorpus has actually been tokenizer evaluation since 0.16. Please make sure you have checked tokenizer.md.
We have already performed ICU tests (what they call boundary detection in general, for both words and sentences) in a feature branch.
The challenge is that ICU is a Python C/C++ binding, which makes things a bit complicated for wheel/PyPI users (a solution exists for conda).
We WILL/SHOULD make that addition, but we need to ensure that it does not break anything.
0.18 is almost there. I believe ICU will be the major change for 0.19 (maybe we will even deprecate the simple tokenizer).
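For reference, word boundary detection through the PyICU binding looks roughly like the sketch below. This is only an illustration of the ICU API, not the code in the feature branch; the helper name icu_word_tokenize and the Turkish locale default are assumptions.

```python
from icu import BreakIterator, Locale

def icu_word_tokenize(text: str, locale: str = "tr"):
    """Split text into word tokens using ICU word boundary analysis (sketch)."""
    bi = BreakIterator.createWordInstance(Locale(locale))
    bi.setText(text)
    tokens, start = [], bi.first()
    for end in bi:                  # iterating yields successive boundary offsets
        piece = text[start:end]
        if piece.strip():           # skip pure-whitespace segments
            tokens.append(piece)
        start = end
    return tokens

print(icu_word_tokenize("Ben İstanbul'a gittim.."))
```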
Thanks, buddy.
We are done with this as well. ICU is on board :)
As of now we have two tokenizers, namely BERTTokenizer and SimpleTokenizer, with the default being the BERT tokenizer (a WordPiece tokenizer, in actuality).
However, both have issues:
For this purpose, I propose:
Alternative:
The main reason is that many algorithms, whether spelling correction or FastText, expect a word as their input and will work quite poorly with BERT (WordPiece) tokens.
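To make the contrast concrete, the sketch below shows how a WordPiece tokenizer emits sub-word pieces (marked with a "##" prefix) rather than whole words. The Hugging Face checkpoint name here is only an illustrative assumption, not necessarily what the library uses.

```python
from transformers import AutoTokenizer

# Example Turkish BERT checkpoint (an assumption, for illustration only).
tok = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

# Words outside the WordPiece vocabulary get split into sub-word pieces
# prefixed with "##", which is awkward input for word-level algorithms
# such as spelling correction or FastText lookups.
print(tok.tokenize("Ben İstanbul'a gittim.."))
```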
Addendum: an example of tokenization.
"Ben İstanbul'a gittim.." => ["Ben", "İstanbul"," ' ", "a", "gittim" , ".", "."]