yw0nam closed this issue 3 years ago
The performance of a tokenization algorithm is difficult to judge. The tokenizer was trained on Wikipedia documents with the SentencePiece algorithm and a vocabulary size of 8k, and training takes word frequency and the language-model score into account. I am opening this issue partly because I am unsure how a tokenizer's performance should be measured in the first place. Is there a numerical measure of the KoBERT tokenizer's performance?
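For reference, here is a minimal sketch (not the KoBERT authors' evaluation code) of two simple numeric proxies people often report for a SentencePiece tokenizer: subword fertility (average pieces per whitespace word) and the unknown-piece rate on a held-out corpus. The model and corpus file names are placeholders, not files from this repo.

```python
# Minimal sketch, assuming a trained SentencePiece model and a held-out text file.
# The paths "kobert_8k.model" and "heldout_wiki.txt" are hypothetical placeholders.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="kobert_8k.model")
unk_id = sp.unk_id()

total_words = 0   # whitespace-separated words seen
total_pieces = 0  # subword pieces produced
unk_pieces = 0    # pieces mapped to the unknown id

with open("heldout_wiki.txt", encoding="utf-8") as f:
    for line in f:
        words = line.split()
        if not words:
            continue
        ids = sp.encode(line.strip(), out_type=int)
        total_words += len(words)
        total_pieces += len(ids)
        unk_pieces += sum(1 for i in ids if i == unk_id)

# Lower fertility and a near-zero unk rate are usually taken as better coverage.
print(f"fertility (pieces/word): {total_pieces / total_words:.3f}")
print(f"unk rate: {unk_pieces / max(total_pieces, 1):.4%}")
```

These numbers only describe segmentation behavior; whether a given tokenizer actually helps a downstream model still has to be checked on the downstream task itself.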
Since there has been no answer, I am closing this issue.
Here is my code.
The results don't look great. Are there any ideas for improving them?
Thanks.