alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
MIT License

Idea: Wouldn't it be possible for Tokenmonster to stop when it reaches the ideal vocab size? #20

Closed Calvinnncy97 closed 10 months ago

Calvinnncy97 commented 10 months ago

Just an idea. When I train Tokenmonster on an immensely large dataset, I notice that at a certain small vocab size the workers struggle to remove any more tokens from the vocab. I take that as a sign that the vocab size is already approaching optimal once it reaches that state.

What do you think?

alasdairforsythe commented 10 months ago

The reason the workers struggle to remove tokens is that another worker has already removed the same token. This happens because the workers process in parallel. It occurs as the target vocab size is being reached: by then the different workers are processing almost identical vocabularies with only a few differences, so there is a high probability that multiple threads find the same "worst" token.
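The collision described above can be sketched as follows. This is an illustrative toy, not TokenMonster's actual trainer code: `worstToken` and the scores are hypothetical, and the point is only that two workers scoring nearly identical vocabularies will often select the same lowest-scoring token, making one of the two removals redundant.

```go
package main

import "fmt"

// worstToken returns the token with the lowest score, breaking ties
// lexicographically so the result is deterministic. (Hypothetical helper,
// not from the TokenMonster codebase.)
func worstToken(scores map[string]int) string {
	worst, min := "", int(^uint(0)>>1)
	for tok, s := range scores {
		if s < min || (s == min && tok < worst) {
			worst, min = tok, s
		}
	}
	return worst
}

func main() {
	// Near the target vocab size, two workers see vocabularies that
	// differ by only one token but share the same low-scoring "qz".
	workerA := map[string]int{"the": 900, "ing": 800, "onl": 50, "qz": 3}
	workerB := map[string]int{"the": 900, "ing": 800, "xv": 40, "qz": 3}

	a, b := worstToken(workerA), worstToken(workerB)
	fmt.Println(a, b) // both workers independently pick "qz"
	if a != b {
		panic("expected both workers to collide on the same worst token")
	}
}
```

When both workers report `qz` for removal, the second removal is a no-op, which looks from the outside like the worker "struggling" to remove anything.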

Calvinnncy97 commented 10 months ago

I see. Thank you for explaining!