Closed Calvinnncy97 closed 10 months ago
The workers struggle to remove tokens because another worker has already removed the same token. This happens because the workers process in parallel. It occurs as the target vocab size is being reached: at that point the different workers are evaluating almost identical vocabularies with only a few differences, so there is a high probability that multiple threads select the same "worst" token.
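To illustrate the collision (this is a toy sketch, not TokenMonster's actual Go implementation; the token names and scores are made up), imagine several workers each scanning a near-identical snapshot of the vocabulary and independently nominating the lowest-scoring token for removal:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical token scores. Near the target vocab size, every worker
# sees an almost-identical vocabulary, so their picks coincide.
scores = {"tok_a": 5.0, "tok_b": 1.2, "tok_c": 3.3, "tok_d": 1.25}

def pick_worst(worker_id):
    # Each worker independently scans its snapshot and nominates
    # the lowest-scoring ("worst") token for removal.
    return min(scores, key=scores.get)

with ThreadPoolExecutor(max_workers=4) as pool:
    picks = list(pool.map(pick_worst, range(4)))

# All four workers nominate the same token, so only one removal can
# succeed; the other three find their target already gone.
print(picks)  # every entry is "tok_b"
```

With many tokens and genuinely different snapshots the picks would diverge, which is why the collisions only become frequent near the end of training.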
I see. Thank you for explaining!
Just an idea. When I train Tokenmonster with a very large dataset, I notice that at a certain small vocab size the workers struggle to remove any more tokens from the vocab. I take it as a sign that the vocab size is already approaching optimal when it reaches that state.
What do you think?