alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
MIT License

Inquiry on Extending Algorithm to Other Languages #30

Open dsdanielpark opened 7 months ago

dsdanielpark commented 7 months ago

Impressed by Your Project

Dear alasdairforsythe,

I am genuinely impressed by your wonderful project and appreciate your sharing it. Thank you sincerely.

Inquiry on Documentation and Algorithm

I'm curious to know if there is any simple explanation or documentation about the entire development process of your project.

If not, could you please provide a brief description of the overall algorithm, even if it's very approximate? I am familiar with concepts like BPE, BBPE, unigram, ngram, and word piece, as well as various packages like SentencePiece, TikToken, tokenizers, and transformers. Therefore, feel free to skip any basic information and directly share what improvements you've made, the overall development process, your objectives, and the approaches you took to solve specific problems.

Inquiry on Extending Algorithm to Other Languages

I read on Reddit that your focus was on speed improvements, but I noticed you also reduced the vocab size. Could you elaborate on your overall approach to this?

Additionally, I am curious about where to start with your package to develop an efficient tokenizer for Korean. While I'm considering the BBPE method for creating an efficient Korean vocab, your advanced work in this area has prompted me to reach out for guidance.

Thank you for your time and insights.

Sincerely, Daniel

alasdairforsythe commented 7 months ago

I'll answer briefly: the training algorithm uses brute force to find the optimal set of tokens to represent your chosen dataset, given any specific tokenization algorithm. You can see that it works, because if you run it multiple times you get the same tokens out (give or take a couple that are roughly equal). All the cleverness in my code goes into making it do this quickly enough. So basically, the training process doesn't have an opinion about information theory or compression - I didn't even bother to research that. It just tries everything, and my specialty of micro-optimization means I was able to program it to be fast enough to be usable.
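
To make the "it just tries everything" idea concrete, here is a toy sketch in Go. This is an illustration only, not TokenMonster's actual trainer code: it uses a plain greedy longest-match tokenizer rather than the ungreedy one, and it naively enumerates every candidate vocabulary of a fixed size, scoring each by how many tokens it needs to encode the dataset and keeping the best. The real trainer relies on the micro-optimizations mentioned above rather than naive enumeration.

```go
// Toy illustration of brute-force vocabulary selection (NOT the TokenMonster
// trainer): enumerate every candidate vocabulary of size k, tokenize the
// dataset with each, and keep whichever needs the fewest tokens.
package main

import (
	"fmt"
	"strings"
)

// tokenCount encodes text using only the given tokens (greedy longest match),
// falling back to single bytes, and returns how many tokens were needed.
func tokenCount(text string, vocab map[string]bool) int {
	count := 0
	for len(text) > 0 {
		matched := 1
		maxLen := 8
		if len(text) < maxLen {
			maxLen = len(text)
		}
		for l := maxLen; l >= 2; l-- {
			if vocab[text[:l]] {
				matched = l
				break
			}
		}
		text = text[matched:]
		count++
	}
	return count
}

// bestVocab brute-forces every subset of candidates of size k and returns the
// one that compresses dataset into the fewest tokens, plus that token count.
func bestVocab(dataset string, candidates []string, k int) ([]string, int) {
	var best []string
	bestScore := int(^uint(0) >> 1) // max int
	var recurse func(start int, chosen []string)
	recurse = func(start int, chosen []string) {
		if len(chosen) == k {
			vocab := make(map[string]bool, k)
			for _, t := range chosen {
				vocab[t] = true
			}
			if s := tokenCount(dataset, vocab); s < bestScore {
				bestScore = s
				best = append([]string(nil), chosen...)
			}
			return
		}
		for i := start; i < len(candidates); i++ {
			// copy before appending so branches don't share a backing array
			recurse(i+1, append(append([]string(nil), chosen...), candidates[i]))
		}
	}
	recurse(0, nil)
	return best, bestScore
}

func main() {
	dataset := strings.Repeat("the cat sat on the mat ", 10)
	candidates := []string{"the ", "cat ", "sat ", "on ", "mat ", "t on", "he c"}
	vocab, score := bestVocab(dataset, candidates, 3)
	fmt.Println(vocab, score) // the same input always yields the same vocabulary
}
```

Because the search is exhaustive over the candidates, the result is deterministic for a given dataset, which is why repeated training runs produce (nearly) the same tokens.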

dsdanielpark commented 7 months ago

alasdairforsythe

Thank you for the kind response. I will also try optimizing BBPE and other algorithms and will provide feedback on TokenMonster. If I have any questions while creating the tokenizer, I will be sure to ask. Thank you for this wonderful project.