bminixhofer / zett

Code for Zero-Shot Tokenizer Transfer
https://arxiv.org/abs/2405.07883

A question #3

Closed: noforit closed this issue 1 month ago

noforit commented 1 month ago

Thank you very much for your work. I have a few questions regarding some points in the paper that I found confusing, perhaps due to my own misunderstanding.

Could you please explain how the conversion is performed given a specific target vocabulary and tokenizer? In Algorithm 1, each step dynamically selects a UnigramLM vocabulary and tokenizer. How does this align with the final target configuration? Looking forward to your clarification.

bminixhofer commented 1 month ago

For a specific target tokenizer, we take its vocabulary, convert it to byte level, then run it through the hypernetwork. Since the hypernetwork amortizes over the tokenization function, we don't need the target tokenizer's tokenization function itself, only its vocabulary, so it doesn't matter whether the final target tokenizer is e.g. BPE or UnigramLM. I hope that clarifies things.
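
To illustrate the flow described in the reply (vocabulary, then byte-level conversion, then hypernetwork), here is a minimal sketch. It is not the code from this repository: `to_byte_level`, `toy_hypernetwork`, and `transfer_embeddings` are hypothetical names, the byte-to-character table follows the GPT-2 style byte-level convention as an assumption, and the "hypernetwork" is a deterministic stand-in rather than a trained model. The only point it tries to show is that the embeddings depend on the target vocabulary alone, not on how the target tokenizer segments text.

```python
from typing import Callable, Dict, List


def bytes_to_unicode() -> Dict[int, str]:
    """GPT-2 style table mapping each byte value to a printable character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted into the range starting at U+0100.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(token: str) -> str:
    """Convert a token string to its byte-level representation."""
    return "".join(BYTE_TO_CHAR[b] for b in token.encode("utf-8"))


def toy_hypernetwork(byte_level_token: str, dim: int = 4) -> List[float]:
    """Hypothetical stand-in for a trained hypernetwork.

    Maps one byte-level token to a vector using only the token itself.
    """
    vec = [0.0] * dim
    for ch in byte_level_token:
        for i in range(dim):
            vec[i] += ((ord(ch) * (i + 1)) % 7) / 7.0
    return [v / max(len(byte_level_token), 1) for v in vec]


def transfer_embeddings(
    vocab: List[str],
    hypernetwork: Callable[[str], List[float]] = toy_hypernetwork,
) -> Dict[str, List[float]]:
    """Predict one embedding per target token; no tokenization function is used."""
    return {tok: hypernetwork(to_byte_level(tok)) for tok in vocab}


# Two assumed target vocabularies, e.g. one from BPE and one from UnigramLM.
bpe_vocab = ["hello", " world", "é"]
unigram_vocab = [" world", "é", "hello"]

emb_bpe = transfer_embeddings(bpe_vocab)
emb_uni = transfer_embeddings(unigram_vocab)
```

In this sketch, a token shared by both vocabularies receives the same embedding regardless of which tokenizer type the vocabulary came from, which mirrors the statement that the tokenizer type does not matter at transfer time.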