Emu3 is a nice piece of work, but I have a question about it.
The vocabulary size of Qwen is 152064, while the codebook size of the vision tokenizer employed in Emu3 is 32768.
Their sum is 184832, but the vocabulary size reported in Emu3 is 184622.
Why do the numbers not match?
We use the vocab.json from Qwen2, which has 151643 tokens, plus 32768 vision tokens, 205 extra tokens, and 6 special tokens, making the total vocabulary size 184622.
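A quick sanity check of the arithmetic (the component counts follow the answer above; the reading that 152064 is Qwen2's configured/padded embedding size rather than the vocab.json entry count is my assumption):

```python
# Verify the vocabulary breakdown given in the answer above.
qwen2_vocab = 151643   # entries in Qwen2's vocab.json
vision_tokens = 32768  # Emu3 vision tokenizer codebook size
extra_tokens = 205     # extra tokens
special_tokens = 6     # special tokens

total = qwen2_vocab + vision_tokens + extra_tokens + special_tokens
print(total)  # 184622, matching the size reported in Emu3

# The 184832 in the question comes from using 152064 instead,
# which (assumption) is Qwen2's padded config vocab size, not
# the number of entries actually stored in vocab.json.
print(152064 + 32768)  # 184832
```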