Tokenizer Training Mismatch

Ucas-HaoranWei / Vary

[ECCV 2024] Official code implementation of Vary: Scaling Up the Vision Vocabulary of Large Vision Language Models.


Tokenizer Training Mismatch #79

Closed: dlin511 closed this issue 7 months ago

dlin511 commented 7 months ago

When loaded from Hugging Face with the AutoTokenizer class (as the training script does), the Qwen tokenizer has a length of 151851.

However, the tokenizer shipped with your model has a final length of 151860, a difference of 9 tokens. For training, should we instead initialize the custom QwenTokenizer class and load the qwen.tiktoken file, or do we need to add the 9 special tokens manually?
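For reference, the size of the gap can be checked with a small sketch like the one below. It uses only the two lengths reported in this issue; `vocab_gap` is a hypothetical helper written for illustration and is not part of the Vary or Qwen code.

```python
def vocab_gap(base_len: int, target_len: int) -> int:
    """Return how many tokens separate a base vocabulary from a target one."""
    if target_len < base_len:
        raise ValueError("target vocabulary is smaller than the base vocabulary")
    return target_len - base_len


# Lengths as reported in this issue (not re-measured here).
HF_QWEN_LEN = 151851         # tokenizer loaded via AutoTokenizer
VARY_TOKENIZER_LEN = 151860  # tokenizer of the released model

gap = vocab_gap(HF_QWEN_LEN, VARY_TOKENIZER_LEN)
print(gap)  # 9
```

With real tokenizer objects, the same check would presumably take `len(tokenizer)` for each side, assuming both tokenizers implement `__len__`.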