QwenLM / Qwen2

Qwen2 is the large language model series developed by the Qwen team at Alibaba Cloud.

About the vocabulary inconsistency #885

Open · patrick-tssn opened 2 weeks ago

patrick-tssn commented 2 weeks ago
> For tokenizers in `transformers`, by convention, `tokenizer.vocab_size` [as documented](https://github.com/huggingface/transformers/blob/092f1fdaa4224fdd88c616dc9678e6fcb37bfffd/src/transformers/tokenization_utils.py#L378-L383) is the size of the base vocabulary (without the added tokens). To get the actual vocabulary size, you need to use `len(tokenizer)`, which is 151646 for Qwen1.5 models.
>
> The `vocab_size` in `config.json` is the number of embeddings, which can be larger than the actual vocabulary size because of optimizations for GPU computation and other considerations: 152064 is divisible by 256 and 151936 is divisible by 128.

Originally posted by @jklj077 in https://github.com/QwenLM/Qwen2/issues/147#issuecomment-1988398254
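
For reference, a minimal sketch of how the three numbers can be inspected (the model id `Qwen/Qwen1.5-7B-Chat` is only an example; any Qwen1.5 checkpoint works the same way):

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "Qwen/Qwen1.5-7B-Chat"  # example checkpoint, not prescribed by the thread

tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

print(tokenizer.vocab_size)          # base vocabulary, without the added tokens
print(len(tokenizer))                # actual vocabulary size: 151646 for Qwen1.5
print(config.vocab_size)             # number of embedding rows, padded for GPU efficiency
print(config.vocab_size % 128 == 0)  # padded sizes are multiples of 128 or 256
```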

patrick-tssn commented 2 weeks ago

What should I do when a generated token index is not included in the vocabulary?

jklj077 commented 2 weeks ago

Hi, if that happens, e.g., 151647 is sampled, then the model or the inference stack should be considered broken, because those tokens are never seen in training. We haven't encountered that before. Could you please share steps to reproduce?
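
If it helps to narrow things down, a quick check like the following can flag sampled ids that fall outside the actual vocabulary. This is only a sketch: `generated_ids` is assumed to be the tensor returned by `model.generate(...)`, and `tokenizer` the Qwen1.5 tokenizer loaded as above.

```python
# Sanity check for out-of-vocabulary ids in generated output (assumed setup above).
actual_vocab_size = len(tokenizer)  # 151646 for Qwen1.5

out_of_vocab = [
    tid for tid in generated_ids.flatten().tolist()
    if tid >= actual_vocab_size
]

if out_of_vocab:
    # Ids beyond the actual vocabulary were never seen in training,
    # which points to a broken model or inference setup.
    print("Out-of-vocabulary ids sampled:", out_of_vocab)
```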