QwenLM / Qwen2.5

Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud.

About the vocabulary inconsistency #885

Closed patrick-tssn closed 1 month ago

patrick-tssn commented 2 months ago
For tokenizers in `transformers`, by convention, `tokenizer.vocab_size` [as documented](https://github.com/huggingface/transformers/blob/092f1fdaa4224fdd88c616dc9678e6fcb37bfffd/src/transformers/tokenization_utils.py#L378-L383) is the size of the base vocabulary (without the added tokens). To get the actual vocabulary size, you need to use `len(tokenizer)`, which is 151646 for Qwen1.5 models.

The `vocab_size` in `config.json` is the number of embeddings, which can be larger than the actual vocabulary size because of optimizations for GPU computation and other considerations: 152064 is divisible by 256 and 151936 is divisible by 128.

Originally posted by @jklj077 in https://github.com/QwenLM/Qwen2/issues/147#issuecomment-1988398254
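For reference, here is a minimal sketch (not from the original thread) of how to inspect the three numbers discussed above, assuming the Hugging Face `transformers` library; the checkpoint name is just an example.

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "Qwen/Qwen1.5-7B"  # example Qwen1.5 checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

print(tokenizer.vocab_size)  # base vocabulary, without added tokens
print(len(tokenizer))        # full vocabulary including added tokens: 151646
print(config.vocab_size)     # number of embedding rows, padded for GPU efficiency
```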

patrick-tssn commented 2 months ago

What should we do when a generated token index is not included in the vocabulary?

jklj077 commented 2 months ago

Hi, if that happens, e.g., 151647 is sampled, then the model or the inference code should be considered broken, because those tokens are never seen in training. We haven't encountered that before. Could you please share the steps to reproduce?
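As an illustration, a hedged sketch (not from the thread) of how one might check whether any sampled token IDs fall outside the actual vocabulary; the checkpoint name and generation parameters are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-0.5B"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Hello", return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=20, do_sample=True)

# The logits cover config.vocab_size rows, but only IDs below len(tokenizer)
# correspond to real tokens seen in training.
vocab_limit = len(tokenizer)  # 151646 for Qwen1.5
out_of_vocab = generated_ids[generated_ids >= vocab_limit]
if out_of_vocab.numel() > 0:
    # If this triggers, the model or the inference setup is likely broken.
    print("Out-of-vocabulary token IDs sampled:", out_of_vocab.tolist())
else:
    print("All generated token IDs are within the vocabulary.")
```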

github-actions[bot] commented 1 month ago

This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.