What to do when the generated token index is not included in the vocabulary?
Hi, if that happens, e.g., 151647 is sampled, then the model or the inference stack should be considered broken, because those tokens are never seen in training. We haven't encountered this before. Could you please share steps to reproduce?
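For illustration, here is a minimal sketch (not from this thread) of how one might check sampled ids against the tokenizer; the checkpoint name and the example ids are assumptions:

```python
# Sketch (assumptions: checkpoint name, example ids). Flags sampled ids
# that the tokenizer never maps to a token, which would indicate a broken
# model or inference stack as described above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

def report_unknown_ids(token_ids):
    for tid in token_ids:
        # convert_ids_to_tokens returns None for ids with no token assigned
        token = tokenizer.convert_ids_to_tokens(tid)
        if token is None:
            print(f"id {tid}: no token in the vocabulary -- likely broken decoding")
        else:
            print(f"id {tid}: {token!r}")

report_unknown_ids([151643, 999999])  # example ids; 999999 is clearly out of range
```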
The `vocab_size` in `config.json` is the number of embeddings, which can be larger than the actual vocabulary size because of optimizations for GPU computation and other considerations: 152064 is divisible by 256 and 151936 is divisible by 128.

Originally posted by @jklj077 in https://github.com/QwenLM/Qwen2/issues/147#issuecomment-1988398254
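As a quick illustration of this mismatch (a sketch assuming a Qwen2 checkpoint; the model name is not from the thread), one can compare the embedding count in the config against the tokenizer's actual vocabulary:

```python
# Sketch (assumption: checkpoint name). config.vocab_size counts embedding
# rows, which may exceed the tokenizer's actual vocabulary so the matrix
# size divides nicely for GPU kernels.
from transformers import AutoConfig, AutoTokenizer

name = "Qwen/Qwen2-7B-Instruct"  # assumed checkpoint name
config = AutoConfig.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

print("embedding rows (config.vocab_size):", config.vocab_size)
print("actual vocabulary size (len(tokenizer)):", len(tokenizer))
print("rows divisible by 128:", config.vocab_size % 128 == 0)
```

Ids in the padded range (at or above the actual vocabulary size but below `config.vocab_size`) correspond to embedding rows that were never trained, which is why sampling one of them signals a broken setup.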