I periodically encounter infinite generation with Qwen 2.5 7B Coder under FP8 quantization when feeding long inputs (roughly 20k+ characters) into the context.
I'm looking at the model's configs:
https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct/blob/main/config.json
https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct/blob/main/generation_config.json
https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct/blob/main/tokenizer_config.json
Overall, the three files seem consistent with one another.
But I have a question. In config.json, "bos_token_id" is 151643, which the tokenizer maps to "<|endoftext|>", and "eos_token_id" is 151645, which maps to "<|im_end|>". In generation_config.json, however, "bos_token_id" is 151643 ("<|endoftext|>"), "pad_token_id" is 151643 ("<|endoftext|>"), and "eos_token_id" is [151645, 151643], a list of the two tokens that were previously the eos and bos tokens: "<|im_end|>" and "<|endoftext|>". Finally, tokenizer_config.json has:
"bos_token": null, "eos_token": "<|im_end|>", "pad_token": "<|endoftext|>",
where the bos token should probably be 151644 ("<|im_start|>") explicitly, rather than 151643 ("<|endoftext|>").
In short, these three configs have completely confused me.
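For what it's worth, here is a minimal sketch of mine (not from the configs themselves; it assumes the transformers library and access to the Hugging Face Hub) that checks what the shipped tokenizer actually maps these ids to:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# Map the ids cited in the three configs back to their token strings.
for token_id in (151643, 151644, 151645):
    print(token_id, "->", tok.convert_ids_to_tokens(token_id))
# 151643 -> <|endoftext|>
# 151644 -> <|im_start|>
# 151645 -> <|im_end|>

# The special-token attributes the tokenizer itself exposes:
print("bos:", tok.bos_token, "eos:", tok.eos_token, "pad:", tok.pad_token)
# bos: None eos: <|im_end|> pad: <|endoftext|>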
Hmm, I also found this "Important" note in the README at https://github.com/QwenLM/Qwen2.5-Coder:
We have updated both the special tokens and their corresponding token ids to maintain consistency with Qwen2.5. The new special tokens are as follows:
{
  "<|fim_prefix|>": 151659,
  "<|fim_middle|>": 151660,
  "<|fim_suffix|>": 151661,
  "<|fim_pad|>": 151662,
  "<|repo_name|>": 151663,
  "<|file_sep|>": 151664,
  "<|im_start|>": 151644,
  "<|im_end|>": 151645
}
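As a quick sanity check (my own, not part of the README, and again assuming transformers is installed), one can compare this table against the tokenizer's actual vocabulary:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")
readme_ids = {
    "<|fim_prefix|>": 151659, "<|fim_middle|>": 151660,
    "<|fim_suffix|>": 151661, "<|fim_pad|>": 151662,
    "<|repo_name|>": 151663, "<|file_sep|>": 151664,
    "<|im_start|>": 151644, "<|im_end|>": 151645,
}
for token, token_id in readme_ids.items():
    # convert_tokens_to_ids returns the vocabulary id for a token string
    assert tok.convert_tokens_to_ids(token) == token_id, token
print("all README special-token ids match the tokenizer")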
How should I properly modify config.json, generation_config.json, and tokenizer_config.json?
You don't need to modify these configuration files; we've set them up correctly. If you encounter any bad cases, such as infinite generation, please attach them here, and we might be able to assist you.
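For reference, a minimal sketch (assuming the transformers library) showing that the shipped generation config already lists both stop tokens, so generate() halts on either <|im_end|> or <|endoftext|> without any edits:

from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")
print(gen_cfg.eos_token_id)  # [151645, 151643]
print(gen_cfg.pad_token_id)  # 151643

If your serving stack does not read generation_config.json, passing the stop ids explicitly, e.g. model.generate(..., eos_token_id=[151645, 151643]), is a reasonable workaround; eos_token_id is a standard parameter of transformers' generate().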