tonyaw opened this issue 6 months ago
Same question. It's not clear why the pad token is the same as the eos token in the 33b base model (https://huggingface.co/deepseek-ai/deepseek-coder-33b-base/blob/main/tokenizer_config.json) but different in the instruct models (https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct/blob/main/tokenizer_config.json).
Code snippet from the instruct model's tokenizer_config.json:

```json
"eos_token": {
  "type": "AddedToken",
  "content": "<|EOT|>",
  "lstrip": false,
  "normalized": true,
  "rstrip": false,
  "single_word": false
},
"legacy": true,
"model_max_length": 16384,
"pad_token": {
  "type": "AddedToken",
  "content": "<|end▁of▁sentence|>",
  "lstrip": false,
  "normalized": true,
  "rstrip": false,
  "single_word": false
}
```
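To make the discrepancy concrete, here is a minimal sketch that compares the two configs' special tokens, using only the `content` values quoted above (the dicts are hand-copied excerpts, not loaded from the actual files):

```python
# Special-token excerpts from the two tokenizer_config.json files,
# copied by hand from the HuggingFace repos quoted above.
base_config = {
    "eos_token": "<|end▁of▁sentence|>",
    "pad_token": "<|end▁of▁sentence|>",  # same as eos in the base model
}
instruct_config = {
    "eos_token": "<|EOT|>",              # different in the instruct model
    "pad_token": "<|end▁of▁sentence|>",
}

# The base model's pad and eos coincide; the instruct model's do not.
print(base_config["eos_token"] == base_config["pad_token"])      # True
print(instruct_config["eos_token"] == instruct_config["pad_token"])  # False
```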
Related question: if the EOS and PAD tokens are the same, how does the FIM model learn to stop generation? Pad positions are masked out of the loss, so the model would never learn to predict the EOS token, since it is always masked.
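The concern above can be illustrated with a sketch of naive label masking as commonly done for causal-LM training (`-100` is the ignore index used by e.g. PyTorch's cross-entropy). The token ids here are made up for illustration; they are not the real DeepSeek vocab ids:

```python
# Ignore index conventionally used to drop positions from the loss
# (matches torch.nn.CrossEntropyLoss's default ignore_index).
IGNORE_INDEX = -100

def mask_pad_labels(token_ids, pad_id):
    """Return labels with every pad position excluded from the loss."""
    return [IGNORE_INDEX if t == pad_id else t for t in token_ids]

# Hypothetical ids for illustration only.
EOS, PAD = 2, 3

# Distinct pad token: the final EOS survives in the labels,
# so the model gets a training signal to stop.
seq = [5, 6, 7, EOS, PAD, PAD]
print(mask_pad_labels(seq, PAD))   # EOS (2) is kept

# pad == eos: naive masking removes the real EOS along with the
# padding, so nothing ever teaches the model to emit EOS.
seq_shared = [5, 6, 7, EOS, EOS, EOS]
print(mask_pad_labels(seq_shared, EOS))  # all EOS positions masked
```

In practice this is avoided by masking via the attention mask / collator rather than blindly masking every occurrence of the pad id, or by keeping the first EOS unmasked, which is presumably why this setup can still work.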
Dear experts, I found that there are two pad tokens in deepseek-coder. What's the difference between them? When I need a pad token, which one should I use?
Also, why is the second pad token the same as token 32014? I assume this is on purpose. Could you please explain the reason?