deepseek-ai / DeepSeek-Coder

DeepSeek Coder: Let the Code Write Itself
https://coder.deepseek.com/
MIT License

What's the pad token for deepseek-coder #90

Open tonyaw opened 6 months ago

tonyaw commented 6 months ago

Dear experts, I found that there are two pad tokens in deepseek-coder. What is the difference between them? When I need a pad token, which one should I use?

Also, why is the second pad token the same as token 32014? I assume this is on purpose. Could you please explain the reason?

    {
      "id": 32013,
      "content": "<|begin▁of▁sentence|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": true
    },
    {
      "id": 32014,
      "content": "<|end▁of▁sentence|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": true
    },

netrookiecn commented 6 months ago

Same question. It is not clear to me why the pad token is the same as the EOS token in the 33b base model (https://huggingface.co/deepseek-ai/deepseek-coder-33b-base/blob/main/tokenizer_config.json), while the pad token and the EOS token differ in the instruct models (https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct/blob/main/tokenizer_config.json).

Code snippet from the instruct model:

    "eos_token": {
      "type": "AddedToken",
      "content": "<|EOT|>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false
    },
    "legacy": true,
    "model_max_length": 16384,
    "pad_token": {
      "type": "AddedToken",
      "content": "<|end▁of▁sentence|>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false
    }
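For anyone who wants to check this on another checkpoint, here is a minimal sketch that compares the two entries of a tokenizer_config.json. It only uses the standard library and the instruct-model fragment quoted above (wrapped in braces so that it parses). The helper names `token_content` and `pad_equals_eos` are made up for illustration and are not part of any library.

```python
import json

# Fragment copied from the instruct model's tokenizer_config.json quoted above,
# wrapped in braces so that it parses as a standalone JSON object.
CONFIG_FRAGMENT = """
{
  "eos_token": {"type": "AddedToken", "content": "<|EOT|>", "lstrip": false,
                "normalized": true, "rstrip": false, "single_word": false},
  "legacy": true,
  "model_max_length": 16384,
  "pad_token": {"type": "AddedToken", "content": "<|end▁of▁sentence|>", "lstrip": false,
                "normalized": true, "rstrip": false, "single_word": false}
}
"""


def token_content(entry):
    """Return the token string for a plain-string or AddedToken-style dict entry."""
    if isinstance(entry, dict):
        return entry["content"]
    return entry


def pad_equals_eos(config):
    """Return True when pad_token and eos_token map to the same string."""
    return token_content(config["pad_token"]) == token_content(config["eos_token"])


config = json.loads(CONFIG_FRAGMENT)
print("eos:", token_content(config["eos_token"]))
print("pad:", token_content(config["pad_token"]))
print("pad == eos:", pad_equals_eos(config))
```

For the instruct fragment this reports that the two tokens differ. Running the same check against the base model's config file should show whether pad and EOS coincide there, as described above.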

zhzhang commented 1 month ago

Related question: how does the FIM model learn to stop generating when the EOS and PAD tokens are the same? If every PAD position is masked out of the loss, the EOS token is masked as well, so the model would never learn to predict it.
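To make the concern concrete, here is a small sketch of the two ways labels can be masked when PAD and EOS share an id. This is not a description of how DeepSeek trained the model, only an illustration of one common workaround: mask by attention mask (built from the true sequence length) rather than by token id. The id 32014 comes from the tokenizer entry quoted earlier in the thread; the other token ids and both function names are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss ignores by default

EOS_ID = 32014       # <|end▁of▁sentence|>, per the tokenizer entry quoted in this thread
PAD_ID = EOS_ID      # the situation being asked about: PAD and EOS are the same token


def labels_masked_by_id(input_ids, pad_id):
    """Mask every position whose token id equals pad_id (this also hides a real EOS)."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]


def labels_masked_by_attention(input_ids, attention_mask):
    """Mask only positions the attention mask marks as padding (mask value 0)."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]


# Hypothetical sequence: three content tokens, one real EOS, then two padding slots.
input_ids = [100, 200, 300, EOS_ID, PAD_ID, PAD_ID]
# The mask is built from the true length (4), not by comparing ids to PAD_ID.
attention_mask = [1, 1, 1, 1, 0, 0]

print(labels_masked_by_id(input_ids, PAD_ID))
# [100, 200, 300, -100, -100, -100]   (the EOS target is lost)
print(labels_masked_by_attention(input_ids, attention_mask))
# [100, 200, 300, 32014, -100, -100]  (the EOS target is kept)
```

With the second approach the real EOS still contributes to the loss, so sharing the id with PAD does not by itself prevent the model from learning to stop. Whether a given training pipeline masks by id or by attention mask depends on its data collator.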