deepseek-ai / DeepSeek-Coder

DeepSeek Coder: Let the Code Write Itself
https://coder.deepseek.com/
MIT License

clarification on the sentinel token format #147

Closed Zane-XY closed 5 months ago

Zane-XY commented 6 months ago

In the paper, I noticed that the token used is <｜fim_start｜>. Note that the ｜ character is not the ASCII |, while the underscore _ is an ASCII character. However, in the GitHub repo readme.md, the underscore is replaced by ▁, as seen in <｜fim▁begin｜>. During my experimentation with Ollama, the use of ▁ resulted in encoding errors.
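
For anyone comparing the two spellings, here is a minimal Python sketch that spells the sentinel tokens with explicit Unicode escapes, so nothing depends on how an editor or terminal renders them. It assumes the README form is the one the tokenizer expects, with U+FF5C (fullwidth vertical line) for the bar and U+2581 (lower one eighth block) for the separator. The `hole` and `end` token names and the prefix/hole/suffix ordering follow the repo README rather than this issue, and the helper names `describe_non_ascii` and `build_fim_prompt` are made up for illustration.

```python
import unicodedata

# Sentinel tokens written with escapes (assumption: README spelling is canonical).
# \uff5c = FULLWIDTH VERTICAL LINE, \u2581 = LOWER ONE EIGHTH BLOCK
FIM_BEGIN = "<\uff5cfim\u2581begin\uff5c>"
FIM_HOLE = "<\uff5cfim\u2581hole\uff5c>"   # assumed from the README, not from this issue
FIM_END = "<\uff5cfim\u2581end\uff5c>"     # assumed from the README, not from this issue


def describe_non_ascii(token: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for every non-ASCII character in token."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in token
        if ord(ch) > 127
    ]


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Hypothetical helper: wrap a prefix and suffix in the FIM sentinel tokens."""
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}"


if __name__ == "__main__":
    for code_point, name in describe_non_ascii(FIM_BEGIN):
        print(code_point, name)
    # Encode explicitly as UTF-8 before writing the prompt to a file or request body.
    payload = build_fim_prompt("def add(a, b):\n    ", "\n").encode("utf-8")
    print(len(payload), "bytes")
```

If the token is pasted with an ASCII `|` or `_` instead, the string no longer matches the special token and will likely be tokenized as ordinary text. Encoding errors are also a possible sign that the prompt was written or sent in something other than UTF-8.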