Hi there,
According to the README, DeepSeek-Coder is trained on 2T tokens, of which 87% is code, so there should be around 1.74T code tokens in the pretraining data. However, the tech report says the code training data has a volume of 798 GB.
Assuming roughly 4 bytes per token on average, 798 GB corresponds to only about 200G (0.2T) tokens of code, which is far from the ~1.74T code tokens implied by the 2T-token pretraining total.
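For clarity, here is the back-of-envelope calculation I am doing; the 4 bytes/token figure is just my own rough assumption, not a number from the report:

```python
# Rough sanity check of the reported numbers (assumes ~4 bytes per token on average).
code_volume_gb = 798                     # code data volume stated in the tech report
bytes_per_token = 4                      # assumed average bytes per token
code_tokens_b = code_volume_gb / bytes_per_token       # ~= 199.5 billion tokens

total_tokens_b = 2000                    # 2T total pretraining tokens from the README
expected_code_tokens_b = 0.87 * total_tokens_b         # ~= 1740 billion code tokens

print(code_tokens_b, expected_code_tokens_b)           # 199.5 vs 1740.0
```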
I'm not sure which part is wrong. Can you explain?
Yours, Justin