Hi there,
According to the README, DeepSeek-Coder is trained on 2T tokens, of which 87% is code, so there should be around 1.74T code tokens in the pretraining data. However, the tech report says the code training data has a volume of 798 GB.
Assuming roughly 4 bytes per token on average, 798 GB corresponds to only about 200G (0.2T) tokens of code, which is far from the ~1.74T code tokens implied by the 2T-token pretraining total.
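For clarity, here is the back-of-envelope calculation I am doing; the 4 bytes/token figure is just my own rough assumption, not a number from the report:

```python
# Rough sanity check of the reported numbers (assumes ~4 bytes per token on average).
code_volume_gb = 798                     # code data volume stated in the tech report
bytes_per_token = 4                      # assumed average bytes per token
code_tokens_b = code_volume_gb / bytes_per_token       # ~= 199.5 billion tokens

total_tokens_b = 2000                    # 2T total pretraining tokens from the README
expected_code_tokens_b = 0.87 * total_tokens_b         # ~= 1740 billion code tokens

print(code_tokens_b, expected_code_tokens_b)           # 199.5 vs 1740.0
```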
I'm not sure which part is wrong. Can you explain?
Yours, Justin