deepseek-ai / DeepSeek-Coder

DeepSeek Coder: Let the Code Write Itself
https://coder.deepseek.com/
MIT License

Cutoff dates #89

Closed Naman-ntc closed 9 months ago

Naman-ntc commented 9 months ago

Hi DeepSeek team, thank you for releasing the amazing DeepSeek models. I am working on LLM evaluations, and the DeepSeek models lead the open-source models (and even quite a few closed-source models).

While constructing problems from recently released content (LeetCode, GitHub), I wanted to check with you whether there are any official cutoff dates claimed for the models. I also realize that cutoff dates might vary across data sources (competition websites, GitHub), possibly because of a gap between pre-training and instruction tuning, and would love to get some clarity in this regard!

Finally, I also wanted to check if there are any plans for releasing more details about the training dataset and sources at some point in a technical report!

pkuzqh commented 9 months ago

Thanks for your interest. The cutoff date of the DeepSeek Coder models is March 2023.

Naman-ntc commented 9 months ago

Thanks! Is it for the base models or also for the instruct models?

pkuzqh commented 9 months ago

Both.

Naman-ntc commented 8 months ago

Hi, we have found potential data contamination in LeetCode problems released in May-July. Could instruction tuning lead to a later cutoff date? Specifically, we measured the performance of DeepSeek on LeetCode problems grouped by release month and observed a sharp dip in performance after July/August. DeepSeek still outperforms various closed models (on problems released after August), but I wanted to get some clarity on this behavior.
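
For anyone who wants to run a similar check, here is a minimal sketch of the kind of analysis described above: group problems by release month, compute the pass rate per month, and look for the largest month-over-month drop. This is not our actual evaluation harness; the function names `pass_rate_by_month` and `largest_drop` and the `(release_date, passed)` input format are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def pass_rate_by_month(results):
    """Map (year, month) -> fraction of problems solved.

    `results` is an iterable of (release_date, passed) pairs, where
    release_date is a datetime.date and passed is a bool. This input
    format is hypothetical, chosen only for this sketch.
    """
    buckets = defaultdict(lambda: [0, 0])  # (year, month) -> [solved, total]
    for released, passed in results:
        key = (released.year, released.month)
        buckets[key][0] += int(passed)
        buckets[key][1] += 1
    # Sort keys so the resulting dict is in chronological order.
    return {key: solved / total for key, (solved, total) in sorted(buckets.items())}


def largest_drop(rates):
    """Return (month, drop) for the largest decrease between consecutive months.

    `rates` must be in chronological order, as returned by
    pass_rate_by_month. Returns None if there are fewer than two months.
    """
    months = list(rates)
    best = None
    for prev, cur in zip(months, months[1:]):
        drop = rates[prev] - rates[cur]
        if best is None or drop > best[1]:
            best = (cur, drop)
    return best
```

A sharp drop right after a particular month is only suggestive of a cutoff or contamination boundary, since problem difficulty can also shift from month to month, so it helps to compare against other models on the same buckets.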