deepseek-ai / DeepSeek-Coder-V2

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

Reward model in the reinforcement learning process #16

Open bin123apple opened 1 week ago

bin123apple commented 1 week ago

Hello DeepSeek Team, thanks for your great work!

I fine-tuned your previous DeepSeek-Coder 33B model and obtained a model that performs well on the HumanEval benchmark: https://github.com/bin123apple/AutoCoder. However, when tested on the HumanEval+ benchmark, the new model's performance falls short.

I suspect this is because, for the data entries with execution feedback in my dataset, I only covered a small number of test cases. I also noticed that your paper mentions the reward model is trained on data provided by the compiler.

[Screenshot: excerpt from the paper describing reward model training with compiler feedback]
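To make concrete what I mean, here is a minimal sketch (in Python) of how compiler and test-case feedback could be combined into a scalar reward label. The function name `execution_reward` and the scoring scheme (0 for code that fails to compile, otherwise the fraction of appended test assertions that pass) are my own illustrative assumptions, not something taken from the paper:

```python
import os
import subprocess
import sys
import tempfile

def execution_reward(code: str, test_cases: list[str], timeout: float = 10.0) -> float:
    """Hypothetical reward label from execution feedback.

    Returns 0.0 if the sample does not even compile, otherwise the
    fraction of test cases that run without error.
    """
    # Stage 1: "compiler" signal -- does the sample parse at all?
    try:
        compile(code, "<sample>", "exec")
    except SyntaxError:
        return 0.0

    # Stage 2: test-case signal -- run the sample with each test appended.
    passed = 0
    for test in test_cases:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code + "\n" + test + "\n")
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path], capture_output=True, timeout=timeout
            )
            if result.returncode == 0:
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # treat hangs as failures
        finally:
            os.unlink(path)
    return passed / len(test_cases) if test_cases else 1.0
```

With only a handful of test cases per entry, this fractional signal is very coarse, which is why I am asking about your setup below.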

Is it possible for you to disclose whether the data used to train the reward model included test cases, or whether it only required the code to pass the compiler? If test cases were included, could you share roughly how many test cases each data entry typically contains?

Thanks again for your great work!