deepseek-ai / DeepSeek-Coder-V2

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

Reward model in the reinforcement learning process #16

Open bin123apple opened 1 week ago

bin123apple commented 1 week ago

Hello DeepSeek Team, thanks for your great work!

I fine-tuned your previous DeepSeek-Coder 33B model and obtained a model that performs well on the HumanEval benchmark: https://github.com/bin123apple/AutoCoder. However, when tested on the HumanEval+ benchmark, the new model's performance falls short.

I suspect this is because, for the data entries with execution feedback in my dataset, I only covered a small number of test cases. I also noticed that your paper mentions the reward model is trained on data provided by the compiler.

[Screenshot: excerpt from the paper describing reward model training with compiler feedback]
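To make concrete what I mean, here is a minimal sketch (in Python) of how compiler and test-case feedback could be combined into a scalar reward label. The function name `execution_reward` and the scoring scheme (0 for code that fails to compile, otherwise the fraction of appended test assertions that pass) are my own illustrative assumptions, not something taken from the paper:

```python
import os
import subprocess
import sys
import tempfile

def execution_reward(code: str, test_cases: list[str], timeout: float = 10.0) -> float:
    """Hypothetical reward label from execution feedback.

    Returns 0.0 if the sample does not even compile, otherwise the
    fraction of test cases that run without error.
    """
    # Stage 1: "compiler" signal -- does the sample parse at all?
    try:
        compile(code, "<sample>", "exec")
    except SyntaxError:
        return 0.0

    # Stage 2: test-case signal -- run the sample with each test appended.
    passed = 0
    for test in test_cases:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code + "\n" + test + "\n")
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path], capture_output=True, timeout=timeout
            )
            if result.returncode == 0:
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # treat hangs as failures
        finally:
            os.unlink(path)
    return passed / len(test_cases) if test_cases else 1.0
```

With only a handful of test cases per entry, this fractional signal is very coarse, which is why I am asking about your setup below.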

Is it possible for you to disclose whether the data used to train the reward model included test cases, or whether it only required the code to pass the compiler? If test cases were included, could you share roughly how many test cases each data entry typically contains?

Thanks again for your great work!