liuqi8827 opened this issue 6 months ago:
"Eurus-7b-kto" performs poorly in coding.
bash run.sh
/Eurus/eval/result/mbpp/result.txt
. It shows:
{'accuracy': 0.0, 'exec_error': 0.0, 'format_error': 100.0}
However, Table 3 of the paper reports much better numbers, so result.txt does not match the paper's performance.
The same problem appears in Eurus/eval/result/leetcode/samples.jsonl and /Eurus/eval/result/human_eval/samples.jsonl.
Can you tell me how to reproduce the performance reported in Table 3 of the paper?

A maintainer replied:

Hi,
Thanks for your interest and sorry for the trouble.
Re RM: the example on the HF page is outdated because we previously adopted an incorrect template -- it should be the Mistral template ([INST], [/INST]), but we made a typo and used ([INST], [\INST]) on the HF page, which leads to incorrect results. I haven't had time to test it again yet, but your use case as shown above seems correct.
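For concreteness, here is a minimal sketch of scoring with the corrected template. The exact spacing of the template and the scoring interface (AutoModel with trust_remote_code=True returning a scalar reward, per the model card) are assumptions, and the prompt/response pair is a made-up placeholder:

```python
from transformers import AutoTokenizer, AutoModel

# Hypothetical example pair; the template fix is the point here.
prompt = "Write a function that returns the sum of two numbers."
response = "def add(a, b):\n    return a + b"

# Correct Mistral-style template: the closing tag is [/INST], not [\INST].
template = "[INST] {prompt} [/INST] {response}"
text = template.format(prompt=prompt, response=response)

tokenizer = AutoTokenizer.from_pretrained("openbmb/Eurus-RM-7b")
model = AutoModel.from_pretrained("openbmb/Eurus-RM-7b", trust_remote_code=True)

inputs = tokenizer(text, return_tensors="pt")
# Assumption: the custom model class returns a scalar reward per sequence.
reward = model(**inputs).item()
print(reward)
```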
Re evaluation: the previous version of the eval code may have been buggy for some reason. We have updated the code, and it should now reproduce the results.
liuqi8827 commented:

Eurus-RM-7b cannot predict the score correctly. Its output is:

chosen_reward: -626.8788452148438 | rejected_reward: -405.09423828125 | diff: -221.78460693359375
The chosen_reward is smaller than the rejected_reward. However, the example on the model page (https://huggingface.co/openbmb/Eurus-RM-7b) shows:

Output: 47.4404296875
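For reference, a sketch of the kind of pairwise check that produces output in this shape. The chosen/rejected strings here are placeholders, and the scoring call assumes the same AutoModel interface as above:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("openbmb/Eurus-RM-7b")
model = AutoModel.from_pretrained("openbmb/Eurus-RM-7b", trust_remote_code=True)

def reward_of(text):
    # Assumption: the custom model returns a scalar reward for the sequence.
    inputs = tokenizer(text, return_tensors="pt")
    return model(**inputs).item()

# Placeholder pair; in practice these are full [INST] ... [/INST] prompts.
chosen = "[INST] What is 2 + 2? [/INST] 2 + 2 equals 4."
rejected = "[INST] What is 2 + 2? [/INST] 2 + 2 equals 5."

chosen_reward = reward_of(chosen)
rejected_reward = reward_of(rejected)
print(f"chosen_reward: {chosen_reward} | rejected_reward: {rejected_reward} "
      f"| diff: {chosen_reward - rejected_reward}")
```

A correctly trained reward model should assign the chosen response a higher reward than the rejected one, so a large negative diff across many pairs suggests a setup problem (such as the template typo above) rather than noise.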
Can you give me some suggestions?