Hi @Psycoy, thank you for your answer. The link you sent is the closed issue I mentioned. However, in my understanding the proposed solutions are:

- use the correct evaluator, which I did
- try running the full split instead of the hard split, which I also did
- potential prompt improvements: is that the pirate-speak prompt? In issue #4 ("Default SYSTEM_MESSAGE for Llama 3 Instruct is 'You are a pirate chatbot who always responds in pirate speak!'") it is mentioned that no prompt was used for the results given in the leaderboard, so I did not use any prompt.

Did I miss something else? Given all that, what could explain the difference in results?
Hi @jmercat,
As mentioned here, the pirate system prompt is the cause of the score difference (it is the setting on their HF model card). The response here was not very accurate.
Hi @Psycoy, I did try both with and without that prompt and got lower results than the leaderboard in both cases. As mentioned, the results I gave above used no system prompt:

- On the hard split: I got 38.75 instead of 45.6
- On the full split: I got 73.7 instead of 75.0

What prompt was used for the results in the leaderboard? Could you please copy-paste it here? On the model card I only see the pirate prompt; I tried both no system prompt and the pirate prompt and got worse results than the leaderboard in both cases.
Hi @jmercat,
We use exactly the pirate system prompt. As mentioned in the second bullet here, the MixEval-Hard scores may differ from the leaderboard scores because we didn't rerun them, but the MixEval (full) scores won't. What scores do you get with the pirate system prompt?
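For concreteness, here is a minimal sketch of how the pirate system prompt from the model card can be applied (or omitted) when generating Llama 3 Instruct responses with Hugging Face transformers. This is not the MixEval harness itself; the checkpoint ID, the example question, and the generation settings are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# System prompt from the Llama 3 Instruct model card (see issue #4).
SYSTEM_MESSAGE = "You are a pirate chatbot who always responds in pirate speak!"

messages = [
    {"role": "system", "content": SYSTEM_MESSAGE},  # drop this entry to test the no-system-prompt setting
    {"role": "user", "content": "Which element has the atomic number 26?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Running a few benchmark questions through both variants makes it easy to confirm which setting actually produced the responses being judged.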
Hi, I'm opening a new issue because it seems #17 was closed but not resolved, and I have a similar issue. I tried reproducing the llama3-8b-instruct results too and got lower results for both the hard and full eval splits. I'm using gpt-3.5-turbo-0125 as the evaluator.

- On the hard split: I got 38.75 instead of 45.6
- On the full split: I got 73.7 instead of 75.0

I'm using no system prompt (I removed the pirate-speak one). Are there any known issues that could cause this?