Psycoy / MixEval

The official evaluation suite and dynamic data release for MixEval.
https://mixeval.github.io/

Non-reproducible results #35

Closed jmercat closed 2 months ago

jmercat commented 2 months ago

Hi, I'm opening a new issue because it seems #17 was closed but not resolved, and I have a similar issue. I tried reproducing the llama3-8b-instruct results and got lower results on both the hard and full eval splits, using gpt-3.5-turbo-0125 as the evaluator:

- On the hard split: I got 38.75 instead of 45.6
- On the full split: I got 73.7 instead of 75.0

I'm using no system prompt (I removed the pirate-talk one). Are there any known issues that could cause this?
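For reference, the two settings being compared here differ only in whether a system message is prepended to each chat. A minimal sketch of that difference (not MixEval's actual harness; `build_messages` is a hypothetical helper, and the pirate prompt text is an assumption based on the llama3-8b-instruct model card):

```python
def build_messages(user_prompt, system_prompt=None):
    """Build a chat message list, optionally prepending a system prompt."""
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


# Assumed pirate prompt, as shown on the llama3-8b-instruct model card.
PIRATE_PROMPT = "You are a pirate chatbot who always responds in pirate speak!"

# Leaderboard-style setting: pirate system prompt included.
with_pirate = build_messages("Who wrote Moby-Dick?", PIRATE_PROMPT)

# Setting used in the report above: no system prompt at all.
no_system = build_messages("Who wrote Moby-Dick?")
```

Everything else (dataset, evaluator model, decoding parameters) held equal, the thread below narrows the score gap down to this one difference.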

Psycoy commented 2 months ago

Hi @jmercat ,

Please refer to here for the cause.

Thanks!

jmercat commented 2 months ago

Hi @Psycoy, thank you for your answer. The link you sent is the closed issue I mentioned. However, to my understanding, the proposed solutions are:

Psycoy commented 2 months ago

> Hi @Psycoy, thank you for your answer. The link you sent is the closed issue I mentioned. However, to my understanding, the proposed solutions are:
>
> Did I miss something else? Given all that, what could explain the difference in results?

Hi @jmercat ,

As mentioned here, the pirate system prompt (the setting on their HF model card) is the cause of the score difference. The response here was not accurate.

jmercat commented 2 months ago

Hi @Psycoy, I did try both with and without that prompt and got lower results than the leaderboard in both cases. As mentioned above, I did not use any system prompt for the results I gave:

- On the hard split: I got 38.75 instead of 45.6
- On the full split: I got 73.7 instead of 75.0

What system prompt was used for the leaderboard results? Could you please copy-paste it here? On the model card I only see the pirate prompt, and as noted, neither that prompt nor an empty system prompt reproduces the leaderboard scores.

Psycoy commented 2 months ago

Hi @jmercat ,

We use exactly the pirate system prompt. As mentioned in the second bullet here, the MixEval-Hard scores may differ from the leaderboard because we didn't rerun them, but the MixEval scores won't. What scores do you get with the pirate system prompt?