Psycoy / MixEval

The official evaluation suite and dynamic data release for MixEval.
https://mixeval.github.io/

(Non) Reproducible Experiment Results #17

Closed carstendraschner closed 2 months ago

carstendraschner commented 2 months ago

Hi,

I tried to reproduce the experiment results on an A100 while using the Azure OpenAI API with GPT-35-Turbo-1106 as the judge:

For Mistral 7B it was fine; for Llama 3 8B it was 0.39 (mine) vs. 0.46 (yours). Do you have an idea why it is this far off? I also tried the Nous Research version of Llama 3 8B Instruct.

I faced the same issue with some other models. The system prompt had only a very small influence when used.
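
For reference, a minimal sketch of the kind of Azure OpenAI judge call involved, using the official `openai` Python SDK. The endpoint, API version, deployment name, and prompts below are placeholders, not MixEval defaults or exact values from this setup:

```python
# Sketch only (not MixEval code): calling an Azure OpenAI judge deployment
# with the official `openai` SDK. Endpoint, API version, and deployment name
# are placeholders for whatever your Azure resource exposes.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # placeholder API version
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

response = client.chat.completions.create(
    model="gpt-35-turbo-1106",  # Azure deployment name, not the OpenAI model id
    messages=[
        {"role": "system", "content": "You are a strict grader."},  # placeholder judge prompt
        {"role": "user", "content": "Score the following answer ..."},  # placeholder parse request
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
```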

Psycoy commented 2 months ago

Hi,

The score fluctuations come from the following possible reasons:

  1. Our scores are parsed with GPT-35-Turbo-0125 instead of GPT-35-Turbo-1106. Using the same parser model is tested to be stable, while using different parsers is not (see the sketch after this list).
  2. The MixEval-Hard scores are computed directly from the MixEval model results to save budget, since MixEval-Hard is a subset of MixEval. Small score differences may occur because of this (it is equivalent to using different batch sizes and random seeds).
  3. The system prompt used in llama-3-8b-instruct's official model card achieves a higher score. Check here.
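
To illustrate point 1, a minimal sketch of pinning the parser to the exact gpt-3.5-turbo-0125 snapshot with the standard OpenAI SDK. The grading prompt below is a placeholder, not MixEval's actual parser prompt:

```python
# Sketch only: pin the score parser to the exact gpt-3.5-turbo-0125 snapshot
# rather than a floating alias, so parsing behaviour stays comparable across runs.
# The grading prompt below is a placeholder, not MixEval's actual parser prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def parse_score(question: str, target: str, model_answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",  # pinned snapshot used for the reported scores
        messages=[
            {"role": "user", "content": (
                "Given the question, the gold target, and the model answer, "
                "reply with a correctness score between 0 and 1.\n"
                f"Question: {question}\nTarget: {target}\nAnswer: {model_answer}"
            )},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content
```
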
carstendraschner commented 2 months ago

Hi, thanks for your response :) Could you please point directly to the line of the file where I can find the system prompt you are referring to? Thank you very much!

Psycoy commented 2 months ago

You can refer to the "How to use" section of their model card.
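
For concreteness, a minimal sketch of how a system prompt is injected for Llama-3-8B-Instruct via transformers' `apply_chat_template`, in the style of the model card's "How to use" example. The system prompt string and generation settings here are illustrative; swap in whatever prompt you are comparing against:

```python
# Sketch only: prompting Llama-3-8B-Instruct with a system prompt through the
# chat template, following the general pattern of the model card's example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    # Illustrative system prompt; replace with the one you are evaluating.
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=128,
    eos_token_id=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")],
    do_sample=False,  # greedy decoding for easier score comparison
)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```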

jmercat commented 1 week ago

Hi, I have a similar issue. I tried reproducing the llama3-8b-instruct results too and got lower results for both the hard and full eval splits. I'm using gpt-3.5-turbo-0125 as the evaluator.

On the hard split: I got 38.75 instead of 45.6.
On the full split: I got 73.7 instead of 75.0.

I'm using no system prompt (I removed the pirate-talk one). Are there any known issues that could cause this?