EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

SquadV2 results are not reproducible for Llama2-7B #982

Closed gupta-abhay closed 10 months ago

gupta-abhay commented 10 months ago

We are working on replicating the Reading Comprehension tasks from the Llama 2 paper and have not been able to reproduce the results reported for the SQuAD 2.0 task (either 0-shot or few-shot). The paper does not specify any generation settings, so we ran with the defaults.

Here is the command we have run:

python main.py --model hf-causal-experimental --model_args pretrained=meta-llama/Llama-2-7b-hf,use_accelerate=True,dtype=bfloat16 --no_cache --tasks squad2

Here are the 0-shot Exact Match (EM) scores:

Task   | Reported | Reproduced
-------|----------|-----------
Squad2 | 67.2     | 18.6
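
For reference, the "exact" score here is the SQuAD exact-match metric, which compares prediction and gold answer only after light normalization. Below is a minimal sketch of that metric, following the normalization steps used by the official SQuAD 2.0 evaluation script (lowercasing, stripping punctuation and articles, collapsing whitespace); it is illustrative, not the harness's exact code path.

import re
import string

def normalize_answer(s: str) -> str:
    # Lowercase, strip punctuation, remove articles, collapse whitespace.
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> int:
    # For unanswerable SQuAD v2 questions the gold answer is the empty string,
    # so only an empty (normalized) prediction gets credit.
    return int(normalize_answer(prediction) == normalize_answer(gold))

print(exact_match("The Eiffel Tower.", "eiffel tower"))  # 1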
haoyuzhao123 commented 10 months ago

I'm facing the same issue. For llama-2-13b-hf the EM score on Squad2 is 21.8732, and for llama-2-13b-chat-hf it is 9.3574.

StellaAthena commented 10 months ago

LLaMA and LLaMA 2 results are, in general, irreproducible. Both papers use custom prompts and other formatting changes that they do not disclose. We have tried to work with Meta to replicate their work using their custom prompts, but they don't want to disclose them.

The severity of this problem can be seen by comparing the LLaMA 1 results reported in the LLaMA 1 and LLaMA 2 papers: LLaMA results are not even reproducible within Meta.

perlitz commented 9 months ago

@gupta-abhay this might be relevant for you: https://github.com/facebookresearch/llama/issues/867. However, I tried a number of things and could not get above an EM score of 60.

obhalerao97 commented 9 months ago

@StellaAthena @perlitz I have been able to reproduce the EM score for Llama 2 on the SQuAD dataset; carefully follow the instructions mentioned in facebookresearch/llama issue #867.

StellaAthena commented 9 months ago

> @StellaAthena @perlitz I have been able to reproduce the EM score for Llama 2 on the SQuAD dataset; carefully follow the instructions mentioned in facebookresearch/llama issue #867.

That's awesome! Thanks.

changyuying commented 8 months ago

> @StellaAthena @perlitz I have been able to reproduce the EM score for Llama 2 on the SQuAD dataset; carefully follow the instructions mentioned in facebookresearch/llama issue #867.

Hello, can you share the script you used to reproduce this? I tried to reproduce with llama-7b, and the results I obtained were significantly different from those in the paper.

{
  "exact": 44.091636486145035,
  "f1": 50.06427500339087,
  "total": 11873,
  "HasAns_exact": 40.958164642375166,
  "HasAns_f1": 52.92056968880849,
  "HasAns_total": 5928,
  "NoAns_exact": 47.2161480235492,
  "NoAns_f1": 47.2161480235492,
  "NoAns_total": 5945
}
StellaAthena commented 8 months ago

I assume they used a custom prompt and/or scoring method? But +1 to "please share how exactly you did that."

changyuying commented 8 months ago

> I assume they used a custom prompt and/or scoring method? But +1 to "please share how exactly you did that."

I used the InstructGPT prompt format and the demo sample from the llama GitHub project, example_text_completion.py, to test my results.

StellaAthena commented 8 months ago

> @StellaAthena @perlitz I have been able to reproduce the EM score for Llama 2 on the SQuAD dataset; carefully follow the instructions mentioned in facebookresearch/llama issue #867.

Please post the exact code to reproduce their scores.

changyuying commented 8 months ago

> I used the InstructGPT prompt format and the demo sample from the llama GitHub project, example_text_completion.py, to test my results.

> Please post the exact code to reproduce their scores.

I am currently unable to reproduce their scores, and the scores I have obtained are as follows:

{
  "exact": 44.091636486145035,
  "f1": 50.06427500339087,
  "total": 11873,
  "HasAns_exact": 40.958164642375166,
  "HasAns_f1": 52.92056968880849,
  "HasAns_total": 5928,
  "NoAns_exact": 47.2161480235492,
  "NoAns_f1": 47.2161480235492,
  "NoAns_total": 5945
}

The code I am using is this: https://github.com/facebookresearch/llama/blob/main/example_text_completion.py

The prompt format is exactly the same as in the InstructGPT paper: https://arxiv.org/pdf/2203.02155.pdf

And I would like to know what method you used to reproduce the scores in the paper. Thank you.
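
For concreteness, here is a minimal sketch of the kind of zero-shot prompt construction described above. The prompt wording is an assumption (Meta has not published the prompt they used), and it uses the Hugging Face transformers API rather than the llama repo's example_text_completion.py script.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def build_prompt(context: str, question: str) -> str:
    # Instruction-style layout (an assumption, not Meta's prompt). SQuAD v2
    # contains unanswerable questions, so the instruction asks the model to
    # reply "unanswerable" when the passage does not contain the answer.
    return (
        "Answer the question using only the passage below. "
        "If the passage does not contain the answer, reply 'unanswerable'.\n\n"
        f"Passage: {context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

prompt = build_prompt(
    context="The Normans were the people who in the 10th and 11th centuries "
            "gave their name to Normandy, a region in France.",
    question="In what country is Normandy located?",
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
answer = tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(answer.strip().split("\n")[0])  # keep only the first generated line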

RRaphaell commented 1 month ago

@changyuying have you fixed it? I'm trying to test gpt-4o on SQuAD v2 using this command:

lm_eval --model openai-chat-completions --model_args model=gpt-4o --tasks squadv2

It returns an exact score of 0.26. I also tested open-source models like Llama 3 and Mistral NeMo, but the results seem random to me and are far from the paper's results.

Should I change the script or something? The whole reason I'm using this library is to run the evaluation with one command; if I have to change it and write the code myself, I don't get the point of this repo 😅