gupta-abhay closed this issue 10 months ago
Facing the same issue: for llama-2-13b-hf the EM score on SQuAD 2.0 is 21.8732, and for llama-2-13b-chat-hf it is 9.3574.
LLaMA and LLaMA 2 results are, in general, irreproducible. Both papers use custom prompts and other formatting changes that they do not disclose. We have tried to work with Meta to replicate their work using their custom prompts, but they don't want to disclose them.
The severity of this problem can be seen by comparing the LLaMA 1 results reported in the LLaMA 1 and LLaMA 2 papers: LLaMA results are not even reproducible within Meta.
@gupta-abhay this might be relevant for you: https://github.com/facebookresearch/llama/issues/867. However, I tried a bunch of things and could not get above a 60 EM score.
@StellaAthena @perlitz I have been able to reproduce the EM score for Llama 2 on the SQuAD dataset; follow the instructions in facebookresearch/llama issue #867 carefully.
That's awesome! Thanks.
Hello, can you share the script you used? I tried to reproduce with llama-7b, and the results I obtained were significantly different from those in the paper:
```json
{
  "exact": 44.091636486145035,
  "f1": 50.06427500339087,
  "total": 11873,
  "HasAns_exact": 40.958164642375166,
  "HasAns_f1": 52.92056968880849,
  "HasAns_total": 5928,
  "NoAns_exact": 47.2161480235492,
  "NoAns_f1": 47.2161480235492,
  "NoAns_total": 5945
}
```
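For anyone sanity-checking their own outputs, the `exact` and `f1` numbers above come from the standard SQuAD answer normalization. A minimal sketch of that scoring is below; it mirrors the logic of the official SQuAD 2.0 `evaluate-v2.0.py` script but is a reimplementation for illustration, not the official code.

```python
# Minimal sketch of SQuAD-style EM/F1 scoring (reimplemented for
# illustration; the official evaluate-v2.0.py script is the ground truth).
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> float:
    """1.0 iff prediction and gold are identical after normalization."""
    return float(normalize_answer(prediction) == normalize_answer(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_toks = normalize_answer(prediction).split()
    gold_toks = normalize_answer(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

Because normalization strips articles and punctuation, small surface differences in the model's output ("The Eiffel Tower." vs. "eiffel tower") do not hurt EM, but any extra or missing content words do, which is why prompt format matters so much for these scores.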
I assume they used a custom prompt and/or scoring method? But +1 to "please share how exactly you did that."
I used the InstructGPT prompt format and the demo script example_text_completion.py from the llama GitHub project to test my results.
Please post the exact code to reproduce their scores.
I am currently unable to reproduce their scores, and the scores I have obtained are as follows:
```json
{
  "exact": 44.091636486145035,
  "f1": 50.06427500339087,
  "total": 11873,
  "HasAns_exact": 40.958164642375166,
  "HasAns_f1": 52.92056968880849,
  "HasAns_total": 5928,
  "NoAns_exact": 47.2161480235492,
  "NoAns_f1": 47.2161480235492,
  "NoAns_total": 5945
}
```
The code I am using is this: https://github.com/facebookresearch/llama/blob/main/example_text_completion.py
The prompt format is exactly the same as in the InstructGPT paper: https://arxiv.org/pdf/2203.02155.pdf
And I would like to know what method you used to reproduce the scores in the paper. Thank you.
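Since neither paper discloses its template, the exact wording is anyone's guess. As one possible point of comparison, an InstructGPT-style zero-shot SQuAD 2.0 prompt could be built like the sketch below; the instruction text, the "Not in background." abstention string, and the `build_squad_prompt` helper are all assumptions for illustration, not the prompt Meta actually used.

```python
# Hypothetical prompt builder; the wording is an assumption, since the
# Llama papers do not publish their SQuAD prompt template.
def build_squad_prompt(context: str, question: str) -> str:
    return (
        "Answer each question using information in the preceding "
        "background paragraph. If there is not enough information "
        'provided, answer with "Not in background."\n\n'
        f"Background: {context}\n\n"
        f"Q: {question}\n\n"
        "A:"
    )
```

With SQuAD 2.0 specifically, the abstention instruction matters: without some agreed-upon "unanswerable" string, the NoAns half of the EM score is essentially determined by how the harness post-processes free-form refusals.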
@changyuying have you fixed it? I'm trying to test GPT-4o on SQuAD v2 using this command:
```shell
lm_eval --model openai-chat-completions --model_args model=gpt-4o --tasks squadv2
```
It returns a 0.26 exact score. I also tested open-source models like Llama 3 and Mistral NeMo, but the results seem random to me and far from the paper's numbers.
Should I change the script or something? That's why I'm using this library, to run the evaluation with one command; if I have to change it and write the code myself, I don't get the point of this repo 😅
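One variable worth pinning down before changing any code is the generation settings. A possible variation of the command above, making the shot count and sampling behavior explicit, is sketched below; the flag names come from lm-evaluation-harness, but check `lm_eval --help` on your installed version, since the `temperature=0` choice here is an assumption, not a documented recommendation for this task.

```shell
# Hypothetical variant: pin generation settings explicitly instead of
# relying on task defaults (verify flags against your harness version).
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4o \
  --tasks squadv2 \
  --num_fewshot 0 \
  --gen_kwargs temperature=0
```

If scores still look random across models with fixed settings, the discrepancy is more likely in the prompt template or answer extraction than in decoding.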
We are working on replicating the Reading Comprehension tasks from the Llama 2 paper and have not yet been able to replicate the reported results for the SQuAD 2.0 task (both 0-shot and few-shot). The paper does not specify any generation settings, so we ran with the defaults.
Here is the command we have run:
Here are the Exact Match (EM) scores for 0-shot.