Hi, thank you for the interesting work!
I am trying to reproduce the LLaMA-2-7b results on LIMAEval for the discarding method. I ran the evaluation script after generating with the released model `KCA_Llama_2_7B_Discarding_Tuning`, using the default setting, which calls `gpt-4`. My results were slightly different from the result reported in the paper (30.95).
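For context, my generation setup was roughly equivalent to the sketch below; the hub id, the placeholder prompt, greedy decoding, and the `max_new_tokens` value are my own assumptions here, not necessarily the repo's exact defaults:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical identifier for the released checkpoint; adjust to the actual hub id or local path.
model_id = "KCA_Llama_2_7B_Discarding_Tuning"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# One LIMAEval-style instruction as a placeholder prompt.
prompt = "Explain the difference between supervised fine-tuning and RLHF."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; sampling here would add another source of variance.
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```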
I initially thought this was caused by an API model update. However, the `gpt-4` snapshot points to `gpt-4-0613` according to the OpenAI documentation. Do you have a guess at why this might be happening?
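Out of curiosity, would pinning the judge to the snapshot explicitly and forcing temperature 0, roughly like the sketch below, be expected to reproduce the score exactly? (This is a minimal illustration with the prompt handling omitted, not the repo's evaluation code.)

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(prompt: str) -> str:
    """Score one response with the judge pinned to a fixed GPT-4 snapshot."""
    resp = client.chat.completions.create(
        model="gpt-4-0613",   # explicit snapshot instead of the floating "gpt-4" alias
        temperature=0,        # greedy decoding to reduce run-to-run variance
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

If the judge is simply not fully deterministic even with a pinned snapshot and temperature 0, a note on the expected variance would also be very helpful.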
For the record, the ROUGE scores on MS MARCO are also slightly off compared to Table 2.
Thanks!