18907305772 / KCA

EMNLP'2024: Knowledge Verification to Nip Hallucination in the Bud
https://arxiv.org/abs/2401.10768
Apache License 2.0

Reproducing LIMAEval results #3

Open bangawayoo opened 5 months ago

bangawayoo commented 5 months ago

Hi, thank you for the interesting work!

I am trying to reproduce the LLaMA-2-7B results on LIMAEval for the discarding method. I ran the evaluation script on generations from the released model KCA_Llama_2_7B_Discarding_Tuning, using the default setting, which calls gpt-4.

My result was slightly different from the one reported in the paper (30.95):

"hallucination_judge": { "all_scores": 0.3220338983050847, "error_cnt": 0 ,}

I initially thought this was caused by an API model update; however, according to the OpenAI documentation, the gpt-4 alias still points to the gpt-4-0613 snapshot.
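To rule out drift on my side, I also tried pinning the snapshot explicitly and setting the temperature to 0 when calling the judge, roughly like this (a sketch; judge_prompt stands in for whatever prompt the eval script actually builds):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

judge_prompt = "..."  # placeholder: the hallucination-judging prompt from the eval script

# Pin the exact snapshot instead of the floating "gpt-4" alias, and use
# temperature 0 to make the judge as deterministic as the API allows.
resp = client.chat.completions.create(
    model="gpt-4-0613",
    messages=[{"role": "user", "content": judge_prompt}],
    temperature=0,
)
verdict = resp.choices[0].message.content
```

Even with a pinned snapshot and temperature 0, the API is not fully deterministic, so some run-to-run variance in the judge may be unavoidable.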

Do you have a guess at why this might be happening?

For the record, the ROUGE scores on MS MARCO are also slightly off compared to Table 2:

{ "ROUGE-1": 31.27, "ROUGE-2": 20.0, "ROUGE-L": 27.57, "ROUGE-Lsum": 27.7 }

Thanks!

bangawayoo commented 2 months ago

@18907305772 @fanqiwan Hi, do you have any updates on this?

Thanks.