18907305772 / KCA

EMNLP'2024: Knowledge Verification to Nip Hallucination in the Bud
https://arxiv.org/abs/2401.10768
Apache License 2.0

Reproducing LIMAEval results #3

Open bangawayoo opened 5 months ago

bangawayoo commented 5 months ago

Hi, thank you for the interesting work!

I am trying to reproduce the LLaMA-2-7B results on LIMAEval for the discarding method. I ran the evaluation script on generations from the released model KCA_Llama_2_7B_Discarding_Tuning, using the default setting, which calls gpt-4.

My result was slightly different from the one reported in the paper (30.95):

"hallucination_judge": { "all_scores": 0.3220338983050847, "error_cnt": 0 ,}

I initially thought this was caused by an API model update; however, according to the OpenAI documentation, the gpt-4 alias still points to the gpt-4-0613 snapshot.
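To rule out drift on my side, I also tried pinning the snapshot explicitly and setting the temperature to 0 when calling the judge, roughly like this (a sketch; judge_prompt stands in for whatever prompt the eval script actually builds):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

judge_prompt = "..."  # placeholder: the hallucination-judging prompt from the eval script

# Pin the exact snapshot instead of the floating "gpt-4" alias, and use
# temperature 0 to make the judge as deterministic as the API allows.
resp = client.chat.completions.create(
    model="gpt-4-0613",
    messages=[{"role": "user", "content": judge_prompt}],
    temperature=0,
)
verdict = resp.choices[0].message.content
```

Even with a pinned snapshot and temperature 0, the API is not fully deterministic, so some run-to-run variance in the judge may be unavoidable.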

Do you have a guess at why this might be happening?

For the record, the ROUGE scores on MS MARCO are also slightly off compared to Table 2:

{ "ROUGE-1": 31.27, "ROUGE-2": 20.0, "ROUGE-L": 27.57, "ROUGE-Lsum": 27.7 }

Thanks!

bangawayoo commented 2 months ago

@18907305772 @fanqiwan Hi, do you have any updates on this?

Thanks.