ZetangForward / L-CITEEVAL

L-CITEEVAL: DO LONG-CONTEXT MODELS TRULY LEVERAGE CONTEXT FOR RESPONDING?

Result Differences #2

Closed · TreasureHunter closed this issue 1 week ago

TreasureHunter commented 1 week ago

Hi, I'm excited to see the impressive work you've done. However, I've encountered some issues while trying to replicate your findings. I started by obtaining the code from GitHub and the data from Hugging Face, and set up the environment as specified in requirements.txt. Then I ran the inference_1shot_vllm script with vLLM, using Llama-3.1-8B-Instruct as the inference model (a rough sketch of the inference call is included below for reference). Finally, I evaluated both generation and citation quality, which yielded the results outlined below.
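
For context, the inference step is roughly equivalent to the following vLLM call. This is only a minimal sketch of my setup, not the repository's inference_1shot_vllm script; the model path and sampling parameters shown here are my own assumptions.

```python
# Minimal sketch of a vLLM inference call like the one in my setup.
# NOTE: this is not the repository's inference_1shot_vllm script; the model
# path and sampling parameters are assumptions for illustration only.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling = SamplingParams(temperature=0.0, max_tokens=1024)

# `prompts` stands in for the 1-shot prompts built from the L-CiteEval data.
prompts = ["<1-shot prompt built from an L-CiteEval example>"]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```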

Paper results in citation quality:

[screenshot]

My results in citation quality:

[screenshot]

Paper results in generation quality:

[screenshot]

My results in generation quality:

[screenshot]

For tasks that have multiple evaluation files, I've computed the arithmetic mean of the per-file scores to account for their varying weights (a short sketch of this aggregation follows the list below). Upon comparing my findings with those published in your paper, I've noticed a few discrepancies:

  1. The overall trends in generation and citation quality between your paper and my experiment are generally consistent.
  2. In terms of generation quality, there is a significant discrepancy for every task except Dialogue, with the paper's results and mine differing by about five points on some metrics.
  3. For citation quality, the discrepancy is less pronounced but still noticeable, with a difference of 1-2 points.
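
For completeness, the aggregation mentioned above looks roughly like this. It is a minimal sketch with hypothetical file paths and a hypothetical JSON field name, showing only that I take a plain arithmetic mean over the per-file scores.

```python
# Minimal sketch of how I average a metric over a task's evaluation files.
# The directory layout and the "score" field name are hypothetical; they only
# illustrate that I take a plain arithmetic mean over the per-file scores.
import json
from glob import glob

def task_average(pattern: str, field: str = "score") -> float:
    """Average one metric across all evaluation files matching `pattern`."""
    scores = []
    for path in glob(pattern):
        with open(path) as f:
            scores.append(json.load(f)[field])
    return sum(scores) / len(scores)

# e.g. average across the evaluation files produced for one task
print(task_average("results/hotpotqa/*.json"))
```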

I'm eager to hear your insights on this matter! Thank you in advance.

ZetangForward commented 1 week ago

Thank you for your interest in our work. We have just reproduced our results, and they are the same as those reported in our paper. Our evaluation results for generation quality are as follows:

[screenshot of the overall generation-quality results]

The per-task scores are shown in the screenshots below:

narrativeqa: [screenshot]
natural_questions: [screenshot]
hotpotqa: [screenshot]
2wikimultihopqa: [screenshot]
gov_report: [screenshot]
multi_news: [screenshot]
qmsum: [screenshot]

We have found that different environments can have a significant impact on the results. For example, we noticed that your ROUGE-L score is quite low. We tested the ROUGE-L score in another environment, which also includes the rouge-1.1 version of the package; however, that environment contains other rouge-related packages, as listed below:

rouge 1.0.1
rouge-chinese 1.0.3
rouge_score 0.1.2

Thus, the results we obtained from this environment are closer to yours:

[screenshot]

However, when using the environment we provided, the reproduced results should be as follows (these are the same as those reported in our paper):

[screenshot]

Could you kindly provide your environment details?

We have also removed unnecessary dependencies from the original repository requirements to help you troubleshoot the issue more effectively.
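
In the meantime, the quickest way to see whether you are hitting the same version issue is to score a single hypothesis/reference pair with the different ROUGE packages and compare. The snippet below is only a minimal illustration of that version sensitivity, not our evaluation code; the example sentences are made up.

```python
# Minimal illustration that different ROUGE packages can return different
# ROUGE-L scores for the same hypothesis/reference pair. This is not our
# evaluation code; the example sentences are made up.
from rouge import Rouge                # the `rouge` package (e.g. 1.0.1)
from rouge_score import rouge_scorer   # the `rouge_score` package (0.1.2)

hyp = "the report summarizes the committee findings on water policy"
ref = "the report summarizes the findings of the committee on water policy"

# `rouge`: get_scores(hypothesis, reference) -> recall/precision/f per metric
f_rouge = Rouge().get_scores(hyp, ref)[0]["rouge-l"]["f"]

# `rouge_score`: score(target, prediction) -> Score(precision, recall, fmeasure)
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
f_rouge_score = scorer.score(ref, hyp)["rougeL"].fmeasure

print(f"rouge        ROUGE-L F1: {f_rouge:.4f}")
print(f"rouge_score  ROUGE-L F1: {f_rouge_score:.4f}")
```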

TreasureHunter commented 1 week ago

I encountered some issues while setting up the environment. Although I can run the code, I was unable to verify my rouge version (using the pip list command, I saw only rouge-chinese 1.0.3 listed). Consequently, I installed the rouge package at version 1.0.1 and re-evaluated the results. This time, the ROUGE-L results closely match yours, with a score of 21.52 on the summary task (using rouge 1.0.1 and rouge-chinese 1.0.3). Thanks for your great help!
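
For anyone who runs into the same problem, something like the following can be used to check which rouge distributions are actually installed, instead of relying on pip list output alone. This is a minimal sketch; the distribution names are assumptions and may need adjusting (e.g. rouge-score vs rouge_score).

```python
# Minimal sketch: query the installed versions of the rouge-related packages
# directly, instead of relying on `pip list` output alone. Distribution names
# are assumptions and may need adjusting (e.g. "rouge-score" vs "rouge_score").
from importlib.metadata import PackageNotFoundError, version

for dist in ("rouge", "rouge-chinese", "rouge_score", "rouge-score"):
    try:
        print(f"{dist}: {version(dist)}")
    except PackageNotFoundError:
        print(f"{dist}: not installed")
```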

ZetangForward commented 1 week ago

Ok, happy to hear that.