Thank you for your interest in our work. We have just reproduced our results, and they are the same as those reported in our paper. Our generation-quality evaluation results are as follows:
Where the score of each task is reported for: narrativeqa, natural_questions, hotpotqa, 2wikimultihopqa, gov_report, multi_news, qmsum.
We have found that different environments have a significant impact on the results. For example, we noticed that your ROUGE-L score is quite low. We tested the ROUGE-L score in another environment, which also includes the rouge-1.1 version of the package. However, this environment contains other rouge-related packages, listed below:
rouge 1.0.1
rouge-chinese 1.0.3
rouge_score 0.1.2
Thus, the results we obtained from this environment are closer to yours:
However, when using the environment we provided, the reproduced results should be as follows (these are the same as those reported in our paper):
Could you kindly provide your environment details?
We have also removed unnecessary dependencies from the original repository requirements to help you troubleshoot the issue more effectively.
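To make the difference concrete, here is a minimal, illustrative comparison of the ROUGE-L F1 produced by the rouge and rouge_score packages on a toy sentence pair. This is only a sketch under the package versions listed above, not our evaluation script, and the example strings are made up:

```python
# Illustrative only: compare ROUGE-L F1 from two common implementations on a
# made-up hypothesis/reference pair (not the paper's evaluation data or code).
# Assumes: pip install rouge==1.0.1 rouge-score==0.1.2
from rouge import Rouge                 # the `rouge` package
from rouge_score import rouge_scorer    # the `rouge_score` package

hypothesis = "the report summarizes the committee findings on water quality"
reference = "the report summarizes the findings of the committee on water quality"

# `rouge`: its own tokenization, no stemming by default
score_a = Rouge().get_scores(hypothesis, reference)[0]["rouge-l"]["f"]

# `rouge_score`: different tokenization, optional Porter stemming
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
score_b = scorer.score(reference, hypothesis)["rougeL"].fmeasure

print(f"rouge       ROUGE-L F1: {score_a:.4f}")
print(f"rouge_score ROUGE-L F1: {score_b:.4f}")
```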
I encountered some issues while setting up the environment. Although I can run the code, I was unable to verify my rouge version (using the pip list command, I see only rouge-chinese 1.0.3). Consequently, I installed the rouge package at version 1.0.1 and re-evaluated the results. This time, the ROUGE-L results closely match yours, with a score of 21.52 on the summarization task (using rouge 1.0.1 and rouge-chinese 1.0.3). Thanks for your great help!
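In case it helps anyone else, here is a quick way to check which rouge distributions are actually installed in the active environment (a minimal sketch; the package names are the ones mentioned in this thread):

```python
# Minimal check of which rouge-related distributions are installed in the
# active environment; package names taken from this thread.
from importlib.metadata import PackageNotFoundError, version

for name in ("rouge", "rouge-chinese", "rouge_score"):
    try:
        print(f"{name}: {version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")
```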
Ok, happy to hear that.
Hi, I'm excited to see the impressive work you've done. However, I've encountered some issues while trying to replicate your findings. I started by setting up the environment using the code from GitHub and the data from Hugging Face, and prepared the environment as specified in the requirements.txt file. I then ran the inference_1shot_vllm script with vLLM, using Llama-3.1-8B-Instruct as the inference model. Following this, I assessed both generation and citation quality, yielding the results outlined below.
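Roughly, the inference step boils down to a vLLM call along these lines (a minimal sketch with placeholder prompt and sampling settings, not the repository's inference_1shot_vllm script):

```python
# Minimal vLLM generation sketch; the prompt and sampling settings are
# placeholders, not the settings used by the repository's script.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["<one-shot prompt with passages and question goes here>"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```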
For tasks that have multiple evaluation files, I computed the arithmetic mean of the per-file scores, without weighting by the files' varying sizes. Upon comparing my results with those published in your paper, I noticed a few discrepancies:
I'm eager to hear your insights on this matter! Thank you in advance.
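To be explicit about the aggregation mentioned above, this is all I did per task (the file names and scores below are hypothetical, purely for illustration):

```python
# Hypothetical per-file scores for a task with multiple evaluation files;
# names and values are made up purely for illustration.
per_file_scores = {
    "eval_file_a.json": 21.3,
    "eval_file_b.json": 22.0,
    "eval_file_c.json": 21.1,
}

# Plain arithmetic mean, treating every file equally (no weighting by size).
task_score = sum(per_file_scores.values()) / len(per_file_scores)
print(f"task score: {task_score:.2f}")
```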