HuskyInSalt / CRAG

Corrective Retrieval Augmented Generation

Questions about the experiment #17

Open yjoonjang opened 4 months ago

yjoonjang commented 4 months ago

Hi @HuskyInSalt, I found CRAG very interesting and would like to introduce it to my lab. However, I have some questions about the experiments.

  1. I can see a difference between Table 1 and Table 2, even though they report the same experiments. In Table 1, the accuracy of LLaMA2-hf-7b-CRAG on the PopQA dataset is 39.8, but in Table 2 the accuracy for the same dataset and the same method is 47.3. Why do these differ?

  2. I don't understand the difference between "LLaMA2-hf-7b + RAG" and "LLaMA2-7B + Baselines with retrieval". Can you explain why these scores differ?

Thank you.

HuskyInSalt commented 4 months ago

Hi, @yjoonjang! Thanks for your comment. For question 1, it seems we may have forgotten to update the numbers in Table 1: the code framework and prompts were continuously improved during the experiments, and Table 1 was released much earlier. I apologize if that is the case.

As for question 2, "LLaMA2-7B + Baselines with retrieval" is cited directly from the paper Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection; it was not reproduced by us.

yjoonjang commented 4 months ago

Thank you for your response, @HuskyInSalt. I have a few more questions.

  1. Can you share the full results for "LLaMA2-7B + CRAG"? I believe the results for PubHealth and ARC are also under-reported. (https://github.com/HuskyInSalt/CRAG/issues/8)

  2. Can you explain the difference between "LLaMA2-7B + Baselines with retrieval" from the Self-RAG paper and "LLaMA2-hf-7b + RAG" from the CRAG paper?