TIGER-AI-Lab / LongRAG

Official repo for "LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs".
https://tiger-ai-lab.github.io/LongRAG/
MIT License

Could not reproduce the answer recall for NQ dataset #3

Open tyu008 opened 3 months ago

tyu008 commented 3 months ago

Hi, I load nq/full-00000-of-00001.parquet and compute the answer recall based on:

    answers, context = item["answer"], item["context"]
    is_retrieval = has_correct_answer(context, answers)

I could only get an answer recall of 0.8532, which is below the 88.53 reported in Table 1 of the paper.
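For reference, here is a minimal sketch of the loop I use to compute the recall (assuming pandas for reading the parquet; the substring-based has_correct_answer below is my own approximation and may differ from the repo's helper):

```python
import pandas as pd

def has_correct_answer(context: str, answers) -> bool:
    # Approximate check: does any gold answer appear in the retrieved context?
    return any(ans.lower() in context.lower() for ans in answers)

df = pd.read_parquet("nq/full-00000-of-00001.parquet")

hits = 0
for _, item in df.iterrows():
    answers, context = item["answer"], item["context"]
    if has_correct_answer(context, answers):
        hits += 1

print(f"answer recall: {hits / len(df):.4f}")
```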

XMHZZ2018 commented 3 months ago

Hi @tyu008, thanks for raising this question. First, nq/full-00000-of-00001.parquet corresponds to the num_retrieval_units = 4 line, not the num_retrieval_units = 8 line. Our QA results show that the ideal context length for existing LLMs is around 30K tokens: with a longer context, even though retrieval performance is higher, the final QA result degrades (as shown in Figure 3). Therefore, the correct target number is 86.30 rather than 88.53. I will mark this more clearly in the repository.

Second, you are right. The current version's retrieval accuracy is 85.30. There is still a one-point gap between 85.30 and the 86.30 reported in the paper. I suspect I may have uploaded an older version of the final result. I will take a look and upload the new one.

Thanks again for pointing it out!

tyu008 commented 3 months ago

Hi @XMHZZ2018, thanks so much for your quick reply. I am also trying to reproduce the results of the max-P method on the NQ dataset. Following the paper, I divide each group into 512-token snippets and use the maximum snippet similarity as the similarity for the group, but I could only obtain 67.5% answer recall using the top-1 group, which is below the 71.69 reported in the paper. I am using the exact same model, BAAI/bge-large-en-v1.5, with fp16. I suspect my chunking of the groups is misaligned with yours. Could you share the cropped 512-token snippets? Thanks again!
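For clarity, this is roughly how I score groups with max-P (a sketch only, assuming sentence-transformers; snippet_groups here is a placeholder for one list of 512-token snippet strings per group):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def max_p_scores(question: str, snippet_groups):
    # snippet_groups: list of groups, each a list of 512-token snippet strings.
    q_emb = model.encode([question], normalize_embeddings=True)[0]
    scores = []
    for snippets in snippet_groups:
        s_emb = model.encode(snippets, normalize_embeddings=True)
        # max-P: group similarity = max cosine similarity over its snippets.
        scores.append(float(np.max(s_emb @ q_emb)))
    return scores

# Top-1 group is the argmax of the max-P scores:
# top1 = int(np.argmax(max_p_scores(question, snippet_groups)))
```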

XMHZZ2018 commented 3 months ago

@tyu008 Sure! I think I know the main reason. I avoid cross-document chunking: if a new document starts, it goes into the next chunk. Previously, when I did cross-document chunking, I observed about a 5% to 10% degradation. (This issue is even more severe in HotpotQA, since the documents there are even shorter.) I assume the same thing happened to you. I will upload my chunking file to Hugging Face soon so you can reproduce the results, and I will ping you here once it is up.
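In the meantime, here is a rough sketch of what I mean by avoiding cross-document chunking (using the transformers tokenizer for BAAI/bge-large-en-v1.5; this is illustrative, not the exact script behind the file I will upload):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")

def chunk_group(documents, chunk_size=512):
    """Chunk each document separately so no snippet crosses a document boundary."""
    snippets = []
    for doc in documents:  # documents: the list of document strings forming one group
        ids = tokenizer.encode(doc, add_special_tokens=False)
        for start in range(0, len(ids), chunk_size):
            snippets.append(tokenizer.decode(ids[start:start + chunk_size]))
    return snippets
```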

tyu008 commented 3 months ago

@XMHZZ2018 Got it! Thanks a lot for your quick reply! Looking forward to your chunking file.