Performance on BM25 retrieval baseline

sutakori commented 2 years ago

I am running run_eval_rag_re.sh on the BM25 baseline and seeing much higher retrieval results than I expected:

4168it [08:02,  9.60it/s]INFO:__main__:Using BM25 for retrieval
4176it [08:02,  9.88it/s]INFO:__main__:Using BM25 for retrieval
4184it [08:03,  8.91it/s]INFO:__main__:Using BM25 for retrieval
4192it [08:04, 10.06it/s]INFO:__main__:Using BM25 for retrieval
4201it [08:05,  8.65it/s]
INFO:__main__:Using BM25 for retrieval
INFO:__main__:Doc_Prec@1:  43.18
INFO:__main__:Doc_Prec@5:  67.20
INFO:__main__:Doc_Prec@10:  74.53
INFO:__main__:Pid_Prec@5:  19.45
INFO:__main__:Pid_Prec@5:  40.75
INFO:__main__:Pid_Prec@10:  48.56
INFO:__main__:all:  43.18 &  67.20 &  74.53  &  19.45 &  40.75 &  48.56 &

Settings: domain=all seg=token score=original task=grounding split=val

Additional parameters: --bm25 ../data/mdd_kb/mdd-$seg-$domain.csv

Input files are generated by the predecessor scripts with the same settings. The data is generated by run_data_preprocessing.sh. Index files are generated by run_kb_index.sh. Checkpoints are generated by run_finetune_rag.sh, with DPR checkpoints generated by running run_converter.sh on the finetuned DPR checkpoints. (If I am not mistaken, the RAG checkpoints are required by the code but do not affect the results of run_eval_rag_re.sh when --bm25 is given.)

Is there any mistake in my usage or understanding?

By the way, I am a bit confused about the grounding span generation task (Table 4) in the paper. Does it correspond to the result of run_eval_rag_re.sh? That script's output doesn't contain F1, EM and BL. Also, does D^token-rr-cls-ft mean joint training of the DPR question encoder and the RAG generator, while D^token-ft uses the finetuned DPR directly? I would appreciate it if you could clarify these points.

songfeng commented 2 years ago

Thank you for the questions!

The retrieval results reported in the paper are all passage retrieval results (i.e., "Pid_Prec@n"), not document retrieval results (i.e., "Doc_Prec@n" in the output). The passage results you got are quite comparable to the last three columns in Table 4.
run_eval_rag_re.sh only provides the retrieval results. For text generation evaluation scores (F1, EM, BL in Table 4), please refer to run_eval_rag_e2e.sh.
D^token-*-ft means that we use finetuned DPR encoders for the document index (run_kb_index.sh) and the biencoder for the Retriever Module in RAG (run_converter.sh).
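
To illustrate why document-level precision comes out higher than passage-level precision on the same retrieval run, here is a minimal sketch. It is not the repository's evaluation code: the function names, the toy ids, and the assumption that a passage id has the form `<doc_id>_<index>` are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def precision_at_k(retrieved: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Percentage of queries whose gold id appears among the top-k retrieved ids."""
    hits = sum(1 for ids, g in zip(retrieved, gold) if g in ids[:k])
    return 100.0 * hits / len(gold)


def doc_of(pid: str) -> str:
    """Map a passage id to its document id (assumed '<doc_id>_<index>' format)."""
    return pid.rsplit("_", 1)[0]


# Hypothetical toy run: two queries, three ranked passages each.
retrieved_pids: List[List[str]] = [
    ["docA_3", "docA_1", "docB_0"],
    ["docC_2", "docB_1", "docB_4"],
]
gold_pids: List[str] = ["docA_1", "docB_4"]

# Document-level view of the same run.
retrieved_docs = [[doc_of(p) for p in ids] for ids in retrieved_pids]
gold_docs = [doc_of(p) for p in gold_pids]

doc_p1 = precision_at_k(retrieved_docs, gold_docs, 1)  # right document at rank 1 for query 1 only
pid_p1 = precision_at_k(retrieved_pids, gold_pids, 1)  # exact passage at rank 1 for neither query
pid_p3 = precision_at_k(retrieved_pids, gold_pids, 3)  # exact passage within top 3 for both queries
```

Under this mapping, a passage hit always implies a document hit, so the document-level number can never be lower than the passage-level number at the same k.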

sutakori commented 2 years ago

Thanks for your reply! So if I am not mistaken, Table 5 is from run_eval_rag_re.sh, and Tables 4 and 6 are from run_eval_rag_e2e.sh, with the task set to grounding and generation, is that right? I mistakenly took D^token-ft to be DPR and D^token-rr-cls-ft to be RAG, so they are all RAG? I am also still confused about the difference between D^token-ft and the *-rr-* variants.

songfeng commented 2 years ago

Tables 4 and 6 contain retrieval evaluation (run_eval_rag_re.sh) and text generation evaluation (run_eval_rag_e2e.sh) results from RAG models. Table 5 contains DPR retrieval results, not RAG.
-rr- corresponds to reranking the passages retrieved by the RAG retriever based on the retrieval results obtained with only the current turn, where the embedding of the current turn is based on [CLS] (rr-cls) or pooled (rr-pl). Please see Section 3.2 of the paper and the code for reference.
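
As a rough illustration of the two current-turn embeddings mentioned above, the sketch below reranks already-retrieved passages by their dot product with a current-turn vector built either from the first ([CLS]) token state or from the mean over all token states. This is only one possible reading, not the repository's implementation: all names and toy vectors are hypothetical, and the actual scoring in Section 3.2 of the paper may combine the current-turn score with other signals.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cls_embedding(token_states: Sequence[Vector]) -> Vector:
    """Use the first token's hidden state (the [CLS] position) as the turn embedding."""
    return list(token_states[0])


def mean_pooled_embedding(token_states: Sequence[Vector]) -> Vector:
    """Average the hidden states over all tokens of the turn."""
    n = len(token_states)
    return [sum(col) / n for col in zip(*token_states)]


def rerank(passage_ids: Sequence[str], passage_vecs: Sequence[Vector], turn_vec: Vector) -> List[str]:
    """Sort retrieved passages by dot-product similarity to the current-turn embedding."""
    scored = sorted(
        zip(passage_ids, passage_vecs),
        key=lambda pv: dot(pv[1], turn_vec),
        reverse=True,
    )
    return [pid for pid, _ in scored]


# Hypothetical toy data: token states of the current turn and three retrieved passages.
turn_token_states = [[1.0, 0.0], [0.0, 3.0], [0.0, 3.0]]
pids = ["p1", "p2", "p3"]
pvecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

order_cls = rerank(pids, pvecs, cls_embedding(turn_token_states))
order_pooled = rerank(pids, pvecs, mean_pooled_embedding(turn_token_states))
```

With these toy vectors the two embeddings produce different orderings, which is the only point of the example: the choice between the [CLS] state and a pooled state can change which passages end up on top.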

sutakori commented 2 years ago

Ok, I've got it, thank you for your prompt reply!

IBM / multidoc2dial

Performance on BM25 retrieval baseline #3