AkariAsai / self-rag

This includes the original implementation of SELF-RAG: Learning to Retrieve, Generate and Critique through self-reflection by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.
https://selfrag.github.io/
MIT License

Reproducing the ASQA numbers #4

Open gangiswag opened 1 year ago

gangiswag commented 1 year ago

Hi, I was unable to reproduce the ASQA numbers for long-form generation. After evaluating the output with ALCE, I see the below numbers which are very different from those reported in the paper:

The command I used:

python run_long_form_static.py \
  --model_name selfrag/selfrag_llama2_7b --ndocs 5 --max_new_tokens 300 \
  --threshold 0.2 --use_grounding --use_utility --use_seqscore --task asqa \
  --input_file eval_data/asqa_eval_gtr_top100.json \
  --output_file asqa/selfrag_llama2_7b.json --max_depth 7 --mode always_retrieve

I have also uploaded the model output file here for your reference. Just wanted to know whether I am doing anything wrong for ASQA.

Btw, I did a sanity check by evaluating on short-form generation with PopQA and I see 55.0 for accuracy, which matches the number reported in the paper.
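
For reference, the PopQA accuracy comes from the --metric match option, i.e. a simple substring-match check. Below is a minimal sketch of that kind of metric, not necessarily the repo's exact implementation:

def match_accuracy(predictions, gold_answers_list):
    # A prediction counts as correct if any gold answer string
    # appears verbatim (case-insensitively) in the generated text.
    correct = 0
    for pred, golds in zip(predictions, gold_answers_list):
        if any(g.lower() in pred.lower() for g in golds):
            correct += 1
    return 100.0 * correct / len(predictions)

# e.g. match_accuracy(["George Orwell wrote 1984."], [["George Orwell", "Orwell"]]) -> 100.0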

AkariAsai commented 1 year ago

Hi, thank you so much for reporting! Hmm, the citation recall and precision look particularly low... Let me look into this tomorrow.

gangiswag commented 1 year ago

Hi, apologies for pinging again, but just checking in on this to see if you have any update?

Thanks so much!

AkariAsai commented 1 year ago

Sorry for my late response! I was busy with other commitments over the past two weeks. I think the issue may stem from some code changes I made during refactoring, but I haven't gone through the diffs line by line yet. Do you mind if I get back to you early next week? I can also upload our model prediction file first if that helps!

gangiswag commented 1 year ago

No worries! Early next week sounds good. Yes, having access to the model outputs will be helpful for now :)

AkariAsai commented 1 year ago

Sorry for my late response! This is the link to our 7B prediction results: Google Drive

Here's the output of the asqa eval.py script.

 {
    "length": 29.829113924050635,
    "str_em": 29.957805907172997,
    "str_hit": 8.544303797468354,
    "rougeLsum": 35.7030296755528,
    "QA-EM": 18.568917018284107,
    "QA-F1": 24.01608779257571,
    "QA-Hit": 3.2700421940928273,
    "mauve": 74.3314936476492,
    "citation_rec": 66.96554149085794,
    "citation_prec": 67.81821378340366
}

I'm still investigating the gap in citation recall and precision, but someone just found a bug in our long-form QA script that I mistakenly introduced during refactoring, and I am currently re-running the evaluations. I'll keep you posted!
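
In case it helps anyone comparing their own run against these numbers, here is a small sketch that prints per-metric deltas between two eval outputs; the file names below are placeholders:

import json

def compare_metrics(reference_path, reproduced_path):
    # Print per-metric deltas between a released eval output and a local re-run,
    # to make it obvious which metrics (e.g. citation_rec / citation_prec) diverge.
    with open(reference_path) as f:
        ref = json.load(f)
    with open(reproduced_path) as f:
        rep = json.load(f)
    for key in sorted(set(ref) & set(rep)):
        print(f"{key:>15}: ref={ref[key]:8.3f}  repro={rep[key]:8.3f}  diff={rep[key] - ref[key]:+8.3f}")

# Hypothetical file names:
# compare_metrics("asqa_7b_author_eval.json", "asqa_7b_repro_eval.json")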

gangiswag commented 1 year ago

Thanks for sharing this! Please let me know whenever you have updated the long-form QA script and I will try it out again.

XuLingnan commented 11 months ago

Hello, I also encountered a similar situation when reproducing the ASQA numbers for the 13B model.

I wonder if you could also share the 13B prediction results. Thanks a lot.

AkariAsai commented 11 months ago

Sorry for the delay on this issue; I've been busy helping wrap up some other projects and traveling over the past few weeks. I can upload the 13B results tomorrow and will take a closer look at the code base.

AkariAsai commented 11 months ago

Here are the 13B predictions (Google Drive) and results:

{
    "length": 27.029535864978904,
    "str_em": 31.66139240506329,
    "str_hit": 8.438818565400844,
    "rougeLsum": 36.0146483715914,
    "QA-EM": 20.386779184247537,
    "QA-F1": 26.404630941269915,
    "QA-Hit": 2.9535864978902953,
    "mauve": 71.59056482735427,
    "citation_rec": 70.35387783805504,
    "citation_prec": 71.26280892103678
}

Jack-ZC8 commented 7 months ago

Hi, apologies for pinging, but it seems I've run into the same issue... I would appreciate any possible solution!

ShayekhBinIslam commented 6 months ago

@AkariAsai I'm facing the same issue with ASQA citation precision and recall. Here is the diff between the author output and my reproduced output: https://www.diffchecker.com/HLAGTddk/
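
In case it helps localize the divergence, here is a rough sketch for comparing two prediction files example by example rather than as raw text; the "output" field name and the JSON-list format are assumptions about how the predictions are stored:

import json

def diff_predictions(author_path, repro_path, text_key="output"):
    # Report which example indices have different generations, so the
    # citation-metric gap can be traced back to specific questions.
    with open(author_path) as f:
        author = json.load(f)
    with open(repro_path) as f:
        repro = json.load(f)
    mismatched = [i for i, (a, b) in enumerate(zip(author, repro))
                  if a.get(text_key) != b.get(text_key)]
    print(f"{len(mismatched)} / {min(len(author), len(repro))} examples differ")
    return mismatched

# Hypothetical usage: diff_predictions("author_asqa_7b.json", "repro_asqa_7b.json")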

Zg-Serein commented 6 months ago

Hi, I was unable to reproduce the ASQA numbers for long-form generation. After evaluating the output with ALCE, I see the below numbers which are very different from those reported in the paper:

  • 'str_em': 30.05098452883263
  • 'rougeLsum': 34.10838297032821
  • 'mauve': 68.43516667345226
  • 'citation_rec': 50.0210970464135
  • 'citation_prec': 63.60759493670886

The command I used:

python run_long_form_static.py \
  --model_name selfrag/selfrag_llama2_7b --ndocs 5 --max_new_tokens 300 \
  --threshold 0.2 --use_grounding --use_utility --use_seqscore --task asqa \
  --input_file eval_data/asqa_eval_gtr_top100.json \
  --output_file asqa/selfrag_llama2_7b.json --max_depth 7 --mode always_retrieve

I have also uploaded the model output file here for your reference. Just wanted to know whether I am doing anything wrong for ASQA.

Btw, I did a sanity check by evaluating on short-form generation with PopQA and I see 55.0 for accuracy, which matches the number reported in the paper.

Hi, I would like to ask you about the retrieval and evaluation settings for PopQA. I ran retrieval with the retriever and settings from the paper, but the subsequent evaluation accuracy was only 0.42, far lower than the 0.55 reported in the paper. Could there be a problem with my setup? Here are the retrieval and evaluation scripts I used:

python passage_retrieval.py \
  --model_name_or_path facebook/contriever-msmarco --passages psgs_w100.tsv \
  --passages_embeddings "wikipedia_embeddings/*" \
  --data INPUT_FILE \
  --output_dir OUTPUT_FILE \
  --n_docs 20

python run_short_form.py \
  --model_name ./model/models--selfrag--selfrag_llama2_7b \
  --input_file ./ret_out/my_retrieval_output2.jsonl \
  --mode adaptive_retrieval --max_new_tokens 100 \
  --threshold 0.2 \
  --output_file output/out2 \
  --metric match --ndocs 20 --use_groundness --use_utility --use_seqscore \
  --dtype half > ./log/nohup.my_eval0_20 2>&1 &
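
One way to narrow down whether the drop comes from retrieval or from generation is to check how often a gold answer string appears in the top-k retrieved passages at all. A rough sketch, assuming a JSON-lines retrieval output with "ctxs" and "answers" fields (adjust to the actual format):

import json

def retrieval_hit_rate(retrieval_output_path, k=20):
    # Fraction of questions where at least one gold answer string appears
    # in the text of the top-k retrieved passages.
    with open(retrieval_output_path) as f:
        data = [json.loads(line) for line in f if line.strip()]
    hits = 0
    for ex in data:
        passages = " ".join(p.get("text", "") for p in ex.get("ctxs", [])[:k]).lower()
        if any(str(ans).lower() in passages for ans in ex.get("answers", [])):
            hits += 1
    return hits / len(data)

# Hypothetical usage:
# print(retrieval_hit_rate("./ret_out/my_retrieval_output2.jsonl", k=20))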

aiden-leong commented 3 months ago

> Sorry for my late response! This is the link to our 7B prediction results: Google Drive
>
> Here's the output of the asqa eval.py script.
>
> {
>     "length": 29.829113924050635,
>     "str_em": 29.957805907172997,
>     "str_hit": 8.544303797468354,
>     "rougeLsum": 35.7030296755528,
>     "QA-EM": 18.568917018284107,
>     "QA-F1": 24.01608779257571,
>     "QA-Hit": 3.2700421940928273,
>     "mauve": 74.3314936476492,
>     "citation_rec": 66.96554149085794,
>     "citation_prec": 67.81821378340366
> }
>
> I'm still investigating the gap in citation recall and precision, but someone just found a bug in our long-form QA script that I mistakenly introduced during refactoring, and I am currently re-running the evaluations. I'll keep you posted!

Here is what I get when I re-run the evaluation:

{
    'length': 29.89873417721519, 
    'str_em': 30.226793248945143, 
    'str_hit': 8.755274261603375, 
    'rougeLsum': 35.75958018700113, 
    'QA-EM': 18.52496483825598, 
    'QA-F1': 24.03806388258978, 
    'QA-Hit': 3.2700421940928273, 
    'mauve': 76.23131396071514, 
    'citation_rec': 50.18811533052039, 
    'citation_prec': 63.92405063291139
}
