AkariAsai / self-rag

This repository includes the original implementation of SELF-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.
https://selfrag.github.io/
MIT License

Retrieval-augmented baselines - Huggingface models #61

Open mozhu621 opened 8 months ago

mozhu621 commented 8 months ago

Running `python run_baseline_refactor.py` gives:

```
python: can't open file 'run_baseline_refactor.py': [Errno 2] No such file or directory
```

This Python file doesn't exist; I think the script is still `run_baseline_lm.py`, right? Beyond that, I'm getting very low results when running it. Can you share the exact command line you used? Mine is:

```bash
python run_baseline_lm.py \
  --model_name meta-llama/Llama-2-7b-hf \
  --input_file eval_data/health_claims_processed.jsonl \
  --max_new_tokens 100 \
  --metric match \
  --result_fp RESULT_FILE_PATH \
  --task qa \
  --mode retrieval \
  --prompt_name "prompt_no_input_retrieval"
```

Overall result: 0.0070921985815602835
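For what it's worth, one thing to check with `--metric match` is how generations are normalized before scoring. The sketch below is a hypothetical illustration of a typical containment-style match metric, not the repository's actual scoring code; if the real metric compares raw strings, verbose Llama-2 outputs could easily score near zero even when they contain the answer.

```python
# Hypothetical illustration (not the repo's code): "match" metrics are often
# implemented as a containment check after light normalization. Without
# normalization, long free-form generations rarely match the gold string.
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def match_score(prediction: str, gold_answers: list[str]) -> float:
    """1.0 if any normalized gold answer occurs in the normalized prediction."""
    pred = normalize(prediction)
    return float(any(normalize(g) in pred for g in gold_answers))

# A verbose generation still matches after normalization.
print(match_score("The claim is supported, i.e. True.", ["true"]))  # 1.0
```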

mozhu621 commented 8 months ago

And I ran the PubHealth example:

```bash
python run_baseline_lm.py \
  --model_name meta-llama/Llama-2-7b-hf \
  --input_file eval_data/health_claims_processed.jsonl \
  --max_new_tokens 20 \
  --metric accuracy \
  --result_fp llama2_7b_pubhealth_results.json \
  --task fever
```

Overall result: 0.1702127659574468, which is very different from Table 1 in the paper. I don't know what happened.
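A possible culprit for the 0.17 accuracy: PubHealth is scored as true/false claim verification, and a base Llama-2 model often answers in free-form prose rather than a bare "true"/"false". Below is a hypothetical sketch (not the repository's evaluation code) of how generations might be mapped onto labels before computing accuracy; if that mapping fails, unparseable outputs count as wrong and accuracy can fall well below chance.

```python
# Hypothetical sketch (not the repo's evaluation code): map free-form
# generations onto the PubHealth labels, then compute exact-match accuracy.
def to_label(generation: str) -> str | None:
    """Heuristically map a free-form generation to 'true' or 'false'."""
    text = generation.strip().lower()
    if text.startswith("true") or "is true" in text:
        return "true"
    if text.startswith("false") or "is false" in text:
        return "false"
    return None  # unparseable outputs are scored as incorrect

def accuracy(preds: list[str], golds: list[str]) -> float:
    correct = sum(to_label(p) == g.lower() for p, g in zip(preds, golds))
    return correct / len(golds)

print(accuracy(["The claim is false because ...", "True."], ["false", "true"]))  # 1.0
```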

hummingbird2030 commented 7 months ago

> And I ran the PubHealth example:
>
> ```bash
> python run_baseline_lm.py \
>   --model_name meta-llama/Llama-2-7b-hf \
>   --input_file eval_data/health_claims_processed.jsonl \
>   --max_new_tokens 20 \
>   --metric accuracy \
>   --result_fp llama2_7b_pubhealth_results.json \
>   --task fever
> ```
>
> Overall result: 0.1702127659574468, which is very different from Table 1 in the paper. I don't know what happened.

I get the same result on PubHealth, which is lower than the result reported in the paper.

SoseloX commented 6 months ago

Has this issue been resolved?

makexine commented 5 months ago

Has this issue been resolved? I encountered the same problem.