IAAR-Shanghai / CRUD_RAG

CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models
https://arxiv.org/abs/2401.17043

How to fill data_path, docs_path, collection_name? #10

Closed aixiaodewugege closed 4 months ago

aixiaodewugege commented 4 months ago

Thanks for your brilliant work. Could you please share how to fill in these parameters, and what they mean?

```
--data_path 'path/to/dataset' \
--docs_path 'path/to/retrieval_database' \
--collection_name 'name/of/retrieval_database' \
```

haruhi-sudo commented 4 months ago

collection_name can be anything you like; --data_path can be set to data/crud_split/split_merged.json, and --doc_path to data/80000_docs.
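
To make the reply above concrete, here is a minimal sketch of the full command. The entry-script name (quick_start.py) and the collection name are placeholders, not confirmed in this thread, so check the repo README for the exact script and flags; the two paths are the values suggested in the reply, and --docs_path is spelled as in the question's template (abbreviated to --doc_path in the reply).

```bash
# Sketch only: the script name and collection name are placeholders;
# the two paths are the values suggested in the reply above.
python quick_start.py \
  --data_path 'data/crud_split/split_merged.json' \
  --docs_path 'data/80000_docs' \
  --collection_name 'my_crud_collection'
```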

aixiaodewugege commented 4 months ago

Thanks for your reply! I noticed that you only ran part 1 of the data in your paper. How should I set --data_path and --doc_path to reproduce your result?

Besides, will the warnings below affect the result?

```
LLM is explicitly disabled. Using MockLLM.
  0%| | 0/2000 [00:00<?, ?it/s]2024-07-01 08:06:20.545 | WARNING | evaluator:task_generation:50 - IndexError('list index out of range')
  0%| | 1/2000 [00:04<2:43:46, 4.92s/it]
```

aixiaodewugege commented 4 months ago

> collection_name can be anything you like; --data_path can be set to data/crud_split/split_merged.json, and --doc_path to data/80000_docs.

The result I got is as follows:

```json
{
    "info": {
        "task": "Summary",
        "llm": "{'model_name': 'qwen7b', 'temperature': 0.1, 'max_new_tokens': 1280, 'top_p': 0.9, 'top_k': 5}"
    },
    "overall": {
        "avg. bleu-avg": 0.4899373964017888,
        "avg. bleu-1": 0.7711002085847525,
        "avg. bleu-2": 0.5501802624743161,
        "avg. bleu-3": 0.4339695716889377,
        "avg. bleu-4": 0.35391291064866337,
        "avg. rouge-L": 0.33885416082864706,
        "avg. length": 81.59703075291623,
        "num": 1886
    }
}
```

Is it correct?

haruhi-sudo commented 4 months ago

The format seems to be fine, but the average length of the generated results seems to be shorter than the results from the paper. What model did you use and what task did you evaluate?

aixiaodewugege commented 4 months ago

> The format seems to be fine, but the average length of the generated results seems to be shorter than the results from the paper. What model did you use and what task did you evaluate?

I used Qwen 7B on the summary task, and the score I got is much lower than in your paper, which reports: Qwen-7B 28.30 30.21 84.26 67.62 40.03 240.5

aixiaodewugege commented 4 months ago

Could these warnings, "evaluator:task_generation:50 - IndexError('list index out of range')", be the cause? I don't think it is a prompt problem, because the bleu score I got is 0.4899, while it is 28.3 in your paper. I am really confused by this.

haruhi-sudo commented 4 months ago

Can you send your experimental results to my email?

"evaluator:task_generation:50 - IndexError('list index out of range')"

I guess the retrieval database was not established, resulting in no retrieval results. All results were generated by models without retrieved documents.

haruhi-sudo commented 4 months ago

I have looked at the generated content, and it basically confirms my guess. The BLEU metric is unreliable when the generated content is too short. I will help you solve the problem by email.

haruhi-sudo commented 4 months ago

The question has been addressed by email. Please remember not to omit "--construct_index" when using the code for the first time, as you need to build a vector index.
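
Along the same lines, a hedged sketch of what a first run could look like with the index-construction flag included (the script name and collection name are again placeholders, not confirmed in this thread):

```bash
# First run: keep --construct_index so the vector index / retrieval database
# is actually built; skipping it led to the empty retrievals discussed above.
# The script name and collection name are placeholders.
python quick_start.py \
  --data_path 'data/crud_split/split_merged.json' \
  --docs_path 'data/80000_docs' \
  --collection_name 'my_crud_collection' \
  --construct_index
```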

Syno8 commented 4 months ago

> The format seems to be fine, but the average length of the generated results seems to be shorter than the results from the paper. What model did you use and what task did you evaluate?
>
> I used Qwen 7B on the summary task, and the score I got is much lower than in your paper, which reports: Qwen-7B 28.30 30.21 84.26 67.62 40.03 240.5

Has your problem been solved? We also found that our results are very low.

haruhi-sudo commented 4 months ago

He did not pass the --construct_index parameter, so the retrieval database was never built.