collection_name can be filled in as you like, --data_path can be filled in data/crud_split/split_merged.json, --doc_path can be filled in data/80000_docs
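Putting that reply into a concrete command line (a sketch only: "main.py" is a placeholder for the repo's actual entry script, which this thread does not name; the flag names and values come from the reply above):

# "main.py" is hypothetical; --collection_name can be any name you like,
# --data_path points at the evaluation split, --doc_path at the document corpus.
python main.py \
    --collection_name 'my_collection' \
    --data_path 'data/crud_split/split_merged.json' \
    --doc_path 'data/80000_docs'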
Thanks for your reply! I notice that you only ran part of the data in your paper. How should I fill in --data_path and --doc_path to reproduce your results?
Besides, will the warnings below affect the result?
LLM is explicitly disabled. Using MockLLM.
0%| | 0/2000 [00:00<?, ?it/s]
2024-07-01 08:06:20.545 | WARNING | evaluator:task_generation:50 - IndexError('list index out of range')
0%| | 1/2000 [00:04<2:43:46, 4.92s/it]
The result I got is as follows:
{ "info": { "task": "Summary", "llm": "{'model_name': 'qwen7b', 'temperature': 0.1, 'max_new_tokens': 1280, 'top_p': 0.9, 'top_k': 5}" }, "overall": { "avg. bleu-avg": 0.4899373964017888, "avg. bleu-1": 0.7711002085847525, "avg. bleu-2": 0.5501802624743161, "avg. bleu-3": 0.4339695716889377, "avg. bleu-4": 0.35391291064866337, "avg. rouge-L": 0.33885416082864706, "avg. length": 81.59703075291623, "num": 1886 }, }
Is it correct?
The format seems fine, but the average length of the generated results is shorter than in the paper. What model did you use, and which task did you evaluate?
I used Qwen-7B on the summary task, and the score I got is much lower than in your paper:
Qwen-7B 28.30 30.21 84.26 67.62 40.03 240.5
Could these "evaluator:task_generation:50 - IndexError('list index out of range')" warnings be the cause? I don't think it is a prompt problem, because the bleu score I got is 0.4899, while it is 28.3 in your paper. I am really confused by this.
Can you send your experimental results to my email?
"evaluator:task_generation:50 - IndexError('list index out of range')"
I guess the retrieval database was not established, resulting in no retrieval results: all answers were generated by the model without any retrieved documents.
I have looked at the generated content, and it basically confirms my guess. The bleu metric is unreliable when the generated content is too short. I will help you solve the problem by email.
The question has been addressed by email. Please remember to pass "--construct_index" when using the code for the first time, as you need to build the vector index.
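So the first run should include that flag so the vector index actually gets built; a sketch (again, "main.py" is a placeholder for the actual entry script; only the flags themselves are confirmed in this thread):

# First run: pass --construct_index to build the vector index over the corpus.
python main.py --construct_index \
    --collection_name 'my_collection' \
    --data_path 'data/crud_split/split_merged.json' \
    --doc_path 'data/80000_docs'

# Subsequent runs: drop --construct_index and reuse the existing index
# (presumably, based on the maintainer's comment above).
python main.py \
    --collection_name 'my_collection' \
    --data_path 'data/crud_split/split_merged.json' \
    --doc_path 'data/80000_docs'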
Has your problem been solved? We also found that our results are very low.
He didn't pass the --construct_index parameter, so the retrieval database was never built.
Thanks for your brilliant work. Could you please share how to fill in these parameters, and what each of them means?
--data_path 'path/to/dataset' \
--docs_path 'path/to/retrieval_database' \
--collection_name 'name/of/retrieval_database' \