Closed by we1k 8 months ago
Hi @we1k! First, we used `llama-2-7b-hf`, since Self-RAG is fine-tuned from it, so the improvement from fine-tuning can be compared directly.
Second, thanks for your correction. We previously did not change any part of the Self-RAG evaluation code, in order to stay consistent with Self-RAG.
Third, the instructions we used at first did not strictly force the model to generate `true` or `false`, so the performance was much lower. However, we improved the instructions in this released code, and you can reach better performance than the reported results, which have not been updated yet.
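For illustration, the snippet below sketches what such a constrained instruction could look like; the prompt wording and the `build_prompt` helper are hypothetical and not taken from the CRAG repository.

```python
# Hypothetical illustration only -- not the actual prompt used in the CRAG repo.
# The idea is to constrain the model so its answer can be matched exactly
# against the "true"/"false" labels expected by the evaluation script.

INSTRUCTION = (
    "Is the following claim true or false? "
    "Answer with exactly one word, all lowercase: 'true' or 'false'.\n\n"
    "Claim: {claim}\nAnswer:"
)

def build_prompt(claim: str) -> str:
    """Fill the instruction template with a PubHealth claim."""
    return INSTRUCTION.format(claim=claim)

print(build_prompt("Vitamin C cures the common cold."))
```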
Thanks for your response!
Hi, my inference on PubHealth only produced 18 results (true or false). That cannot be right, since there are 900+ cases in total, so there should be 900+ inference results. When I then ran the evaluation, the evaluation code threw an error, which I suspect is also caused by the insufficient number of inference results. How did you obtain 900+ inference results, and what might be causing my error? Thanks @HuskyInSalt @we1k
In my case, the preprocessed dataset only contained 10 entries (dataset/ref/incorrect). You can adjust this in the preprocessing script, and you also need to adjust `ndocs` in the .sh file. Working backwards from the output makes it fairly easy to find the problem.
Your error message already says it clearly: your bash command is missing the `--metric` argument. It is not caused by the amount of data; the program never even started running.
Hi, thanks for sharing your great work on RAG. However, I noticed a problem in the paper's experiments. The result reported on PubHealth for Llama2-7b is much lower than for any other model. Why is that? I believe this might be related to the evaluation code: https://github.com/HuskyInSalt/CRAG/blob/a2692c057b294c93837ac6e2c8d21e53e0f695b5/scripts/eval.py#L92C8-L97C56 When evaluating on the PubHealth dataset, it strictly requires the model to produce `true` or `false`, but Llama2 may just output `True` or `False`, leading to zero matches for most cases. After correcting this coding error, I reproduced the experiments in the same setting and got 0.643 exact-match accuracy for CRAG Llama2 on PubQA, which seems more reasonable. By the way, I am still a little uncertain which Llama model the paper is using. I suppose it should be `llama-2-7b-chat-hf`, not `llama-2-7b-hf`?
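For reference, a minimal sketch of the case-insensitivity fix described above is shown below; the function names are illustrative and the real `scripts/eval.py` may be structured differently.

```python
# Minimal sketch of the case-insensitivity fix described above; the actual
# scripts/eval.py may differ. Normalizing both prediction and gold label to
# lowercase prevents "True"/"False" outputs from being scored as zero matches.

def pubhealth_match(prediction: str, gold: str) -> bool:
    """Return True if the gold label ("true"/"false") appears in the
    prediction, ignoring case and surrounding whitespace."""
    return gold.strip().lower() in prediction.strip().lower()

def accuracy(predictions, golds):
    """Fraction of examples where the gold label is matched."""
    assert len(predictions) == len(golds)
    hits = sum(pubhealth_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)

# Example: "True." now counts as a match for the gold label "true".
print(accuracy(["True.", "false", "I think it is False"], ["true", "false", "false"]))
```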