Closed by we1k 8 months ago
Hi @we1k! First, we used `llama-2-7b-hf`, since Self-RAG is fine-tuned from it, so the improvement from fine-tuning can be compared directly.
Second, thanks for your correction. We previously did not change any part of the Self-RAG evaluation code, in order to stay consistent with Self-RAG.
Third, the instructions we used at first did not strictly force the model to generate `true` or `false`, so the performance was much lower. However, we improved the instructions in this released code, and you can reach better performance than the reported results, which have not been updated yet.
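For illustration, the snippet below sketches what such a constrained instruction could look like; the prompt wording and the `build_prompt` helper are hypothetical and not taken from the CRAG repository.

```python
# Hypothetical illustration only -- not the actual prompt used in the CRAG repo.
# The idea is to constrain the model so its answer can be matched exactly
# against the "true"/"false" labels expected by the evaluation script.

INSTRUCTION = (
    "Is the following claim true or false? "
    "Answer with exactly one word, all lowercase: 'true' or 'false'.\n\n"
    "Claim: {claim}\nAnswer:"
)

def build_prompt(claim: str) -> str:
    """Fill the instruction template with a PubHealth claim."""
    return INSTRUCTION.format(claim=claim)

print(build_prompt("Vitamin C cures the common cold."))
```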
Thanks for your response!
Hi, my inference on PubHealth only produced 18 results (true or false). That cannot be right, since there are 900+ cases in total, so there should be 900+ inference results. When I then ran the evaluation, the evaluation code threw an error, which I suspect is also caused by the insufficient number of inference results. How did you obtain 900+ inference results, and what might be causing my error? Thanks @HuskyInSalt @we1k
In my case, the preprocessed dataset only contained 10 entries (dataset/ref/incorrect). You can adjust this in the preprocessing script, and you also need to adjust `ndocs` in the .sh file. Working backwards from the output makes it fairly easy to find the problem.
Your error message already says it clearly: your bash command is missing the `--metric` argument. It is not caused by the amount of data; the program never even started running.
Hi, thanks for sharing your great work on RAG. However, I noticed a problem in the paper's experiments. The result reported on PubHealth for Llama2-7b is much lower than for any other model. Why is that? I believe this might be related to the evaluation code: https://github.com/HuskyInSalt/CRAG/blob/a2692c057b294c93837ac6e2c8d21e53e0f695b5/scripts/eval.py#L92C8-L97C56 When evaluating on the PubHealth dataset, it strictly requires the model to produce `true` or `false`, but Llama2 may just output `True` or `False`, leading to zero matches for most cases. After correcting this coding error, I reproduced the experiments in the same setting and got 0.643 exact-match accuracy for CRAG Llama2 on PubQA, which seems more reasonable. By the way, I am still a little uncertain which Llama model the paper is using. I suppose it should be `llama-2-7b-chat-hf`, not `llama-2-7b-hf`?
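For reference, a minimal sketch of the case-insensitivity fix described above is shown below; the function names are illustrative and the real `scripts/eval.py` may be structured differently.

```python
# Minimal sketch of the case-insensitivity fix described above; the actual
# scripts/eval.py may differ. Normalizing both prediction and gold label to
# lowercase prevents "True"/"False" outputs from being scored as zero matches.

def pubhealth_match(prediction: str, gold: str) -> bool:
    """Return True if the gold label ("true"/"false") appears in the
    prediction, ignoring case and surrounding whitespace."""
    return gold.strip().lower() in prediction.strip().lower()

def accuracy(predictions, golds):
    """Fraction of examples where the gold label is matched."""
    assert len(predictions) == len(golds)
    hits = sum(pubhealth_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)

# Example: "True." now counts as a match for the gold label "true".
print(accuracy(["True.", "false", "I think it is False"], ["true", "false", "false"]))
```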