chen700564 / RGB

Some issues trying Rejection Rate #8

Open alvelvis opened 9 months ago

alvelvis commented 9 months ago

Hi! Congratulations on the amazing work you've done.

I'm trying to use RGB to test my RAG pipeline, but I'm having the following issue with Rejection Rate: even if I feed the model only the negative examples, it still sometimes answers correctly, because it turns out that some of the negative examples do in fact contain the correct answer.

For example:

Question: How much did Elon Musk bought Twitter?
Correct answer (according to RGB): [44 billion]
My model's answer: Elon Musk bought Twitter at his original offer price of $54.20 a share, with a total cost of roughly $44 billion.

The model was expected to refuse to answer this, since I only gave it negative examples, right? But in fact, a few of the negative documents contain this answer.

Is this expected behavior, or am I doing something wrong?

Thanks in advance.

chen700564 commented 9 months ago

Hi, this is a mistake in our dataset. We identify noise documents by checking whether they contain the answer. In this case, the answer that appears in the document is "$40 bn", but the annotated golden answer is "44 billion", so our filter rule treats this document as a noise document. Thank you for bringing this to our attention. We will fix bugs like this.
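To illustrate the failure mode described above, here is a minimal sketch of a verbatim substring filter. It is not the actual RGB filtering code; the function name `contains_answer` and the abbreviated document string are hypothetical and only show how a differently formatted amount can slip past an exact-match check.

```python
def contains_answer(document: str, answers: list[str]) -> bool:
    """Return True if any golden answer appears verbatim (case-insensitive)."""
    doc = document.lower()
    return any(answer.lower() in doc for answer in answers)


golden = ["44 billion"]

# Text that spells the amount the same way as the golden answer is caught.
spelled_out = "with a total cost of roughly $44 billion."
print(contains_answer(spelled_out, golden))

# Hypothetical document that abbreviates the amount: the substring check
# misses it, so a filter like this would label the document as noise.
abbreviated = "The takeover was valued at $44bn."
print(contains_answer(abbreviated, golden))
```

A filter of this kind is cheap, but it only detects answers written exactly like the annotation, which is consistent with answer-bearing documents ending up among the negatives.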

qianzhang2018 commented 9 months ago

Thank you for your excellent work. I agree with this report about the data: I found a very large number of negative examples that contain the correct answer, and I would like to fix the dataset.

qianzhang2018 commented 9 months ago

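For example, the negative examples for the first case in zh.json include: "Twelve dead in New Year's Day stampede at Indian temple. [Hong Kong China News Agency, January 1] A stampede occurred in the early hours of January 1 at a temple in Indian-controlled Kashmir, leaving twelve people dead and twenty injured. Quarrel and shoving lead to tragedy. According to media reports on January 1, the stampede occurred at about 2:45 a.m. at a temple near Jammu, the winter capital of Indian-controlled Kashmir. At the time, large crowds of devotees who had come from all over were preparing for New Year worship. Because many of them had not been given permission to enter, they crowded outside the temple and began shoving, which eventually caused the stampede." The correct answer can be found in this passage: the correct answer is 12 deaths.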

alvelvis commented 9 months ago

@chen700564 thanks a lot for the answer.

I'm trying a few tweaks in order to test negative rejection in my RAG pipeline. The best approach I've found is to not feed the language model any of the documents related to the question, whether positive or negative. Instead, I feed it only documents from the benchmark that belong to other questions. This way I'd really expect the model not to hallucinate and to say it doesn't know the correct answer, and that is the rejection rate I'm evaluating. There is still one flaw in this approach: some of the questions are similar, so even if I don't feed the model the documents for a given question, it may receive documents from a similar question that contain the correct answer. But I think I'll stick with this approach.
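The approach above could be sketched roughly as follows. This is an illustration under assumptions, not code from RGB: the function `build_rejection_context` is hypothetical, and each record is assumed to be a dict with `id`, `answer`, `positive`, and `negative` keys. The extra answer check is one possible way to reduce the similar-question flaw, although it only catches verbatim matches.

```python
import random


def build_rejection_context(records, target_id, k=5, seed=0):
    """Sample k documents that belong to questions other than target_id.

    Documents containing any golden answer of the target question verbatim
    are skipped, so similar questions are less likely to leak the answer.
    """
    target = next(r for r in records if r["id"] == target_id)
    answers = [a.lower() for a in target["answer"]]

    pool = []
    for record in records:
        if record["id"] == target_id:
            continue  # never use the target question's own documents
        for doc in record["positive"] + record["negative"]:
            if not any(a in doc.lower() for a in answers):
                pool.append(doc)

    rng = random.Random(seed)  # fixed seed keeps evaluation runs repeatable
    return rng.sample(pool, min(k, len(pool)))


# Hypothetical toy records, for illustration only.
records = [
    {"id": 0, "answer": ["44 billion"],
     "positive": ["The deal cost 44 billion."], "negative": ["Other news."]},
    {"id": 1, "answer": ["Paris"],
     "positive": ["Paris is the capital of France."],
     "negative": ["A 44 billion takeover.", "Weather report."]},
]
print(build_rejection_context(records, target_id=0, k=2))
```

Paraphrased or abbreviated answers would still pass this check, so the resulting rejection rate is best read as an approximation.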

If there is any news on this, please let me know :) Best