WeOpenML / PandaLM


Conflict Rate is high #18

Closed · steventan0110 closed this 1 year ago

steventan0110 commented 1 year ago

Hi,

Thanks for releasing the model. I recently had findings similar to this paper (https://arxiv.org/pdf/2305.17926.pdf) when using an LLM as an evaluator: when I flip the order of the two responses, the model tends to reach a completely different decision. I tried PandaLM-7B on my test set and the conflict rate is about 50%. I also tried PandaLM-7B on the test data you released (the testset-v1 dataset) and I get accuracy similar to what you reported. However, the conflict rate is also high, about 17%.
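
For reference, here is a rough sketch of how I measure the conflict rate. The `judge(instruction, response_a, response_b)` callable below is just a placeholder wrapper around the evaluator (PandaLM-7B in my case) that returns `"1"`, `"2"`, or `"Tie"`; it is not the actual PandaLM API.

```python
def conflict_rate(examples, judge):
    """Fraction of examples where swapping the response order flips the verdict."""
    conflicts = 0
    for ex in examples:
        verdict_ab = judge(ex["instruction"], ex["response_a"], ex["response_b"])
        verdict_ba = judge(ex["instruction"], ex["response_b"], ex["response_a"])
        # A consistent judge should mirror its verdict when the order is swapped:
        # "1" <-> "2", and "Tie" stays "Tie".
        expected_swapped = {"1": "2", "2": "1", "Tie": "Tie"}[verdict_ab]
        if verdict_ba != expected_swapped:
            conflicts += 1
    return conflicts / len(examples)
```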

Just curious if you have any thoughts on this phenomenon? Maybe something to address in future work?

qianlanwyd commented 1 year ago

We have released our paper (https://arxiv.org/abs/2306.05087) and conducted new experiments (swapping the order and treating conflicting results as `Tie') to deal with the conflicting results. But as we observed, GPT-4 and GPT-3.5 also have this issue. Our future work will try to address it.
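
Concretely, the mitigation amounts to something like the sketch below: evaluate both orderings and fall back to `Tie` when they disagree. The `judge` callable is only a placeholder for the evaluator here, not our released API.

```python
def judge_order_invariant(instruction, response_a, response_b, judge):
    """Return "1", "2", or "Tie", treating order-dependent conflicts as "Tie"."""
    verdict_ab = judge(instruction, response_a, response_b)
    verdict_ba = judge(instruction, response_b, response_a)
    # Map the swapped-order verdict back into the original ordering.
    unswapped = {"1": "2", "2": "1", "Tie": "Tie"}[verdict_ba]
    return verdict_ab if verdict_ab == unswapped else "Tie"
```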

steventan0110 commented 1 year ago

Thanks for the prompt response. I observed this issue with GPT-3.5 as well. Good luck with your future work on this phenomenon!