steventan0110 closed this issue 1 year ago
We have released our paper (https://arxiv.org/abs/2306.05087) and conducted new experiments (swap the order and treat conflicting results as "Tie") to deal with the conflicting results. But as we observed, GPT-4 and GPT-3.5 also have this issue. Our future work will try to address it.
Thanks for the prompt response; I observed this issue with GPT-3.5 as well. Good luck with your future work on this phenomenon!
Hi,
Thanks for releasing the model. I recently had findings similar to this paper (https://arxiv.org/pdf/2305.17926.pdf) when using an LLM as an evaluator: when I flip the order of the two responses, the model tends to produce a completely different decision. I tried PandaLM-7B on my test set, and the conflict rate is about 50%. I also tried PandaLM-7B on the test data you released (the `testset-v1` dataset) and got similar accuracy to what you reported. However, the conflict rate is also high, about 17%. Just curious if you have any thoughts on this phenomenon? Maybe something for future work?
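For reference, here is a minimal sketch of how such a conflict rate can be measured. The `judge` function is a hypothetical stand-in for any pairwise LLM evaluator (PandaLM, GPT-4, etc.): it takes an instruction and two responses and returns `1`, `2`, or `0` (tie). A consistent judge should pick the same underlying response after the order is swapped.

```python
def conflict_rate(judge, examples):
    """Fraction of examples where swapping the response order
    flips the judge's decision."""
    conflicts = 0
    for instruction, resp_a, resp_b in examples:
        first = judge(instruction, resp_a, resp_b)
        second = judge(instruction, resp_b, resp_a)
        # In the swapped order, a verdict of 1 refers to resp_b and
        # 2 refers to resp_a, so map it back before comparing.
        swapped = {1: 2, 2: 1, 0: 0}[second]
        if first != swapped:
            conflicts += 1
    return conflicts / len(examples)

# Toy judge that always prefers whichever response is shown first --
# maximally position-biased, so every comparison conflicts.
biased_judge = lambda _instr, a, b: 1

examples = [("q1", "A", "B"), ("q2", "C", "D")]
print(conflict_rate(biased_judge, examples))  # -> 1.0
```

Treating the two orderings as "Tie" when they disagree (as described above) is one way to fold this metric back into the evaluation itself.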