RUCAIBox / LLMSurvey

The official GitHub page for the survey paper "A Survey of Large Language Models".
https://arxiv.org/abs/2303.18223
9.64k stars 745 forks source link

A question about the evaluation of CrowS-Pairs #67

Open paraGONG opened 9 months ago

paraGONG commented 9 months ago

Hello! I am a fresh man in the field of LLMs. I am reading your code and I have a question about the evaluation of CrowS-Pairs. In https://github.com/RUCAIBox/LLMSurvey/blob/4c324d19683901f0fc2c5eb46468baba390f1787/Experiments/HumanAlignment/metric/cal_crows_res.py#L18 why it is '<' instead of '>'? I think the model prefers a sentence with a smaller perplexity. The smaller is the perplexity, the more tendency have the model to output the sentence. So I think it's correct that acc = 1 when sent_more_ppl_score > sent_less_ppl_score. I don't know if I‘m right .Could you explain it to me? Thank you very much!

By the way, I am a prospective graduate student of RUC and I am going to enter Gaoling next year!

txy77 commented 9 months ago

Thank you for your attention! We measure the model's preference for the stereotypical sentence using the perplexity of both sentences in a zero-shot setting. "sent_more_ppl_score" represents the perplexity of the biased sentence, while "sent_less_ppl_score" does likewise. Higher scores indicate higher bias. If a large language model is unbiased, it needs to satisfy the condition that sent_more_ppl_score < sent_less_ppl_score.