https://eugeneyan.com/writing/llm-evaluators/ #85

Open utterances-bot opened 3 weeks ago

utterances-bot commented 3 weeks ago

Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)

Use cases, techniques, alignment, finetuning, and critiques against LLM-evaluators.

https://eugeneyan.com/writing/llm-evaluators/

snbafana commented 3 weeks ago

Thanks for the great article. Also, do you think an LLM as a judge or evaluator can be applied to sentiment scale classification? For example, if you have a Likert scale of 1-5 from completely disagree to completely agree and have an LLM classify a user response on that scale, do you think this method could yield accurate results?

sher-badshah4672 commented 3 weeks ago

Thank you for the detailed guide on utilizing LLMs as evaluators. In most cases, LLM evaluators are used for either pairwise comparisons or single-answer scoring. For example, they are often employed to evaluate summarization tasks against predefined criteria (single-answer scoring), or to compare responses from two assistant models (pairwise comparison), which can also be used to build preference data for DPO (Direct Preference Optimization) on instruct models.

However, I’m curious about how the LLM-as-a-Judge method can be effectively utilized for evaluating factual tasks where no human-written reference answers exist. Factual tasks typically require assessments that are strictly True or False, which doesn’t fit neatly into pairwise or single-answer scoring methods. In such cases, do you have any insights on how to ensure that the LLM-as-a-Judge can reliably distinguish factual accuracy without a reference? Are there any strategies or additional steps that could be employed to enhance the reliability of the evaluations in these scenarios?

Additionally, I came across a paper utilizing the LLM-as-a-Judge approach in the presence of human-written reference answers: https://www.arxiv.org/pdf/2408.09235. However, excluding reference answers from the judge's input may lead to unreliable evaluations. For example, a model trained in 2022 might not be aware of the latest events, potentially compromising the accuracy of its judgments. How do you think we can address such issues to ensure the evaluations remain reliable, especially when the model's training data may be outdated?

eugeneyan commented 3 weeks ago

@snbafana

> Do you think an LLM as a judge or evaluator can be applied to sentiment scale classification? For example, if you have a Likert scale of 1-5 from completely disagree to completely agree...

If it's strictly sentiment and can be mapped to a binary or multi-class label, I think the regular classification metrics could work well. If it has to be on a Likert scale like you suggest, I'd question why it has to be on a scale in the first place instead of binary agree vs. disagree. Nonetheless, using an LLM-evaluator on a Likert scale is possible, though experiments show it doesn't work as well as returning binary labels.
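
To make that concrete, here's a minimal sketch of treating the LLM-evaluator as a binary classifier and then scoring it against human labels with standard classification metrics. The `call_llm` function is a hypothetical placeholder for whatever chat-completion API you use, and the prompt wording is just illustrative.

```python
from sklearn.metrics import classification_report


def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to your LLM provider and return its text reply."""
    raise NotImplementedError("wire this up to your LLM API")


PROMPT = """Statement: {statement}
User response: {response}

Does the user agree or disagree with the statement?
Answer with a single word: agree or disagree."""


def classify(statement: str, response: str) -> str:
    """Ask the LLM for a binary agree/disagree label."""
    answer = call_llm(PROMPT.format(statement=statement, response=response)).lower()
    return "disagree" if "disagree" in answer else "agree"


def evaluate(records: list[tuple[str, str, str]]) -> None:
    """Score the LLM-evaluator against human labels like any other classifier.

    records: (statement, user_response, human_label) triples.
    """
    y_true = [label for _, _, label in records]
    y_pred = [classify(s, r) for s, r, _ in records]
    print(classification_report(y_true, y_pred, labels=["agree", "disagree"]))
```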

eugeneyan commented 3 weeks ago

@sher-badshah4672

> Do you have any insights on how to ensure that the LLM-as-a-Judge can reliably distinguish factual accuracy without a reference? Are there any strategies or additional steps that could be employed to enhance the reliability of the evaluations in these scenarios?

Yes, several of the use cases and techniques covered were reference-free. I think CoT and few-shot prompts can help a lot with reliability. Also see other prompting techniques here.
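
As a rough illustration of combining CoT with a few-shot prompt for a reference-free factuality check, here's a hypothetical sketch. `call_llm` is again a placeholder for your LLM API, and the verdict parsing is deliberately simple.

```python
def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to your LLM provider and return its text reply."""
    raise NotImplementedError("wire this up to your LLM API")


JUDGE_PROMPT = """Evaluate whether the response to the question is factually accurate.
Think step by step, then give a final verdict of True or False.

Example 1
Question: When did Apollo 11 land on the Moon?
Response: Apollo 11 landed on the Moon in 1969.
Reasoning: Apollo 11 landed on July 20, 1969, so the response is accurate.
Verdict: True

Example 2
Question: What is the capital of Australia?
Response: The capital of Australia is Sydney.
Reasoning: The capital of Australia is Canberra, not Sydney, so the response is inaccurate.
Verdict: False

Question: {question}
Response: {response}
Reasoning:"""


def judge_factuality(question: str, response: str) -> bool:
    """Return True if the judge's final verdict line says True, else False."""
    output = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    verdicts = [line for line in output.splitlines() if line.lower().startswith("verdict")]
    return bool(verdicts) and "true" in verdicts[-1].lower()
```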

> For example, a model trained in 2022 might not be aware of the latest events, potentially compromising the accuracy of its judgment. How do you think we can address such issues to ensure the evaluations remain reliable, especially when the model's training data may be outdated?

If the data is fast-moving, such as when judgments depend on recent events, good search and retrieval can help. That then becomes a problem of evaluating search and information retrieval.
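
A rough sketch of what that might look like: retrieve evidence first, then ask the judge to verify the response against it rather than against its (possibly stale) parametric knowledge. `search` and `call_llm` are hypothetical placeholders for your retrieval system and LLM API; the judge is then only as reliable as the retrieval, which is why measuring the retriever itself becomes part of the evaluation.

```python
def search(query: str, k: int = 5) -> list[str]:
    """Placeholder: return the top-k passages from your search/retrieval system."""
    raise NotImplementedError("wire this up to your retriever")


def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to your LLM provider and return its text reply."""
    raise NotImplementedError("wire this up to your LLM API")


GROUNDED_JUDGE_PROMPT = """Using only the evidence passages below, evaluate whether the
response to the question is factually accurate. If the evidence is insufficient,
answer Unsure.

Evidence:
{evidence}

Question: {question}
Response: {response}

Verdict (True / False / Unsure):"""


def judge_with_retrieval(question: str, response: str) -> str:
    """Ground the judge in retrieved passages before asking for a verdict."""
    passages = search(question)
    evidence = "\n".join(f"- {p}" for p in passages)
    prompt = GROUNDED_JUDGE_PROMPT.format(
        evidence=evidence, question=question, response=response
    )
    return call_llm(prompt).strip()
```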