aws / fmeval

Foundation Model Evaluations Library
http://aws.github.io/fmeval
Apache License 2.0

[Feature] LLM-based (QA Accuracy) eval algorithm #163

Open athewsey opened 10 months ago

athewsey commented 10 months ago

The metrics-based approaches in the QAAccuracy eval algorithm seem to harshly penalize verbose models (like Claude) on datasets with concise reference answers (like SQuAD).

It'd be useful if this library could provide support for LLM-based evaluation of LLM results: for example, asking a model whether the reference answer and the generated answer agree or disagree. I'd imagine it working something along the lines of LlamaIndex's CorrectnessEvaluator?
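Something like the following is what I have in mind. This is a minimal sketch only: `invoke_judge` is a placeholder for whatever judge-model call (Bedrock, a SageMaker endpoint, etc.) would actually be used, and the prompt wording and `judge_correctness` helper are purely illustrative.

```python
# Minimal sketch: `invoke_judge` stands in for the judge-model call; the prompt
# wording and AGREE/DISAGREE contract are illustrative assumptions.
from typing import Callable

JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Does the candidate answer agree with the reference answer on the facts,
ignoring differences in length, tone, or phrasing?
Reply with a single word: AGREE or DISAGREE."""


def judge_correctness(
    question: str,
    reference: str,
    candidate: str,
    invoke_judge: Callable[[str], str],
) -> float:
    """Return 1.0 if the judge model says the answers agree, else 0.0."""
    verdict = invoke_judge(
        JUDGE_PROMPT.format(question=question, reference=reference, candidate=candidate)
    ).strip().upper()
    return 0.0 if "DISAGREE" in verdict else (1.0 if "AGREE" in verdict else 0.0)
```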

As I understand it, it should be possible in theory to implement something like this by building a custom EvalAlgorithmInterface-based class, but there are a lot of design questions to consider.
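For illustration, a rough sketch of what such a class might look like is below. The import paths, constructor handling, and the `evaluate_sample` signature are assumptions modelled on fmeval's built-in algorithms and may not match the installed version exactly; the dataset-level `evaluate` plumbing (where most of the design questions live) is omitted.

```python
# Rough sketch only: import paths and method signatures are assumptions based on
# fmeval's built-in algorithms and may differ between library versions.
from typing import Callable, List

from fmeval.eval_algorithms import EvalScore
from fmeval.eval_algorithms.eval_algorithm import EvalAlgorithmInterface


class LLMJudgedQAAccuracy(EvalAlgorithmInterface):
    """QA accuracy scored by a judge LLM instead of token-overlap metrics."""

    def __init__(self, invoke_judge: Callable[[str], str]):
        # Base-class config/registration details are intentionally glossed over.
        self._invoke_judge = invoke_judge

    def evaluate_sample(self, target_output: str, model_output: str) -> List[EvalScore]:
        prompt = (
            f"Reference answer: {target_output}\n"
            f"Candidate answer: {model_output}\n"
            "Do these answers agree on the facts, ignoring style and verbosity? "
            "Reply AGREE or DISAGREE."
        )
        verdict = self._invoke_judge(prompt).strip().upper()
        value = 0.0 if "DISAGREE" in verdict else (1.0 if "AGREE" in verdict else 0.0)
        return [EvalScore(name="llm_judged_correctness", value=value)]

    def evaluate(self, *args, **kwargs):
        # Dataset-level evaluation (dataset configs, batching, report output) is
        # where most of the open design questions sit; omitted in this sketch.
        raise NotImplementedError
```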

bilalaws commented 10 months ago

Thanks @athewsey for the feedback.

We recently added recall over tokens as an evaluation metric (https://github.com/aws/fmeval/pull/157). recall should not penalize verbose generation as harshly as f1_score.
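To illustrate the difference, here is a toy token-overlap calculation (not fmeval's exact implementation, which also applies its own text normalization) comparing a verbose answer against a concise reference:

```python
# Toy illustration: token-level recall vs. f1_score for a verbose answer scored
# against a concise reference answer.
import re
from collections import Counter


def overlap_metrics(reference: str, prediction: str):
    ref_tokens = re.findall(r"\w+", reference.lower())
    pred_tokens = re.findall(r"\w+", prediction.lower())
    common = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    precision = common / len(pred_tokens) if pred_tokens else 0.0
    recall = common / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


reference = "Paris"
verbose_answer = "The capital of France is Paris, a city of roughly two million people."
precision, recall, f1 = overlap_metrics(reference, verbose_answer)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# recall is 1.00 (the reference token appears in the answer), while f1 is dragged
# down by the low precision of the longer answer.
```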

On LLM-based metrics: we are looking into adding some along the lines of what you suggested. Though a bit different from the metrics you proposed, one could also include the bert_score metric from the SummarizationAccuracy evaluation in the QAAccuracy evaluation.
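As a standalone illustration of that idea, the open-source `bert-score` package can be called directly (fmeval computes its BERTScore inside the SummarizationAccuracy machinery, so this is not the library's internal code path):

```python
# Standalone illustration using the `bert-score` package directly.
from bert_score import score

candidates = ["The capital of France is Paris, a city of roughly two million people."]
references = ["Paris"]

# score() returns precision/recall/F1 tensors with one element per pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.item():.3f}")
```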

athewsey commented 9 months ago

Thanks for the update! I'd be a bit concerned that neither alternative token-based metrics nor similarity-based LM ones like bert_score fully disentangle correctness from style, tone, and other factors? It seems like even a supposedly-semantic similarity score would be biased by differences in tone, framing, or level of detail in the answer.

IMO, for many use-cases it would be useful to evaluate a system separately for factual accuracy (trustworthiness / hallucination) versus other factors that are still important but are more about whether it actually enables productivity gains: e.g. is it too verbose, does it cite references correctly, etc.

LLM-based critique provides a natural way to formulate this multi-axis validation, by telling the critic LLM in natural language which specific aspects to assess. Of course it's fair that there'd be concerns about when self-critique metrics might be biased, but I haven't seen any research yet that quantifies those concerns and gives a strong steer to avoid that kind of method... If anybody's aware of any, I'd love to read it!
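For concreteness, a sketch of what that multi-axis critique could look like: ask the judge for separate scores per aspect instead of a single similarity number. The aspect names, the 1-5 scale, and the JSON contract below are illustrative assumptions, and `invoke_judge` is again a placeholder for the judge-model call.

```python
# Sketch of multi-axis LLM critique: separate scores per aspect, not one number.
import json
from typing import Callable, Dict

CRITIQUE_PROMPT = """You are reviewing an answer from a question-answering system.
Reference answer: {reference}
Candidate answer: {candidate}

Rate the candidate from 1 (poor) to 5 (excellent) on each aspect and return JSON
with exactly these keys: "factual_accuracy", "conciseness", "citation_quality".
"""


def multi_axis_critique(
    reference: str,
    candidate: str,
    invoke_judge: Callable[[str], str],
) -> Dict[str, int]:
    raw = invoke_judge(CRITIQUE_PROMPT.format(reference=reference, candidate=candidate))
    return json.loads(raw)  # in practice, guard against malformed judge output
```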