Open · athewsey opened 10 months ago
Thanks @athewsey for the feedback.

We recently added `recall` over tokens as an evaluation metric (https://github.com/aws/fmeval/pull/157). `recall` should not penalize verbose generation as harshly as `f1_score`.
On LLM-based metrics: we are looking into adding some along the lines of what you suggested. Though a bit different from the metrics you proposed, one could also include `bert_score` from the `SummarizationAccuracy` evaluation in the `QAAccuracy` evaluation.
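For a single sample, that combination would look roughly like this (a minimal sketch; the `evaluate_sample` signatures and `EvalScore` fields shown are from a quick read of the current code, so please double-check against `main`):

```python
# Minimal sketch: compute the token-based QA metrics and bert_score side by side
# for one verbose answer against a concise SQuAD-style reference.
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy, QAAccuracyConfig
from fmeval.eval_algorithms.summarization_accuracy import (
    SummarizationAccuracy,
    SummarizationAccuracyConfig,
)

target = "Saint Bernadette Soubirous"
generated = (
    "According to the passage, the Virgin Mary allegedly appeared to "
    "Saint Bernadette Soubirous in 1858 in Lourdes, France."
)

qa_scores = QAAccuracy(QAAccuracyConfig()).evaluate_sample(
    target_output=target, model_output=generated
)
bert_scores = SummarizationAccuracy(SummarizationAccuracyConfig()).evaluate_sample(
    target_output=target, model_output=generated
)

# Each call returns a list of EvalScore(name=..., value=...); recall over tokens
# should stay high for the verbose answer even where f1_score is dragged down.
for score in qa_scores + bert_scores:
    print(score.name, score.value)
```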
Thanks for the update! I'd be a bit concerned that neither alternative token-based metrics nor similarity-based LM ones like `bert_score` can fully separate correctness from style/tone/other factors. It seems like even a supposedly semantic similarity score would be biased by differences in tone, framing, or level of detail in the answer.
IMO for many use cases it would be useful to separate evaluating a system for factual accuracy (trustworthiness / hallucination) from other factors that are still important but more about whether it actually enables productivity gains: e.g. is it too verbose, does it correctly cite references, etc.
LLM-based critique provides a natural way to formulate this multi-axis validation, by telling the critic LLM in natural language which specific aspects to assess. Of course it's fair that there'd be concerns about self-critique metrics being biased, but I haven't seen any research yet that quantifies those concerns & gives a strong steer to avoid that kind of method... If anybody's aware of any, I'd love to read it!
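To make the multi-axis idea concrete, the kind of critic prompt I'm imagining is roughly the following (purely a sketch; the axes, wording, and JSON schema are just placeholders, and any `invoke_model`-style client could sit behind it):

```python
# Sketch of a multi-axis LLM-critic prompt: factual agreement is scored separately
# from verbosity and citation quality, so one axis doesn't contaminate the others.
# The axes, scale, and JSON schema here are hypothetical.
import json

CRITIC_PROMPT = """You are grading a question-answering system.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Score each axis independently from 1 (worst) to 5 (best) and reply with JSON only:
{{
  "factual_agreement": <do the candidate and reference agree on the facts?>,
  "conciseness": <is the candidate free of unnecessary verbosity?>,
  "citation_quality": <are any cited references present and relevant?>
}}"""


def build_critic_prompt(question: str, reference: str, candidate: str) -> str:
    """Fill the critic template for one QA sample."""
    return CRITIC_PROMPT.format(question=question, reference=reference, candidate=candidate)


def parse_critic_scores(raw_reply: str) -> dict:
    """Parse the critic's JSON reply; real code would need to handle malformed output."""
    return json.loads(raw_reply)
```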
The metrics-based approaches in the `QAAccuracy` eval algorithm seem to harshly penalize verbose models (like Claude) on datasets with concise reference answers (like SQuAD).

It'd be useful if this library could provide support for LLM-based evaluation of LLM results: for example, asking a model whether the reference answer and the generated answer agree or disagree. I'd imagine it working something along the lines of LlamaIndex's `CorrectnessEvaluator`?

As I understand it, it should be possible in theory to implement something like this by building a custom `EvalAlgorithmInterface`-based class (rough sketch at the end of this post), but there are a lot of design questions to consider, like:

- Should `QAAccuracyByLLMCritic` be a subtype of some broader class? Certainly it'd be interesting to use LLMs to judge other aspects like relevancy, and specific aspects of tone (e.g. "did it discuss my competitor companies XYZ").
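For illustration, the sample-level piece could be as simple as something like this (very rough sketch: the `EvalScore` / `ModelRunner` imports and the `predict()` return shape are from my reading of the code and may be off, and the prompt and score name are made up):

```python
# Very rough sketch of a sample-level LLM-critic QA accuracy check. Assumes
# fmeval's ModelRunner.predict(prompt) returns (output_text, log_probability)
# and that EvalScore(name, value) is importable as shown; both may need adjusting.
from typing import List, Optional

from fmeval.eval_algorithms import EvalScore
from fmeval.model_runners.model_runner import ModelRunner

JUDGE_TEMPLATE = (
    "Do the reference answer and the candidate answer agree on the facts?\n"
    "Reference: {reference}\n"
    "Candidate: {candidate}\n"
    "Reply with a single digit: 1 if they agree, 0 if they disagree."
)


class QAAccuracyByLLMCritic:
    """Scores factual agreement between target and generated answers via a judge model."""

    def __init__(self, judge: ModelRunner):
        self._judge = judge

    def evaluate_sample(self, target_output: str, model_output: str) -> List[EvalScore]:
        prompt = JUDGE_TEMPLATE.format(reference=target_output, candidate=model_output)
        judge_reply, _ = self._judge.predict(prompt)
        return [EvalScore(name="llm_critic_agreement", value=self._parse(judge_reply))]

    @staticmethod
    def _parse(reply: Optional[str]) -> float:
        # Defensive parsing: anything other than a clear "1" counts as disagreement.
        return 1.0 if reply is not None and reply.strip().startswith("1") else 0.0
```

Whether that should plug into `EvalAlgorithmInterface` directly, and how the dataset-level `evaluate()` / aggregation side should work, is exactly the kind of design question I mean.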