Azure-Samples / ai-rag-chat-evaluator

Tools for evaluation of RAG Chat Apps using Azure AI Evaluate SDK and OpenAI
MIT License

Handle failures with numeric ratings #53

Closed by pamelafox 7 months ago

pamelafox commented 7 months ago

Purpose

The GPT metrics occasionally return "Failed" or some other response that isn't a number from 1-5. This PR adjusts the aggregate statistics logic to ignore those responses, which effectively treats them as non-passing. Note that mean() reflects only the responses that were actually scored, so it can look healthy even when many evaluations failed; always read it alongside the pass rate.
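To illustrate the behavior described above, here is a minimal sketch rather than the code in this PR. The names `to_rating`, `summarize`, and `PASS_THRESHOLD` are hypothetical, and the pass threshold of 4 is an assumption made for the example.

```python
from statistics import mean

# Assumption for this sketch: a rating of 4 or higher counts as passing.
PASS_THRESHOLD = 4


def to_rating(value):
    """Return the rating as a float if it is a number from 1 to 5, else None."""
    try:
        rating = float(value)
    except (TypeError, ValueError):
        # Covers responses such as "Failed" or a missing value.
        return None
    return rating if 1 <= rating <= 5 else None


def summarize(raw_ratings):
    """Aggregate ratings, ignoring any response that is not a valid rating."""
    valid = [r for r in (to_rating(v) for v in raw_ratings) if r is not None]
    total = len(raw_ratings)
    pass_count = sum(1 for r in valid if r >= PASS_THRESHOLD)
    return {
        "total": total,
        "scored": len(valid),
        "pass_count": pass_count,
        # Failures stay in the denominator, so they count as non-passing.
        "pass_rate": round(pass_count / total, 2) if total else 0.0,
        # The mean covers only the responses that were scored.
        "mean_rating": round(mean(valid), 2) if valid else None,
    }
```

For example, `summarize(["5", "4", "Failed", "2", 5])` scores four of the five responses. The mean of those four is 4.0, while the pass rate is 0.6 because the failed response still counts against it.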

PR also includes an example run with some failures.

Fixes #10

Does this introduce a breaking change?

[ ] Yes
[X] No