Handle failures with numeric ratings

Purpose

Occasionally a GPT metric returns "Failed" or some other response that isn't a number from 1-5. This PR adjusts the aggregate statistics logic to ignore those responses, which effectively counts them as non-passing. Note that mean() reflects only the responses that did produce a numeric score, so it should always be read alongside the pass rate.
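As a rough illustration of the behavior described above (not the code in this PR), the sketch below skips non-numeric responses when computing the mean while still counting them in the pass-rate denominator. The names `to_rating`, `summarize`, and `PASS_THRESHOLD`, and the threshold value of 4, are assumptions made for this example.

```python
from statistics import mean

# Assumption for this sketch: ratings at or above this value count as passing.
PASS_THRESHOLD = 4


def to_rating(value):
    """Return a rating in the range 1-5 as a float, or None if the response is not a valid rating."""
    try:
        rating = float(str(value).strip())
    except ValueError:
        # Responses such as "Failed" are not numeric and are treated as missing.
        return None
    return rating if 1 <= rating <= 5 else None


def summarize(raw_ratings):
    """Aggregate ratings: failures lower the pass rate but are left out of the mean."""
    valid = [r for r in map(to_rating, raw_ratings) if r is not None]
    passed = sum(1 for r in valid if r >= PASS_THRESHOLD)
    total = len(raw_ratings)
    return {
        "pass_count": passed,
        # The denominator is every response, so failed responses count as non-passing.
        "pass_rate": round(passed / total, 2) if total else 0.0,
        # The mean covers only responses that produced a numeric score.
        "mean_rating": round(mean(valid), 2) if valid else None,
    }


example = summarize([5, "4", "Failed", 3, "N/A"])
```

Here two of the five responses are not numeric, so the mean is computed over three scores while the pass rate is computed over all five. That gap is why the mean alone can look better than the run actually was.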

This PR also includes an example run with some failures.

Fixes #10

Does this introduce a breaking change?

[ ] Yes
[X] No

Azure-Samples / ai-rag-chat-evaluator

Handle failures with numeric ratings #53