Purpose
The GPT metrics occasionally return "Failed" or some other response that isn't a number from 1 to 5. This PR adjusts the aggregate statistics logic to ignore those responses, which effectively counts them as non-passing. Note that mean() reflects only the responses that have a valid score, so it should always be read alongside the pass rate.
This PR also includes an example run with some failures.
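
As a rough illustration of the behavior described above, here is a minimal sketch and not the code in this PR. The function name `aggregate_scores`, the returned keys, and the pass threshold of 4 are all assumptions made for the example.

```python
from statistics import mean


def aggregate_scores(raw_results, pass_threshold=4):
    """Summarize GPT metric results, ignoring anything that is not a 1-5 score.

    `pass_threshold` is a hypothetical cutoff chosen for this sketch.
    """
    scores = []
    for value in raw_results:
        try:
            score = int(str(value).strip())
        except ValueError:
            # "Failed" or any other non-numeric response is skipped here.
            continue
        if 1 <= score <= 5:
            scores.append(score)

    total = len(raw_results)
    passing = sum(1 for s in scores if s >= pass_threshold)
    return {
        # The mean covers valid scores only, so it can look better than reality.
        "mean_rating": mean(scores) if scores else None,
        "pass_count": passing,
        # The pass rate divides by all results, so ignored rows count as non-passing.
        "pass_rate": passing / total if total else 0.0,
        "scored_count": len(scores),
        "total": total,
    }
```

With the input `[5, "4", "Failed", 3]`, this sketch reports a mean of 4 over the three valid scores but a pass rate of 0.5, because the "Failed" row still counts in the denominator.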
Fixes #10
Does this introduce a breaking change?