This PR adds support for local prompt metrics using Jinja2 templates and the evaluate SDK's PromptMetric.from_template functionality. It also refactors the metric definitions so they have consistently defined aggregation functions.
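As a rough illustration of the refactor, each metric definition could carry its own aggregation function. This is a minimal sketch and not the PR's code: `MetricDefinition`, `aggregate_all`, and the example metric names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class MetricDefinition:
    """Hypothetical metric definition that always declares how to aggregate."""
    name: str
    aggregate: Callable[[Sequence[float]], float]


# Illustrative metrics only; the real definitions live in the PR.
METRICS = [
    MetricDefinition("answer_length", mean),
    MetricDefinition("citation_matched", lambda values: sum(values) / len(values)),
    MetricDefinition("latency", max),
]


def aggregate_all(rows: list[dict[str, float]]) -> dict[str, float]:
    """Apply each metric's own aggregation to the values found in rows."""
    summary = {}
    for metric in METRICS:
        # Skip rows that did not record this metric.
        values = [row[metric.name] for row in rows if metric.name in row]
        if values:
            summary[metric.name] = metric.aggregate(values)
    return summary
```

Keeping the aggregation next to the metric name means a tool that consumes the results does not need a separate per-metric lookup.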
It also makes the review tools more generic:
The summary tool displays any metric that has been seen more than twice.
The diff tool displays all numeric metrics.
The trade-off is that the tools now show more scrollbars.
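The two selection rules above can be sketched as follows. This is an assumption-laden illustration, not the tools' actual code: the function names are hypothetical, "seen more than twice" is read as "present in more than two result rows", and booleans are treated as non-numeric.

```python
from collections import Counter


def is_numeric(value):
    """Treat ints and floats as numeric; exclude bools (an assumption)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def summary_metrics(rows):
    """Return metric names that appear in more than two rows."""
    counts = Counter(name for row in rows for name in row)
    return sorted(name for name, seen in counts.items() if seen > 2)


def diff_metrics(rows):
    """Return metric names whose recorded values are all numeric."""
    values_by_name = {}
    for row in rows:
        for name, value in row.items():
            values_by_name.setdefault(name, []).append(value)
    return sorted(
        name
        for name, values in values_by_name.items()
        if all(is_numeric(v) for v in values)
    )
```

Because neither function hard-codes a metric list, a newly added prompt metric would show up in both tools without further changes, which is also why the number of displayed columns can grow.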
Does this introduce a breaking change?
[ ] Yes
[X] No
Pull Request Type
What kind of change does this Pull Request introduce?
[ ] Bugfix
[X] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:
How to Test