Support/prioritize local prompt metrics

Purpose

This PR adds support for local prompt metrics using Jinja2 templates and the evaluate SDK's PromptMetric.from_template functionality. It also refactors the metric definitions so they have consistently defined aggregation functions.
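
As a rough illustration of what "consistently defined aggregation functions" could look like, here is a minimal sketch in which every metric exposes the same `aggregate` method. The class name `RatingMetric`, the `pass_threshold` field, and the output keys are hypothetical and are not taken from this PR or from the evaluate SDK.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RatingMetric:
    """Hypothetical metric definition for a 1-5 rating; not the PR's actual class."""

    name: str
    pass_threshold: int = 4  # assumed cutoff for counting a rating as a "pass"

    def aggregate(self, values):
        """Summarize per-question ratings, skipping missing or non-numeric values."""
        ratings = [
            v for v in values
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        if not ratings:
            return {"total": 0}
        passed = sum(r >= self.pass_threshold for r in ratings)
        return {
            "total": len(ratings),
            "mean_rating": round(mean(ratings), 2),
            "pass_rate": round(passed / len(ratings), 2),
        }


if __name__ == "__main__":
    metric = RatingMetric("groundedness")
    print(metric.aggregate([5, 4, 3, None]))
```

Because every metric returns a dictionary from the same method, a summary step can loop over built-in and local metrics without special-casing either kind.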

It makes the review tools more generic:

the summary tool will display any metric that has been seen more than twice
the diff tool will display all numeric metrics
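
The two selection rules above can be sketched as follows. This is only an approximation under stated assumptions: "seen more than twice" is read here as "appears in more than two runs", and the function names and input shapes are hypothetical rather than the review tools' real code.

```python
from collections import Counter
from numbers import Number


def is_numeric(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, Number) and not isinstance(value, bool)


def metrics_for_summary(runs):
    """Return metric names that appear in more than two runs.

    `runs` is assumed to be a list of dicts mapping metric name to its
    aggregated stats (hypothetical shape).
    """
    counts = Counter(name for run in runs for name in run)
    return sorted(name for name, seen in counts.items() if seen > 2)


def numeric_columns_for_diff(rows):
    """Return, in first-seen order, every key that holds a numeric value in any row."""
    columns = []
    for row in rows:
        for key, value in row.items():
            if is_numeric(value) and key not in columns:
                columns.append(key)
    return columns


if __name__ == "__main__":
    runs = [
        {"gpt_groundedness": {}, "my_custom": {}},
        {"gpt_groundedness": {}, "my_custom": {}},
        {"gpt_groundedness": {}},
    ]
    print(metrics_for_summary(runs))
    rows = [{"question": "q1", "gpt_groundedness": 5, "latency": 1.2, "has_citation": True}]
    print(numeric_columns_for_diff(rows))
```

Selecting columns from the data instead of from a hard-coded list is what lets custom metrics show up in the tools without further changes.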

Because the tools now display more metrics, expect more scrollbars in the review views.

Does this introduce a breaking change?

[ ] Yes
[X] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[X] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test

Test against built-in metrics and custom metrics
Try review tools
Make sure the Microsoft Learn (learn.microsoft.com) docs still work

Azure-Samples / ai-rag-chat-evaluator

Support/prioritize local prompt metrics #50
