One of the difficulties with LLM-as-judge evaluations is trusting their determinations. It is therefore important to select an evaluator LLM (distinct from the LLMs being evaluated) whose evaluations are least likely to be hallucinated or wrong.
This will involve running evaluations on the datasets and metrics we've selected using multiple evaluator LLMs, then making a determination based on those results. The determination will rely on "vibes": manual review of the evaluation scores and reasoning.
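One way to support that manual review is to quantify how much candidate evaluator LLMs disagree with each other on the same items. The sketch below is illustrative only: the evaluator names and scores are made up, and a simple mean absolute difference is used as the disagreement measure, which is an assumption rather than anything specified here.

```python
# Hypothetical per-item scores (1-5 scale) from three candidate evaluator
# LLMs on the same evaluation set; names and values are illustrative only.
scores = {
    "evaluator_a": [5, 4, 2, 5, 3],
    "evaluator_b": [5, 4, 1, 5, 3],
    "evaluator_c": [3, 2, 4, 1, 5],
}

def mean_abs_disagreement(a, b):
    """Average absolute score difference between two evaluators' scores."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Compare every pair; an evaluator that diverges sharply from all the
# others is a candidate for closer manual review of its reasoning.
names = list(scores)
for i, n1 in enumerate(names):
    for n2 in names[i + 1:]:
        d = mean_abs_disagreement(scores[n1], scores[n2])
        print(f"{n1} vs {n2}: {d:.2f}")
```

Pairwise numbers like these don't replace reading the evaluators' reasoning, but they help prioritize which score disagreements to review first.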
Completion Criteria
[x] Compare and contrast LLMs as potential evaluator choices for the MVP
[x] Research the feasibility of utilizing a jury of LLMs instead of a single judge (paper)
[x] Document findings and selection (potential contribution piece to an ADR)
For a first pass, we will use Claude 3.5 Sonnet as the LLM judge. LLM juries will be explored in the future, but a single judge will be used for initial results.
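For the future jury direction, the aggregation step can be sketched simply. The version below is a minimal illustration, not the scheme from the referenced paper: it assumes each jury member emits a pass/fail verdict and takes a majority vote, with ties resolved as fail (a conservative choice made here for illustration).

```python
from collections import Counter

def jury_verdict(votes):
    """Majority vote over per-judge pass/fail verdicts.

    Ties count as a fail -- a conservative assumption for this sketch,
    not a rule taken from the jury paper.
    """
    counts = Counter(votes)
    return counts["pass"] > counts["fail"]

# Illustrative verdicts from three hypothetical jury members:
print(jury_verdict(["pass", "pass", "fail"]))  # majority of judges pass
```

A jury could equally aggregate numeric scores (e.g., averaging) rather than discrete verdicts; that choice would be part of the future exploration.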