One of the difficulties with LLM-as-judge evaluations is trusting their determinations. It is therefore important to select an evaluator LLM (distinct from the LLMs being evaluated) whose evaluations are least likely to be hallucinated or wrong.
This will involve running evaluations on the datasets and metrics we've selected using multiple evaluator LLMs, then making a determination based on those results. The determination will rely on "vibes": manual review of the evaluation scores and reasoning.
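One way to support that manual review is to quantify how much candidate evaluator LLMs disagree with each other on the same items. The sketch below is illustrative only: the evaluator names and scores are made up, and a simple mean absolute difference is used as the disagreement measure, which is an assumption rather than anything specified here.

```python
# Hypothetical per-item scores (1-5 scale) from three candidate evaluator
# LLMs on the same evaluation set; names and values are illustrative only.
scores = {
    "evaluator_a": [5, 4, 2, 5, 3],
    "evaluator_b": [5, 4, 1, 5, 3],
    "evaluator_c": [3, 2, 4, 1, 5],
}

def mean_abs_disagreement(a, b):
    """Average absolute score difference between two evaluators' scores."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Compare every pair; an evaluator that diverges sharply from all the
# others is a candidate for closer manual review of its reasoning.
names = list(scores)
for i, n1 in enumerate(names):
    for n2 in names[i + 1:]:
        d = mean_abs_disagreement(scores[n1], scores[n2])
        print(f"{n1} vs {n2}: {d:.2f}")
```

Pairwise numbers like these don't replace reading the evaluators' reasoning, but they help prioritize which score disagreements to review first.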
Completion Criteria
[x] Compare and contrast LLMs as potential evaluator choices for the MVP
[x] Research the feasibility of utilizing a jury of LLMs instead of a single judge (paper)
[x] Document findings and selection (potential contribution piece to an ADR)
For a first pass, we will use Claude 3.5 Sonnet as the LLM judge. LLM juries will be explored in the future, but a single judge will be used for initial results.
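For the future jury direction, the aggregation step can be sketched simply. The version below is a minimal illustration, not the scheme from the referenced paper: it assumes each jury member emits a pass/fail verdict and takes a majority vote, with ties resolved as fail (a conservative choice made here for illustration).

```python
from collections import Counter

def jury_verdict(votes):
    """Majority vote over per-judge pass/fail verdicts.

    Ties count as a fail -- a conservative assumption for this sketch,
    not a rule taken from the jury paper.
    """
    counts = Counter(votes)
    return counts["pass"] > counts["fail"]

# Illustrative verdicts from three hypothetical jury members:
print(jury_verdict(["pass", "pass", "fail"]))  # majority of judges pass
```

A jury could equally aggregate numeric scores (e.g., averaging) rather than discrete verdicts; that choice would be part of the future exploration.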