defenseunicorns / leapfrogai

Production-ready Generative AI for local, cloud native, airgap, and edge deployments.
https://leapfrog.ai
Apache License 2.0
250 stars 28 forks

(spike) Select an Evaluator LLM for LLM-as-judge Evals #719

Closed jalling97 closed 1 week ago

jalling97 commented 2 months ago

Description

One of the difficulties with using LLM-as-judge evaluations is trusting their determinations. It is therefore important to select an evaluator LLM (distinct from the LLMs being evaluated) whose judgments are least likely to be hallucinated or wrong.

This will involve running evaluations on the datasets and metrics we've selected using multiple evaluator LLMs, then making a determination based on those results. The determination will rely partly on "vibes": manually reviewing the evaluation scores and their accompanying reasoning.
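One way to make that comparison a little less "vibes"-driven is to score each candidate judge against a small human-labeled calibration set and measure agreement. The sketch below is purely illustrative: the judge names and canned verdicts are hypothetical stand-ins for what a real run would collect from each candidate model's API.

```python
from statistics import mean

# Hypothetical calibration set: human verdicts on whether each answer
# in a small sample was acceptable (1) or not (0).
human_labels = [1, 0, 1, 1, 0]

# Canned verdicts standing in for real responses from candidate judge
# models; in practice these would come from running each judge over the
# same evaluation dataset and metrics.
judge_verdicts = {
    "judge-a": [1, 0, 1, 1, 1],
    "judge-b": [1, 1, 0, 1, 0],
}

def agreement(verdicts, labels):
    """Fraction of examples where the judge matches the human label."""
    return mean(1 if v == l else 0 for v, l in zip(verdicts, labels))

# Rank candidate judges by agreement with the human labels.
scores = {name: agreement(v, human_labels) for name, v in judge_verdicts.items()}
best = max(scores, key=scores.get)
```

Manual review of the judges' written reasoning would still be needed on top of this, since two judges can agree on a verdict for very different (and differently trustworthy) reasons.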

Completion Criteria

jalling97 commented 3 weeks ago

For a first pass, we will use Claude 3.5 Sonnet as an LLM judge. In the future, LLM juries will be explored but a single judge will be used for initial results.
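When LLM juries are explored later, the simplest aggregation is a majority vote over the individual judges' verdicts. A minimal sketch of that idea, with entirely hypothetical verdict values:

```python
from collections import Counter

def jury_verdict(votes):
    """Return the most common verdict across multiple judge models.

    A single-judge setup (as in the first pass) is just the degenerate
    case of a one-element vote list.
    """
    return Counter(votes).most_common(1)[0][0]

# Example: three judges, two say "pass".
verdict = jury_verdict(["pass", "pass", "fail"])
```

More sophisticated schemes (weighting judges by their calibration-set agreement, or requiring unanimity for high-stakes metrics) could be layered on the same interface.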