(spike) Evals "Model Card"

jalling97 commented 4 months ago

Description

How the evaluation results get delivered is crucial. This spike covers what a "model card" would look like for evaluating a model against our framework. The "model card" should clearly answer the question: "which model should I use for my use case?"

The model card should incorporate input from design and should convey the most important informational takeaways in a clear and efficient way.

Relevant Links

Galileo Hallucination Index

jalling97 commented 1 month ago

Summary notes from a meeting discussing the Model Card:

Evaluation results would benefit from being output in JSON format for easy use later
We should begin storing historical evaluation results in GitHub as runtime artifacts
Evaluation data can potentially be visualized in UDS Runtime so mission heroes can better understand how their deployed instance is performing
We should have the next stage of NIAH evals in place before we can realistically start delivering eval results to mission heroes
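
As a rough illustration of the JSON note above, here is a minimal sketch of how one evaluation run might be serialized. The `EvalResult` schema, its field names, and the metric name are all hypothetical and are not taken from the LeapfrogAI codebase; the score is a placeholder, not a real result.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class EvalResult:
    """One evaluation run for one model configuration (hypothetical schema)."""
    model: str
    config: dict = field(default_factory=dict)   # hyperparameters used for the run
    metrics: dict = field(default_factory=dict)  # metric name -> score


def results_to_json(results):
    """Serialize a list of EvalResult objects to a stable, diff-friendly JSON string."""
    return json.dumps([asdict(r) for r in results], indent=2, sort_keys=True)


# Placeholder values for illustration only; these are not real evaluation scores.
example = [
    EvalResult(
        model="model-a",
        config={"temperature": 0.1},
        metrics={"niah_retrieval": 0.0},
    )
]
print(results_to_json(example))
```

Sorting keys and using a fixed indent keeps the output stable between runs, which should make stored historical results easier to diff.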

jalling97 commented 1 month ago

Decision

The model card will ultimately exist in a few forms:

(near term) A tabular representation in which each row is a given model (or hyperparameter configuration) and the columns are all of the scored metrics that were applied to that configuration.
(long term) A deployed instance of LeapfrogAI will likely always accompany UDS Runtime. The evaluation results for a deployment will live in a table under its corresponding UDS Runtime page.
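
To make the near-term tabular form concrete, here is a minimal sketch that renders one row per model configuration and one column per metric. The input shape, the function name `render_model_card_table`, and the metric names in the usage example are assumptions for illustration; the actual metrics and layout are still to be decided.

```python
def render_model_card_table(rows):
    """Render rows of {"model": str, "metrics": {name: score}} as a plain-text table.

    Each row is one model/configuration; each column is one metric.
    Metrics that were not applied to a configuration are shown as "-".
    """
    # Collect every metric name seen across all rows, sorted for stable output.
    metric_names = sorted({name for row in rows for name in row["metrics"]})
    header = ["model"] + metric_names
    body = [
        [row["model"]] + [
            f"{row['metrics'][m]:.2f}" if m in row["metrics"] else "-"
            for m in metric_names
        ]
        for row in rows
    ]
    # Size each column to its widest cell, header included.
    widths = [max(len(line[i]) for line in [header] + body) for i in range(len(header))]

    def fmt(cells):
        return " | ".join(c.ljust(w) for c, w in zip(cells, widths)).rstrip()

    sep = "-+-".join("-" * w for w in widths)
    return "\n".join([fmt(header), sep] + [fmt(line) for line in body])


# Hypothetical configurations and placeholder scores, for illustration only.
rows = [
    {"model": "model-a", "metrics": {"metric_x": 0.5}},
    {"model": "model-b", "metrics": {"metric_y": 1.0}},
]
print(render_model_card_table(rows))
```

Showing "-" for metrics that were not run keeps the columns aligned when configurations are scored against different metric sets.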

A model card report will consist of the table of evaluation metrics, a written summary of what the metrics mean and how they relate to specific performance considerations, and model recommendations. The report can be generalized for a wide audience, but will need to be customized for a given potential deployment scenario.
