Evaluations in ell

UX/IMPL TODOs

[ ] Database Schemas based on the evalsandmetrics.md
[ ] View an eval
[ ] View different runs of an eval
[ ] Somehow show the source for various different evaluations and have the ability to grab evals by name
[ ] Clarify whether or not we should show the evals in the computation graph on ell studio
[ ] Show the actual scores for a given input on ell studio as opposed to just the mean
[ ] Easy comparison across many models
[ ] Easy to change parameters of individual models in a chain
[ ] UX for showing the model is different
[ ] UX for api params
[ ] Working verbose mode for @function
[ ] Fix ell.function in general
[ ] Support failure modes in metric computation
[ ] Implement parsers/structured outputs to make this cleaner
[ ] Group runs more cleanly so that they are a part of an eval in the invocation view
[ ] Full UX for comparing different evals across any arbitrary axis
[ ] Arbitrary support for failure mode in lmp invocations
[ ] Clarity into why a currently running invocation is working or not
[ ] Need to solve for a really good UX for prompt engineering with no criterion (comparing outputs, etc.)
[ ] Open ell studio automatically when the eval gets run...
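
Two of the items above (showing the actual score for each input rather than only the mean, and supporting failure modes in metric computation) can be illustrated together. The following is a minimal sketch in plain Python, not ell's implementation: `ScoreResult`, `score_outputs`, `summarize`, and `word_count` are all hypothetical names introduced here for illustration. It assumes a metric is simply a callable that may raise.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional, Sequence


@dataclass
class ScoreResult:
    """Hypothetical record: the score for one input, or the error that prevented scoring."""
    index: int
    score: Optional[float] = None
    error: Optional[str] = None


def score_outputs(outputs: Sequence[str], metric: Callable[[str], float]) -> list:
    """Score every output individually; a failing metric call is recorded, not fatal."""
    results = []
    for i, output in enumerate(outputs):
        try:
            results.append(ScoreResult(index=i, score=float(metric(output))))
        except Exception as exc:
            # Keep the failure next to the input that caused it so a UI could surface it.
            results.append(ScoreResult(index=i, error=f"{type(exc).__name__}: {exc}"))
    return results


def summarize(results: Sequence[ScoreResult]) -> dict:
    """Aggregate only the inputs that were scored, and report how many failed."""
    scores = [r.score for r in results if r.score is not None]
    return {
        "mean": mean(scores) if scores else None,
        "scored": len(scores),
        "failed": len(results) - len(scores),
    }


def word_count(text: str) -> float:
    """Toy metric (an assumption for this example) that rejects empty outputs."""
    if not text:
        raise ValueError("empty output")
    return len(text.split())


results = score_outputs(["a b c", "", "hello"], word_count)
summary = summarize(results)
for r in results:
    print(r)
print(summary)
```

Keeping per-input results as the primary record and deriving the mean from them means a view could show both, and a single failed metric call does not discard the rest of the run.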

Next Step TODOS

[ ] Implement a bunch of standard criteria
[ ] Dataset construction needs to be easy, with libraries to support it; it should also reach parity with OpenAI evals
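
As a rough idea of what "standard criteria" could look like, here is a small sketch in plain Python. It is an assumption about the shape of such a library, not something ell provides: the names `exact_match`, `contains`, `regex_match`, `STANDARD_CRITERIA`, and `apply_criteria` are hypothetical, and each criterion is assumed to take an output plus a reference and return a score in [0, 1].

```python
import re
from typing import Callable, Dict

# Assumed signature for a criterion: (output, reference) -> score in [0, 1].
Criterion = Callable[[str, str], float]


def exact_match(output: str, label: str) -> float:
    """1.0 when output and label are identical after trimming whitespace."""
    return float(output.strip() == label.strip())


def contains(output: str, label: str) -> float:
    """1.0 when the label appears anywhere in the output, ignoring case."""
    return float(label.lower() in output.lower())


def regex_match(output: str, pattern: str) -> float:
    """1.0 when the regular expression matches somewhere in the output."""
    return float(re.search(pattern, output) is not None)


# Hypothetical registry so criteria can be looked up by name.
STANDARD_CRITERIA: Dict[str, Criterion] = {
    "exact_match": exact_match,
    "contains": contains,
    "regex_match": regex_match,
}


def apply_criteria(output: str, reference: str, names) -> Dict[str, float]:
    """Score one output against one reference under several named criteria."""
    return {name: STANDARD_CRITERIA[name](output, reference) for name in names}


report = apply_criteria("The capital is Paris.", "paris", ["exact_match", "contains"])
print(report)
```

A name-keyed registry is one possible way to make criteria discoverable and reusable across evals; other designs (classes, decorators) would serve equally well.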

(Misc todos)

[ ] #XXX: Separate this into VersionedEvaluation and Evaluation, because versioning is somewhat expensive if someone has a big eval. Then perhaps we could default to VersionedEval in the docs, or offer version=False. Not sure.
[ ] TODO: Link Invocations to EvalRuns
[ ] TODO: Link Invocations to InvocationScores.
[ ] TODO: Write to DB
[ ] TODO: Build UX for analyzing evals.
[ ] TODO: Solve (input, labels, score_fn) etc
[ ] TODO: What about automatic cross validation & splitting.
[ ] TODO: Consider wandb style metrics later.
[ ] Need a way to compare evals across metrics
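
The Evaluation / VersionedEvaluation split and the (input, labels, score_fn) question above could be sketched as follows. This is only an illustration under stated assumptions: the class names come from the TODO, but every field, method, and the hashing scheme are hypothetical and do not describe how ell versions anything. The version here hashes the scoring function's bytecode and the dataset's `repr`, which is exactly the kind of work that gets expensive for a big eval and that the plain `Evaluation` skips.

```python
import hashlib
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple


@dataclass
class Evaluation:
    """Unversioned eval: a name, (input, label) pairs, and a scoring function."""
    name: str
    dataset: Sequence[Tuple[Any, Any]]       # assumed shape: (input, label)
    score_fn: Callable[[Any, Any], float]    # assumed shape: (output, label) -> score

    def run(self, lmp: Callable[[Any], Any]) -> list:
        """Call the program on each input and return one score per example."""
        return [float(self.score_fn(lmp(x), y)) for x, y in self.dataset]


@dataclass
class VersionedEvaluation(Evaluation):
    """Same eval plus a content hash; the hash walks the entire dataset."""

    def version(self) -> str:
        h = hashlib.sha256()
        h.update(self.name.encode())
        # Bytecode only, and only for plain Python functions; a real scheme
        # would more likely hash source code and dependencies.
        h.update(self.score_fn.__code__.co_code)
        for example in self.dataset:
            h.update(repr(example).encode())
        return h.hexdigest()[:12]


def same(output: Any, label: Any) -> float:
    """Toy score function for the example."""
    return float(output == label)


small = VersionedEvaluation("upper", [("a", "A"), ("b", "B")], same)
bigger = VersionedEvaluation("upper", [("a", "A"), ("b", "B"), ("c", "C")], same)
plain = Evaluation("upper", [("a", "A"), ("b", "B")], same)

scores = small.run(str.upper)
print(scores, small.version(), bigger.version())
```

With this shape, the `version=False` idea from the TODO would amount to constructing the base class, and linking invocations to eval runs could key on `(name, version)` when a version exists.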

MadcowD / ell