[ ] Database Schemas based on the evalsandmetrics.md
[ ] View an eval
[ ] View different runs of an eval
[ ] Somehow show the source for various different evaluations and have the ability to grab evals by name
[ ] Clarify whether or not we should show the evals in the computation graph on ell studio
[ ] Show the actual scores for a given input on ell studio as opposed to just the mean
[ ] Easy comparison across many models
[ ] Easy to change parameters of individual models in a chain
[ ] UX for showing the model is different
[ ] UX for api params
[ ] Working verbose mode for @function
[ ] Fix ell.function in general
[ ] Support failure modes in metric computation
[ ] Implement parsers/structured outputs to make this cleaner
[ ] Group runs more cleanly so that they are a part of an eval in the invocation view
[ ] Full UX for comparing different evals across any arbitrary axis
[ ] Arbitrary support for failure mode in lmp invocations
[ ] Clarity into why a currently running invocation is working or not
[ ] Need to solve a really good UX for prompt engineering with no criterion (comparing outputs, etc.)
[ ] Open ell studio automatically when the eval gets run...
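
Two of the items above (showing the actual per-input scores rather than only the mean, and supporting failure modes in metric computation) could be prototyped together. The following is a minimal sketch under stated assumptions, not ell's implementation: `ScoreResult`, `score_all`, and `summarize` are hypothetical names, and a metric is assumed to be a plain callable returning a number.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Any, Callable, List, Optional


@dataclass
class ScoreResult:
    """Hypothetical per-input record: a score, or the error that prevented one."""
    input: Any
    score: Optional[float] = None
    error: Optional[str] = None


def score_all(inputs: List[Any], metric: Callable[[Any], float]) -> List[ScoreResult]:
    """Apply `metric` to every input, recording failures instead of aborting."""
    results = []
    for x in inputs:
        try:
            results.append(ScoreResult(input=x, score=float(metric(x))))
        except Exception as exc:  # one failed metric should not kill the whole run
            results.append(ScoreResult(input=x, error=f"{type(exc).__name__}: {exc}"))
    return results


def summarize(results: List[ScoreResult]) -> dict:
    """Mean over the successful scores, plus a count of failed inputs."""
    scores = [r.score for r in results if r.error is None]
    return {
        "mean": mean(scores) if scores else None,
        "n_failed": sum(r.error is not None for r in results),
    }
```

Keeping the per-input records around means the UI can show individual scores and surface failures, while the mean stays a derived summary.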
Next Step TODOS
[ ] Implement a bunch of standard criteria
[ ] Dataset construction needs to be easy and there should be libraries around this, also matching parity with OpenAI evals
(Misc todos)
[ ] #XXX: Separate this into VersionedEvaluation and Evaluation, because versioning is somewhat expensive if someone has a big eval. Then perhaps we could default to VersionedEval in the docs, or use version=False. Not sure.
[ ] TODO: Link Invocations to EvalRuns
[ ] TODO: Link Invocations to InvocationScores.
[ ] TODO: Write to DB
[ ] TODO: Build UX for analyzing evals.
[ ] TODO: Solve (input, labels, score_fn) etc
[ ] TODO: What about automatic cross validation & splitting.
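
For the cross-validation and splitting item above, one possible starting point is plain k-fold splitting. This is a sketch only: `k_fold_splits` is a hypothetical helper, not an existing ell API, and the dataset is assumed to be an in-memory sequence of datapoints (for example `(input, label)` pairs).

```python
import random
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(
    data: Sequence[T], k: int, seed: int = 0
) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs; every datapoint is held out exactly once."""
    if not 2 <= k <= len(data):
        raise ValueError("k must be between 2 and len(data)")
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)  # seeded so splits are reproducible
    folds = [indices[i::k] for i in range(k)]  # k near-equal index groups
    for held_out in range(k):
        test = [data[i] for i in folds[held_out]]
        train = [
            data[i]
            for j, fold in enumerate(folds)
            if j != held_out
            for i in fold
        ]
        yield train, test
```

A fixed seed keeps the folds stable between runs, which matters if eval runs are meant to be compared against each other.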
This is a major feature release.
Spec: https://github.com/MadcowD/ell/blob/cd64ab9bb0d3a09195fef7a32ef77ac5d7e6c912/docs/ramblings/evalspec.md
Ramblings: https://github.com/MadcowD/ell/blob/cd64ab9bb0d3a09195fef7a32ef77ac5d7e6c912/docs/ramblings/thoughtsonevals.md
Example: https://github.com/MadcowD/ell/blob/6afad20bc58a99e9f3fe0a76ff6b7642471d63a7/examples/eval.py
UX/IMPL TODOs