Arize-ai / phoenix

AI Observability & Evaluation
https://docs.arize.com/phoenix

[ENHANCEMENT] Code Eval pass/fail designation #4256

Open chrishart0 opened 1 month ago

chrishart0 commented 1 month ago

Is your feature request related to a problem? Please describe. When creating evals in Phoenix it would be convenient to be able to store and see a pass/fail marker for a specific eval. The phoenix.experiments.types.EvaluationResult type has score and label properties. It would be useful to store an optional set of pass/fail criteria somewhere and evaluate the score against them. For example, there is an out-of-the-box ContainsAnyKeyword code eval, as shown in this guide, but it outputs a float with no way for the user to know what a good value is.
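For illustration, the closest workaround today is a custom evaluator that folds the verdict into the label itself. This is a minimal sketch, not a Phoenix API: the threshold, function name, and scoring logic are assumptions, and it only relies on EvaluationResult accepting the score and label fields mentioned above.

```python
from phoenix.experiments.types import EvaluationResult

# Hypothetical threshold: scores at or above 0.5 count as a pass.
PASS_THRESHOLD = 0.5


def contains_any_keyword_with_verdict(output: str, keywords: list[str]) -> EvaluationResult:
    """Score by keyword hits, then fold a pass/fail verdict into the label."""
    hits = sum(1 for kw in keywords if kw.lower() in output.lower())
    score = hits / len(keywords) if keywords else 0.0
    return EvaluationResult(
        score=score,
        label="pass" if score >= PASS_THRESHOLD else "fail",
    )
```

The drawback is that the criterion lives in user code and is invisible to the UI, which is exactly what this request is about.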

Describe the solution you'd like A way of storing evaluation criteria for the score in an EvaluationResult, which could then be used to display pass/fail in the UI.
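To make the request concrete, one hypothetical shape for such stored criteria is sketched below. None of these names exist in Phoenix; this is only what the metadata attached to a score might look like.

```python
from dataclasses import dataclass


@dataclass
class ScoreCriterion:
    """Hypothetical criterion Phoenix could store alongside an EvaluationResult."""
    pass_threshold: float = 0.5
    higher_is_better: bool = True  # directionality, as mentioned in the reply below

    def verdict(self, score: float) -> str:
        # Compare the score against the threshold in the stated direction.
        if self.higher_is_better:
            passed = score >= self.pass_threshold
        else:
            passed = score <= self.pass_threshold
        return "pass" if passed else "fail"
```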

Describe alternatives you've considered I will likely use pytest to run my evals and define my pass/fail criteria there, so the evals can run in CI/CD.
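A rough sketch of that alternative, assuming a hypothetical run_contains_any_keyword_eval helper that runs the app plus the keyword eval and returns a float score (the helper name, module, cases, and threshold are all placeholders, not a Phoenix API):

```python
import pytest

# Hypothetical helper; its name and signature are assumptions for illustration.
from my_evals import run_contains_any_keyword_eval

CASES = [
    ("What is Phoenix?", ["observability", "evaluation"]),
    ("How do I run an experiment?", ["experiment", "dataset"]),
]


@pytest.mark.parametrize("question, keywords", CASES)
def test_keyword_eval_meets_threshold(question, keywords):
    # The pass/fail criterion lives here in the test, not in Phoenix.
    score = run_contains_any_keyword_eval(question, keywords)
    assert score >= 0.5, f"keyword eval score {score} below pass threshold"
```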

Additional context The pass/fail criteria could also be graphed if a graphing feature is released for experiments.

mikeldking commented 1 month ago

@chrishart0 makes total sense. We have been unopinionated to start, but things like directionality, pass/fail, and boolean labels all make sense. Will add to experimentation enhancements.