Is your feature request related to a problem? Please describe.
When creating evals in Phoenix, it would be convenient to store and see a pass/fail marker for a specific eval. The `phoenix.experiments.types.EvaluationResult` type has `score` and `label` properties. Storing an optional set of pass/fail criteria somewhere, to evaluate against that score, would be useful. For example, there is an out-of-the-box `ContainsAnyKeyword` code eval, as shown in this guide, but it outputs a float with no way for the user to know what a good value is.
Describe the solution you'd like
Some way of storing evaluation criteria for the score in an `EvaluationResult`, which could then be used to display pass/fail in the UI.
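As a rough sketch of the idea, the criteria could be as small as a threshold plus a direction. Everything below is hypothetical: `PassCriteria`, its fields, and the example thresholds are illustrative names, not part of Phoenix, and the snippet deliberately does not import or modify `EvaluationResult`.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class PassCriteria:
    """Hypothetical pass/fail rule for a numeric eval score (not a Phoenix type)."""

    threshold: float
    # "maximize": higher scores are better; "minimize": lower scores are better.
    direction: Literal["maximize", "minimize"] = "maximize"

    def passed(self, score: Optional[float]) -> Optional[bool]:
        """Return True/False for a numeric score, or None when there is no score."""
        if score is None:
            return None
        if self.direction == "maximize":
            return score >= self.threshold
        return score <= self.threshold


# Assumed example values: a keyword eval that should score 1.0,
# and a latency-style metric where lower is better.
keyword_criteria = PassCriteria(threshold=1.0)
latency_criteria = PassCriteria(threshold=2.5, direction="minimize")

print(keyword_criteria.passed(1.0))
print(latency_criteria.passed(3.0))
```

Keeping the direction next to the threshold means the same structure covers both "higher is better" and "lower is better" evals, and a missing score stays distinguishable from a failure.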
Describe alternatives you've considered
I will likely use pytest to run evals and describe my pass/fail criteria there, so the checks can run in CI/CD.
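The pytest workaround might look roughly like the following. This is a sketch under stated assumptions: `contains_any_keyword_score` is a local stand-in written for illustration, not the Phoenix `ContainsAnyKeyword` evaluator, and the keywords, sample outputs, and `MIN_SCORE` threshold are made up.

```python
# test_evals.py -- run with `pytest test_evals.py`

KEYWORDS = ("refund", "return")  # hypothetical keywords for this example
MIN_SCORE = 1.0  # hypothetical pass threshold, kept in the test file


def contains_any_keyword_score(output: str, keywords=KEYWORDS) -> float:
    """Stand-in keyword eval: 1.0 if any keyword appears in the output, else 0.0."""
    text = output.lower()
    return float(any(keyword.lower() in text for keyword in keywords))


def test_output_with_keyword_passes():
    # The pass/fail criterion lives in the assertion, not in the eval result.
    score = contains_any_keyword_score("You can request a refund within 30 days.")
    assert score >= MIN_SCORE


def test_output_without_keyword_is_below_threshold():
    score = contains_any_keyword_score("Thanks for contacting support.")
    assert score < MIN_SCORE
```

The drawback is that the threshold is only visible in the test file, so nothing in the Phoenix UI shows whether a given score counts as a pass.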
Additional context
The pass/fail criteria could also be graphed if a graphing feature is released for experiments.
@chrishart0 makes total sense. We have been unopinionated to start, but things like directionality, pass/fail, and boolean results all make sense. Will add to experimentation enhancements.