[ENHANCEMENT] Code Eval pass/fail designation

Is your feature request related to a problem? Please describe. When creating evals in Phoenix, it would be convenient to be able to store and see a pass/fail marker for a specific eval. The phoenix.experiments.types.EvaluationResult type has score and label properties. Storing an optional set of pass/fail criteria somewhere, to evaluate that score against, would be useful. For example, there is an out-of-the-box ContainsAnyKeyword code eval, as shown in this guide, but it outputs a float with no way for the user to know what a good value is.

Describe the solution you'd like Some way of storing evaluation criteria for the score in an EvaluationResult, which could then be used to display pass/fail in the UI.
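To make the request concrete, here is a minimal sketch of what such a criterion could look like. Everything in it is hypothetical: `PassCriterion` and `designate` are names I made up for illustration and are not part of Phoenix, and the block deliberately avoids importing Phoenix so it runs on its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PassCriterion:
    """Hypothetical pass/fail rule for a numeric eval score (not a Phoenix type)."""

    threshold: float
    higher_is_better: bool = True

    def passed(self, score: float) -> bool:
        # Compare the score against the threshold in the configured direction.
        if self.higher_is_better:
            return score >= self.threshold
        return score <= self.threshold


def designate(
    score: Optional[float], criterion: Optional[PassCriterion]
) -> Optional[str]:
    """Return "pass" or "fail", or None when there is nothing to judge."""
    if score is None or criterion is None:
        return None
    return "pass" if criterion.passed(score) else "fail"


print(designate(1.0, PassCriterion(threshold=0.5)))
print(designate(0.2, PassCriterion(threshold=0.5)))
```

Keeping the criterion optional means existing evals that only report a score or label would be unaffected; the UI would show pass/fail only where a criterion was supplied.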

Describe alternatives you've considered I will likely use pytest to run evals and define my pass/fail criteria there, so that they can run in CI/CD.
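Roughly, the pytest workaround would look like the sketch below. The `contains_any_keyword` function is a simplified stand-in I wrote for illustration, not Phoenix's ContainsAnyKeyword implementation, and `PASS_THRESHOLD` is an assumed value; the point is only that the pass/fail criterion ends up living in the test file rather than alongside the eval result.

```python
from typing import Sequence

# Assumed criterion: the stand-in evaluator must report a full match.
PASS_THRESHOLD = 1.0


def contains_any_keyword(output: str, keywords: Sequence[str]) -> float:
    """Stand-in evaluator: 1.0 if any keyword occurs in the output, else 0.0."""
    lowered = output.lower()
    return 1.0 if any(k.lower() in lowered for k in keywords) else 0.0


def test_output_mentions_a_keyword():
    # pytest collects this function; a failed assert fails the CI run.
    score = contains_any_keyword("Phoenix traces LLM calls", ["trace", "span"])
    assert score >= PASS_THRESHOLD
```

This works for CI, but the threshold is invisible to anyone looking at the experiment in the Phoenix UI, which is the gap this request is about.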

Additional context The pass/fail criteria could also be graphed if a graphing feature is released for experiments.

Arize-ai / phoenix

[ENHANCEMENT] Code Eval pass/fail designation #4256