Open · axiomofjoy opened this issue 1 month ago
Phoenix has two concepts of evaluators: those used in conjunction with llm_classify to annotate spans, and those used as part of our experiments API to annotate dataset examples. Some users want to use the former kind of evaluator with the experiments API. See here for context.
To add some more context: the Human vs AI evaluator in particular seems to fit the context of experiments (where there is a ground truth) much better than span annotation, where there is no ground truth (suppose, for example, the spans are collected from a chat application or another application with free-form user interactions). In fact, I'm unsure how to use the Human vs AI evaluator in any context other than experiments. In any case, I think it would be useful to be able to look at the spans generated during an experiment and annotate them with the span evaluators, as well as to cross-use the span evaluators and the experiment evaluators.
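For concreteness, here is a minimal sketch of what that cross-use might look like, assuming the phoenix.evals llm_classify signature (dataframe, model, template, rails) and that experiment evaluators can be plain functions whose parameters (input, output, ...) are bound by name. The relevance_evaluator name, the column names, the rails list, and the model choice are illustrative assumptions on my part, not anything proposed in this issue:

```python
# Sketch only: adapt an llm_classify-style span evaluator into an experiment
# evaluator. Module paths and argument names reflect recent Phoenix releases
# and may need adjusting; the rails and column names below are assumptions.
import pandas as pd

from phoenix.evals import OpenAIModel, RAG_RELEVANCY_PROMPT_TEMPLATE, llm_classify
from phoenix.experiments import run_experiment


def relevance_evaluator(input, output) -> float:
    """Run the span-style relevancy eval against a single experiment example."""
    # llm_classify operates on a dataframe, so wrap the single example in one row.
    dataframe = pd.DataFrame([{"input": input, "reference": output}])
    result = llm_classify(
        dataframe=dataframe,
        model=OpenAIModel(model="gpt-4o-mini"),
        template=RAG_RELEVANCY_PROMPT_TEMPLATE,
        rails=["relevant", "unrelated"],  # assumed rails for this template
    )
    # llm_classify returns a dataframe with a "label" column; map it to a score.
    return 1.0 if result["label"].iloc[0] == "relevant" else 0.0


# The wrapped function can then be handed to the experiments API like any
# other experiment evaluator, e.g.:
# run_experiment(dataset, task=my_task, evaluators=[relevance_evaluator])
```

Wrapping the single example in a one-row dataframe inside the evaluator keeps the span evaluator's prompt template and rails as the single source of truth, while the experiments API handles running the task and aggregating scores, which seems to be the kind of cross-use being asked for here.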