Arize-ai / phoenix

AI Observability & Evaluation
https://docs.arize.com/phoenix

[ENHANCEMENT] make evaluators compatible with experiments api #4827

Open axiomofjoy opened 1 month ago

axiomofjoy commented 1 month ago

Phoenix has two concepts of evaluators: those used in conjunction with llm_classify to annotate spans, and those used as part of our experiments API to annotate dataset examples. Some users want to use the former kind of evaluator with the experiments API.

See here for context.
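For illustration, one way to bridge the two today is to wrap llm_classify so it scores a single example at a time. The sketch below is not part of the Phoenix API; the template text, the "input"/"output" column names, and the label-to-score mapping are placeholders:

```python
# Rough sketch: adapt an llm_classify-style (span) evaluator into a function
# the experiments API can call once per dataset example. The template's
# variable names, the rails, and the positive label are assumptions.
import pandas as pd

from phoenix.evals import OpenAIModel, llm_classify


def make_experiment_evaluator(template: str, rails: list, model: OpenAIModel):
    def evaluate(input, output) -> float:
        # llm_classify operates on a dataframe, so build a one-row frame
        # whose columns match the template's variables.
        df = pd.DataFrame([{"input": input, "output": output}])
        labels = llm_classify(dataframe=df, template=template, model=model, rails=rails)
        # Collapse the classification label into a numeric experiment score.
        return 1.0 if labels["label"].iloc[0] == rails[0] else 0.0

    return evaluate
```

An evaluator built this way could then be passed to run_experiment alongside the existing experiment evaluators.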

omrihar commented 1 month ago

To add some more context: the Human vs AI evaluator in particular fits the experiments setting (where there is a ground truth) much better than span annotation, where there isn't one (suppose, for example, the spans are collected from a chat application or another application with free-form user interactions). In fact, I'm not sure how to use the Human vs AI evaluator in any context other than experiments. In any case, I think it would be useful to be able to look at the spans generated during an experiment and annotate them with the span evaluators, and more generally to cross-use the span evaluators and the experiment evaluators.
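To make that concrete, here is a hedged sketch of running the Human vs AI eval as an experiment evaluator, where the dataset example supplies the ground-truth answer. The template and rails constants, the template's column names, and the model constructor arguments are taken from the span-level eval and may differ across Phoenix versions, so treat this as an illustration rather than a supported path:

```python
import pandas as pd

from phoenix.evals import (
    HUMAN_VS_AI_PROMPT_RAILS_MAP,  # assumed names; check your phoenix version
    HUMAN_VS_AI_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(model="gpt-4o")  # constructor argument may vary by version
rails = list(HUMAN_VS_AI_PROMPT_RAILS_MAP.values())


def human_vs_ai(input, output, expected) -> float:
    # Depending on the dataset schema, input/expected may be dicts that need
    # unpacking; here they are assumed to be plain strings.
    df = pd.DataFrame(
        [
            {
                "question": input,  # template variable names are assumed
                "ai_generated_answer": output,
                "correct_answer": expected,
            }
        ]
    )
    labels = llm_classify(
        dataframe=df,
        template=HUMAN_VS_AI_PROMPT_TEMPLATE,
        model=model,
        rails=rails,
    )
    return 1.0 if labels["label"].iloc[0] == "correct" else 0.0


# The evaluator could then be passed to the experiments API, e.g.
# run_experiment(dataset, task=my_task, evaluators=[human_vs_ai]).
```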