Open · axiomofjoy opened this issue 1 month ago
Phoenix has two concepts of evaluators: those used in conjunction with llm_classify to annotate spans, and those used as part of our experiments API to annotate dataset examples. Some users want to use the former kind of evaluator with the experiments API. See here for context.
To add some more context: the Human vs AI evaluator in particular seems to fit the context of experiments (where there is a ground truth) much better than span annotation, where there is no ground truth (suppose, for example, the spans are collected from a chat application or another application with free-form user interactions). In fact, I'm unsure how to use the Human vs AI evaluator in any context other than experiments. In any case, I think it would be useful to be able to look at the spans generated during an experiment and annotate them with the span evaluators, as well as to cross-use the span evaluators and the experiment evaluators.
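For concreteness, here is a minimal sketch of what that cross-use might look like, assuming the phoenix.evals llm_classify signature (dataframe, model, template, rails) and that experiment evaluators can be plain functions whose parameters (input, output, ...) are bound by name. The relevance_evaluator name, the column names, the rails list, and the model choice are illustrative assumptions on my part, not anything proposed in this issue:

```python
# Sketch only: adapt an llm_classify-style span evaluator into an experiment
# evaluator. Module paths and argument names reflect recent Phoenix releases
# and may need adjusting; the rails and column names below are assumptions.
import pandas as pd

from phoenix.evals import OpenAIModel, RAG_RELEVANCY_PROMPT_TEMPLATE, llm_classify
from phoenix.experiments import run_experiment


def relevance_evaluator(input, output) -> float:
    """Run the span-style relevancy eval against a single experiment example."""
    # llm_classify operates on a dataframe, so wrap the single example in one row.
    dataframe = pd.DataFrame([{"input": input, "reference": output}])
    result = llm_classify(
        dataframe=dataframe,
        model=OpenAIModel(model="gpt-4o-mini"),
        template=RAG_RELEVANCY_PROMPT_TEMPLATE,
        rails=["relevant", "unrelated"],  # assumed rails for this template
    )
    # llm_classify returns a dataframe with a "label" column; map it to a score.
    return 1.0 if result["label"].iloc[0] == "relevant" else 0.0


# The wrapped function can then be handed to the experiments API like any
# other experiment evaluator, e.g.:
# run_experiment(dataset, task=my_task, evaluators=[relevance_evaluator])
```

Wrapping the single example in a one-row dataframe inside the evaluator keeps the span evaluator's prompt template and rails as the single source of truth, while the experiments API handles running the task and aggregating scores, which seems to be the kind of cross-use being asked for here.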