Arize-ai / phoenix

AI Observability & Evaluation
https://docs.arize.com/phoenix

[ENHANCEMENT] Evaluation relevancy for reranker spans #4728

Open regybean opened 1 week ago

regybean commented 1 week ago

Is your feature request related to a problem? Please describe. To the best of my knowledge, there is no easy way to evaluate relevancy when a reranker is included in the pipeline. If you want to evaluate the effect of adding a reranker, there are no document relevancy metrics that can be applied to the reranked documents.

Describe the solution you'd like I would like documentation and evaluations that resemble the retriever relevancy evaluation: NDCG, precision, hit rate, and individual document evaluations with explanations, as seen in this notebook for the document relevancy evaluation.

Describe alternatives you've considered I have implemented a messy solution that computes NDCG, precision, and hit rate from the reranker output and runs the evaluations using llm_classify. However, when logging them with log_evaluations, they cannot be viewed in Phoenix under the reranker span. With further code diving this might be possible to implement, but my question is why no issues have been raised about this; surely it is a common scenario, unless I am missing something?
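
For reference, a minimal sketch of that kind of workaround, assuming the reranker spans follow the OpenInference semantic conventions (a `RERANKER` span kind and a `reranker.output_documents` attribute) and that the per-document index returned by the span query matches what `DocumentEvaluations` expects; the model choice and attribute names are assumptions and may need adjusting for your instrumentation:

```python
import phoenix as px
from phoenix.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)
from phoenix.trace import DocumentEvaluations
from phoenix.trace.dsl import SpanQuery

client = px.Client()

# Pull the reranked documents off the reranker spans. The attribute name
# "reranker.output_documents" is assumed from the OpenInference conventions
# and may differ depending on how your pipeline is instrumented.
query = (
    SpanQuery()
    .where("span_kind == 'RERANKER'")
    .select(input="input.value")
    .explode("reranker.output_documents", reference="document.content")
)
reranked_docs = client.query_spans(query)

# Label each (query, document) pair as relevant/unrelated with an LLM judge,
# reusing the built-in RAG relevancy template.
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
relevance_evals = llm_classify(
    dataframe=reranked_docs,
    model=OpenAIModel(model="gpt-4o-mini"),  # placeholder model choice
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=rails,
    provide_explanation=True,
)
relevance_evals["score"] = (relevance_evals["label"] == "relevant").astype(int)

# Log the per-document labels back against the spans they came from.
client.log_evaluations(
    DocumentEvaluations(eval_name="Reranked Relevance", dataframe=relevance_evals)
)
```

Whether evaluations logged this way actually surface on the reranker span in the UI is exactly the open question in this issue.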

Additional context Obviously, due to the way the reranker works, it does not make sense to run relevancy evaluations on the retriever span, since not all of those documents make it into the final query. I understand the reranker score acts as a rough relevancy metric, and there may be a way to just use that score; however, there would then be no way to statistically measure the improvement from no-reranker -> reranker, since the metric would change. I am currently using LlamaIndex for my RAG pipeline.
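
One hedged illustration of the comparison point: if the per-document LLM relevance labels (rather than the reranker's own scores) are turned into ranking metrics, the same NDCG, precision, and hit-rate numbers can be computed for both the plain-retriever and reranked pipelines. A toy sketch, with a made-up dataframe standing in for the llm_classify output above:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import ndcg_score

# Toy per-document relevance labels, indexed by (span_id, document_position),
# in the shape produced by the llm_classify sketch above.
relevance_evals = pd.DataFrame(
    {"score": [1, 0, 1, 0, 0, 0]},
    index=pd.MultiIndex.from_tuples(
        [
            ("span_a", 0), ("span_a", 1), ("span_a", 2),
            ("span_b", 0), ("span_b", 1), ("span_b", 2),
        ],
        names=["context.span_id", "document_position"],
    ),
)

def ndcg_at_k(labels: pd.Series, k: int = 2) -> float:
    """NDCG@k, treating the document order (position) as the predicted ranking."""
    y_true = labels.to_numpy()[None, :]
    y_rank = -np.arange(len(labels))[None, :]  # earlier positions rank higher
    return float(ndcg_score(y_true, y_rank, k=k)) if y_true.sum() else 0.0

per_span = relevance_evals.groupby(level="context.span_id")["score"]
span_metrics = pd.DataFrame(
    {
        "ndcg@2": per_span.apply(ndcg_at_k),
        "precision@2": per_span.apply(lambda s: s.iloc[:2].mean()),
        "hit": per_span.apply(lambda s: float(s.max() > 0)),
    }
)
print(span_metrics)
```

The same per-span numbers could then be logged back with SpanEvaluations, so runs with and without the reranker are compared on one consistent scale.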

axiomofjoy commented 4 days ago

Hey @regybean, thanks so much for the feedback. Definitely makes sense that you want a way to evaluate re-ranking strategies.