explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

How to interpret the combination of metrics: context precision and the rest (real world example) #308

Closed younes-io closed 3 months ago

younes-io commented 10 months ago

I ran ragas to evaluate my LangChain-powered chatbot (it's basically a QA chain with document retrieval) and I got the following results.

| question | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | context_relevancy |
|---|---|---|---|---|---|---|
| Q1 | GT1 | 1 | 0.813637991 | 1 | 0 | 0.002824859 |
| Q2 | GT2 | 1 | 0.835290922 | 0 | 0 | 0.002890173 |
| Q3 | GT3 | 1 | 0.882307479 | 1 | 0 | 0.002659574 |
| Q4 | GT4 | 1 | 0.844765424 | 0 | 0 | 0.01953125 |
| Q5 | GT5 | 1 | 0.889618083 | 1 | 0 | 0.017857143 |

As you can see, the context_precision values are very low (horrible, really). As far as I understand, context_precision is another form of context_relevancy, which I think will be removed according to the docs. I did some debugging to understand the intermediate calculations (I didn't grasp everything, but I have a rough idea), and I'm wondering how this situation is possible. This is how I interpret it (correct me if I'm wrong):

context_recall: 1.00 (can it retrieve all the relevant information required to answer the question? YES)
context_precision: 0.00 (the signal-to-noise ratio of the retrieved context: almost everything retrieved is noise)
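A toy calculation shows how a recall of 1 and a precision near 0 can coexist. This is only a sketch: it is not how ragas computes either metric (ragas relies on LLM judgments), and the sets of "needed facts" and "retrieved sentences" are hypothetical data made up for illustration.

```python
# Toy illustration only: set-based recall/precision, NOT ragas's LLM-based metrics.

def toy_recall(needed: set, retrieved: list) -> float:
    """Fraction of needed facts that appear somewhere in the retrieved items."""
    if not needed:
        return 0.0
    return len(needed & set(retrieved)) / len(needed)


def toy_precision(needed: set, retrieved: list) -> float:
    """Fraction of retrieved items that are actually needed (signal / total)."""
    if not retrieved:
        return 0.0
    return sum(item in needed for item in retrieved) / len(retrieved)


# Hypothetical data: 2 needed facts buried among 200 retrieved sentences.
needed = {"fact_a", "fact_b"}
retrieved = ["fact_a", "fact_b"] + [f"noise_{i}" for i in range(198)]

print(toy_recall(needed, retrieved))     # 1.0
print(toy_precision(needed, retrieved))  # 0.01
```

Everything needed is present, so recall is perfect, yet only 2 of the 200 retrieved items are signal, so precision is tiny.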

For example, I checked that for one answer, this is how the context precision metric evaluated the 2 retrieved documents:

[[ChatGeneration(text='No.', generation_info={'finish_reason': 'stop'}, message=AIMessage(content='No.'))]

Yet the faithfulness is 1 and the answer relevancy is 0.81. I'm really confused. Maybe I'm missing something, but I'd like to understand how to interpret not only each metric on its own, but also the combinations of their values and what they imply.
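To make the 'No.' output above concrete, here is a minimal sketch. It assumes the metric asks the LLM one yes/no question per retrieved chunk and scores the fraction of "yes" answers. That is an assumption for illustration, not a confirmed description of ragas internals, and `score_from_verdicts` is a hypothetical helper.

```python
from typing import Sequence


def score_from_verdicts(verdicts: Sequence[str]) -> float:
    """Share of verdicts that start with 'yes' (case-insensitive).

    Hypothetical aggregation, not taken from the ragas source code.
    """
    if not verdicts:
        return 0.0
    positives = sum(v.strip().lower().startswith("yes") for v in verdicts)
    return positives / len(verdicts)


# Two retrieved chunks, both judged "No." by the LLM.
print(score_from_verdicts(["No.", "No."]))
# One chunk judged useful, one not.
print(score_from_verdicts(["Yes.", "No."]))
```

Under that assumption, a single run of "No." verdicts is enough to drive the score to zero, independently of the scores computed on the final answer.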

Thank you,

younes-io commented 10 months ago

I'm also wondering whether this is a "side effect" of the relatively long chunks in my docs (around 500 tokens). I don't know if this also affects the calculation.

younes-io commented 10 months ago

@shahules786: could you please take a look at this?

shahules786 commented 10 months ago

Hi @younes-io, this is an interesting but weird result. Would you be able to share a subset of your data so that I can better understand what's going on?

younes-io commented 10 months ago

@shahules786 I'm afraid I can't share that since it's private data. Basically, I have document chunks (say 2) returned by OpenSearch, which contain the answer to the question. The first document contains the answer; the second contains only a small portion of it and is longer than the first. I'm just wondering whether ragas takes the ratio of "relevance to the question / length of the context" into account when calculating context_precision.
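A side observation on the numbers in the first results table: to the printed precision, each context_relevancy value matches a simple fraction with a small numerator and a large denominator (for example 1/354). That would be consistent with a count-based ratio such as "relevant units / total units in the context", which longer chunks would push towards zero. That reading is an assumption and is not confirmed anywhere in this thread. The sketch uses only Python's standard `fractions` module.

```python
from fractions import Fraction

# context_relevancy values copied from the first results table (Q1..Q5).
scores = [0.002824859, 0.002890173, 0.002659574, 0.01953125, 0.017857143]

# Closest fraction to each score with a denominator of at most 1000.
approximations = [Fraction(s).limit_denominator(1000) for s in scores]

for score, approx in zip(scores, approximations):
    error = abs(float(approx) - score)
    print(f"{score} ~ {approx} (absolute error {error:.1e})")
```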

younes-io commented 10 months ago

@shahules786: I have tested this using the example from the ragas docs.

So, I used this dataset:

from datasets import load_dataset

fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
fiqa_eval

and here's the result:

| | question | contexts | answer | ground_truths | context_precision | faithfulness | answer_relevancy | context_recall | context_relevancy |
|---|---|---|---|---|---|---|---|---|---|
| 0 | How to deposit a cheque issued to an associate... | [Just have the associate sign the back and the... | \nThe best way to deposit a cheque issued to a... | [Have the check reissued to the proper payee.J... | 0.0 | 1.0 | 0.938239 | 0.875 | 0.058824 |
| 1 | Can I send a money order from USPS as a business? | [Sure you can. You can fill in whatever you w... | \nYes, you can send a money order from USPS as... | [Sure you can. You can fill in whatever you w... | 0.0 | 0.8 | 0.885277 | 1.000 | 0.285714 |
| 2 | 1 EIN doing business under multiple business n... | [You're confusing a lot of things here. Compan... | \nYes, it is possible to have one EIN doing bu... | [You're confusing a lot of things here. Compan... | 0.0 | 0.8 | 0.924754 | 0.000 | 0.083333 |
| 3 | Applying for and receiving business credit | [Set up a meeting with the bank that handles y... | \nApplying for and receiving business credit c... | ["I'm afraid the great myth of limited liabili... | 0.0 | 1.0 | 0.899104 | 0.500 | 0.333333 |
| 4 | 401k Transfer After Business Closure | [The time horizon for your 401K/IRA is essenti... | \nIf your employer has closed and you need to ... | [You should probably consult an attorney. Howe... | 0.0 | 0.6 | 0.853572 | 0.000 | 0.043478 |

The context_precision is "almost" always equal to zero (or holds a near-zero value).
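As a quick sanity check on the results table above, the metric columns can be summarised with plain Python. The numbers are copied from the five rows shown; nothing here calls ragas. A metric that is exactly zero on every row, even on the documented example dataset, may point to something systematic rather than to the retrieval quality of individual rows, though that is only a guess.

```python
from statistics import mean

# Metric columns copied from the results table above (rows 0..4).
results = {
    "context_precision": [0.0, 0.0, 0.0, 0.0, 0.0],
    "faithfulness": [1.0, 0.8, 0.8, 1.0, 0.6],
    "answer_relevancy": [0.938239, 0.885277, 0.924754, 0.899104, 0.853572],
    "context_recall": [0.875, 1.0, 0.0, 0.5, 0.0],
    "context_relevancy": [0.058824, 0.285714, 0.083333, 0.333333, 0.043478],
}

# Metrics that are exactly zero on every row.
always_zero = [name for name, values in results.items() if max(values) == 0.0]

for name, values in results.items():
    print(f"{name:18s} mean={mean(values):.3f} "
          f"min={min(values):.3f} max={max(values):.3f}")
print("zero on every row:", always_zero)
```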

N.B: in the docs, the context precision is not displayed.

younes-io commented 10 months ago

@shahules786: sorry to bother you, but is someone from the team or community able to help with this? Thank you.

shahules786 commented 10 months ago

Hi @younes-io, apologies for the late reply. Can you share your ragas version and the LLM you used? Also, can you try the same thing with the latest ragas from main? You can install it from source using pip install git+https://github.com/explodinggradients/ragas
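For reporting the installed version, one option is the standard library's `importlib.metadata`, which reads the version from the installed package metadata. This is a minimal sketch; `installed_version` is a hypothetical helper name, and it returns `None` when the package is not installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version if ragas is installed in this environment, else None.
print("ragas:", installed_version("ragas"))
```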

shahules786 commented 10 months ago

@younes-io If you're open for a short call, I would love to help in person. Please book a slot here (early next week)

younes-io commented 9 months ago

@shahules786 no worries, I'm also very sorry for the very late reply. Sure, I'll book a slot!