How to interpret the combination of metrics: context precision and the rest (real world example)

younes-io commented 10 months ago

I ran ragas to evaluate my LangChain-powered chatbot (it's basically a QA chain with document retrieval) and I got the following results.

question	ground_truth	faithfulness	answer_relevancy	context_recall	context_relevancy
Q1	GT1	1	0.813637991	1	0.002824859
Q2	GT2	1	0.835290922	0	0.002890173
Q3	GT3	1	0.882307479	1	0.002659574
Q4	GT4	1	0.844765424	0	0.01953125
Q5	GT5	1	0.889618083	1	0.017857143

Of course, the context_precision (another form of context_relevancy which will disappear I think, according to the docs) values are very low (aka horrible). So, I did some debugging to understand the intermediate calculations (I didn't grasp everything.. but I've got an idea), and I'm wondering how is this situation possible (this is how I interpret it, and correct if I'm wrong):

context_recall: 1.00 (can it retrieve all the relevant information required to answer the question: YES) contextprecision: 0.00 (the signal to noise ration of retrieved context: -almost- everything retrieved is Noise_)

For example, I checked that for one answer, this is how the context precision metric evaluated the 2 retrieved documents:

[[ChatGeneration(text='No.', generation_info={'finish_reason': 'stop'}, message=AIMessage(content='No.'))]

Yet, the faithfullness is 1 and the answer relevancy is 0.81.. I'm really confused.. maybe I miss something, but I'd like to understand how to interpret not only each metric independently, but the combinations of their values and what they entail.

Thank you,

younes-io commented 10 months ago

I'm also wondering if this is a "side effect" of the (relatively) long chunks of my docs ? (around 500 tokens).. I don't know if this also impacts the calculation..

younes-io commented 10 months ago

@shahules786 : could you please take a look on this please?

shahules786 commented 10 months ago

Hi @younes-io , this is an interesting but weird result. Will you be able to share a subset of your data so that I can understand well what's going on?

younes-io commented 10 months ago

@shahules786 I'm afraid I can't share that since it's private data.. Basically, I have document chunks (say 2) returned by OpenSearch, which contain the answer to the question. The first document contains the response, the second contains a small portion of the answer. The second document is larger than the first. I'm just wondering if ragas takes into account the ratio of "relevance to the question / length of the context" in its calculations of context_precision..

younes-io commented 10 months ago

@shahules786 : I have tested using the example in ragas docs

So, I used this dataset:

from datasets import load_dataset

fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
fiqa_eval

and here's the result:

	question	contexts	answer	ground_truths	faithfulness	answer_relevancy	context_recall	context_relevancy
0	How to deposit a cheque issued to an associate...	[Just have the associate sign the back and the...	\nThe best way to deposit a cheque issued to a...	[Have the check reissued to the proper payee.J...	1.0	0.938239	0.875	0.058824
1	Can I send a money order from USPS as a business?	[Sure you can. You can fill in whatever you w...	\nYes, you can send a money order from USPS as...	[Sure you can. You can fill in whatever you w...	0.8	0.885277	1.000	0.285714
2	1 EIN doing business under multiple business n...	[You're confusing a lot of things here. Compan...	\nYes, it is possible to have one EIN doing bu...	[You're confusing a lot of things here. Compan...	0.8	0.924754	0.000	0.083333
3	Applying for and receiving business credit	[Set up a meeting with the bank that handles y...	\nApplying for and receiving business credit c...	["I'm afraid the great myth of limited liabili...	1.0	0.899104	0.500	0.333333
4	401k Transfer After Business Closure	[The time horizon for your 401K/IRA is essenti...	\nIf your employer has closed and you need to ...	[You should probably consult an attorney. Howe...	0.6	0.853572	0.000	0.043478

The context_precision is "almost" always equal to zero (or holds a near-zero value).

N.B: in the docs, the context precision is not displayed.

younes-io commented 10 months ago

@shahules786 : sorry for bothering you, is someone from the team / community able to help on this please ? Thank you

shahules786 commented 10 months ago

Hi @younes-io , apologies for the late reply. Can you share your ragas version and LLM used? Also can you try out the same using latest ragas in main ? You can install from source using pip install git+https://github.com/explodinggradients/ragas

shahules786 commented 10 months ago

@younes-io If you're open for a short call, I would love to help in person. Please book a slot here (early next week)

younes-io commented 9 months ago

@shahules786 no worries, I'm also very sorry for the very late reply.. Sure, I'll book a slot!

explodinggradients / ragas

How to interpret the combination of metrics: context precision and the rest (real world example) #308