[Closed] younes-io closed this issue 3 months ago
I'm also wondering if this is a "side effect" of the (relatively) long chunks of my docs (around 500 tokens). I don't know if that also impacts the calculation.
@shahules786: could you please take a look at this?
Hi @younes-io , this is an interesting but weird result. Will you be able to share a subset of your data so that I can understand well what's going on?
@shahules786 I'm afraid I can't share that, since it's private data.
Basically, I have document chunks (say 2) returned by OpenSearch, which contain the answer to the question. The first document contains the response, the second contains a small portion of the answer. The second document is larger than the first.
I'm just wondering whether ragas takes the ratio of "relevance to the question / length of the context" into account in its calculation of `context_precision`.
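For reference, the documented formulation of `context_precision` is a verdict-weighted mean of precision@k over the ranked chunks, in which chunk length plays no role at all. Here is a minimal sketch of that formula (my own simplified reimplementation, not the library's actual code):

```python
def context_precision(verdicts):
    """Mean precision@k over the relevant positions.

    verdicts: list of 0/1 flags, one per retrieved chunk in rank order,
    where 1 means the judge LLM deemed the chunk relevant to the question.
    Note that the length of each chunk never enters the computation.
    """
    score = 0.0
    relevant_so_far = 0
    for k, v in enumerate(verdicts, start=1):
        if v:
            relevant_so_far += 1
            score += relevant_so_far / k  # precision@k, counted at each relevant hit
    return score / relevant_so_far if relevant_so_far else 0.0

print(context_precision([1, 1]))  # both chunks judged relevant -> 1.0
print(context_precision([0, 1]))  # only the 2nd chunk relevant -> 0.5
print(context_precision([0, 0]))  # nothing judged relevant    -> 0.0
```

So if the judge LLM flags both of your chunks as irrelevant, the score is exactly 0 regardless of how long or short the chunks are.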
@shahules786: I have tested this using the example in the ragas docs.
So, I used this dataset:
```python
from datasets import load_dataset

fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
fiqa_eval
```
and here's the result:
| | question | contexts | answer | ground_truths | context_precision | faithfulness | answer_relevancy | context_recall | context_relevancy |
|---|---|---|---|---|---|---|---|---|---|
| 0 | How to deposit a cheque issued to an associate... | [Just have the associate sign the back and the... | \nThe best way to deposit a cheque issued to a... | [Have the check reissued to the proper payee.J... | 0.0 | 1.0 | 0.938239 | 0.875 | 0.058824 |
| 1 | Can I send a money order from USPS as a business? | [Sure you can. You can fill in whatever you w... | \nYes, you can send a money order from USPS as... | [Sure you can. You can fill in whatever you w... | 0.0 | 0.8 | 0.885277 | 1.000 | 0.285714 |
| 2 | 1 EIN doing business under multiple business n... | [You're confusing a lot of things here. Compan... | \nYes, it is possible to have one EIN doing bu... | [You're confusing a lot of things here. Compan... | 0.0 | 0.8 | 0.924754 | 0.000 | 0.083333 |
| 3 | Applying for and receiving business credit | [Set up a meeting with the bank that handles y... | \nApplying for and receiving business credit c... | ["I'm afraid the great myth of limited liabili... | 0.0 | 1.0 | 0.899104 | 0.500 | 0.333333 |
| 4 | 401k Transfer After Business Closure | [The time horizon for your 401K/IRA is essenti... | \nIf your employer has closed and you need to ... | [You should probably consult an attorney. Howe... | 0.0 | 0.6 | 0.853572 | 0.000 | 0.043478 |
The `context_precision` is almost always equal to zero (or holds a near-zero value).
N.B.: in the docs, the context precision column is not displayed.
@shahules786: sorry for bothering you; is someone from the team or community able to help with this, please? Thank you.
Hi @younes-io , apologies for the late reply. Can you share your ragas version and LLM used?
Also, can you try out the same using the latest ragas from main? You can install from source using `pip install git+https://github.com/explodinggradients/ragas`.
@younes-io If you're open to a short call, I would love to help in person. Please book a slot here (early next week).
@shahules786 no worries, I'm also very sorry for the very late reply. Sure, I'll book a slot!
I ran ragas to evaluate my LangChain-powered chatbot (it's basically a QA chain with document retrieval) and I got the following results.
Of course, the `context_precision` values (another form of `context_relevancy`, which I think will disappear, according to the docs) are very low (aka horrible). So, I did some debugging to understand the intermediate calculations (I didn't grasp everything, but I've got an idea), and I'm wondering how this situation is possible (this is how I interpret it; correct me if I'm wrong):

- context_recall: 1.00 (can it retrieve all the relevant information required to answer the question: YES)
- context_precision: 0.00 (the signal-to-noise ratio of the retrieved context: almost everything retrieved is noise)
For example, I checked that for one answer, this is how the context precision metric evaluated the 2 retrieved documents:
```
[[ChatGeneration(text='No.', generation_info={'finish_reason': 'stop'}, message=AIMessage(content='No.'))]
```
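If the metric really does reduce each retrieved chunk to a free-text yes/no verdict like the one above, then a single "No." drives the per-chunk score to 0 with no notion of partial relevance. A hypothetical parser illustrating that interpretation (the function name and parsing rule are my own simplification, not ragas internals):

```python
def verdict_to_flag(generation_text: str) -> int:
    """Map a judge LLM's free-text verdict to a binary relevance flag.

    Simplified assumption: the judge answers roughly "Yes" or "No";
    anything that does not start with "yes" counts as irrelevant (0).
    """
    return 1 if generation_text.strip().lower().startswith("yes") else 0

print(verdict_to_flag("No."))   # -> 0
print(verdict_to_flag("Yes."))  # -> 1
```

Under this reading, a chunk that merely contains "a small portion of the answer" can still be judged "No." outright, which would explain a hard 0 for `context_precision` even when `context_recall` is high.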
Yet the faithfulness is 1 and the answer relevancy is 0.81. I'm really confused; maybe I'm missing something, but I'd like to understand how to interpret not only each metric independently, but also the combinations of their values and what they entail.
Thank you,