confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/
Apache License 2.0

retrieval_context #483

Open Jiajing-Chen opened 7 months ago

Jiajing-Chen commented 7 months ago

Hi, I'm working with the LLMTestCase example:

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Substitute this with the actual output from your LLM tool
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

I observed that retrieval_context is specified as a list containing a single string. This leads me to ask:

  1. Is it possible to expand this single-string list into a list containing several strings?
  2. Why is retrieval_context designed to accept a list of strings? What relationship should the strings within retrieval_context have to one another?
penguine-ip commented 7 months ago

@Jiajing-Chen it is possible, and we can add support for list[list[str]] if that's what you're talking about. But can I ask why you need a nested list of strings?

Retrieval context is a list of strings because that is what a retriever in a RAG pipeline outputs: a list of text chunks that is passed to your LLM as context.
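
For example, a retriever that returns several chunks maps directly onto the test case, one string per chunk. A minimal sketch (the extra chunk texts are invented):

from deepeval.test_case import LLMTestCase

# Each string in retrieval_context is one chunk returned by the retriever,
# in retrieval order.
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=[
        "All customers are eligible for a 30 day full refund at no extra costs.",
        "Refunds are processed within 5-7 business days.",
        "Exchanges are free for loyalty program members.",
    ],
)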

Jiajing-Chen commented 7 months ago

Hi Jeffrey @penguine-ip, thanks for your reply. For the first question, I was asking whether we can modify:

retrieval_context=["string1"]

to

retrieval_context=["string1", "string2", ..., " string_n"]

As you mentioned:

Retrieval context is a list of strings since that is what a retriever in a RAG pipeline outputs.

I tried to increase the length of retrieval_context from 1 to n (2 for now) without modifying any other part, but I got an error:

File "/Users/jiajingc/anaconda3/lib/python3.11/site-packages/deepeval/metrics/faithfulness.py", line 56, in measure self.truths: List[str] = future_truths.result() ^^^^^^^^^^^^^^^^^^^^^^ File "/Users/jiajingc/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 456, in result return self.get_result() ^^^^^^^^^^^^^^^^^^^ File "/Users/jiajingc/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 401, in get_result raise self._exception File "/Users/jiajingc/anaconda3/lib/python3.11/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/jiajingc/anaconda3/lib/python3.11/site-packages/deepeval/metrics/faithfulness.py", line 122, in _generate_truths data = json.loads(json_output) ^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/jiajingc/anaconda3/lib/python3.11/json/init.py", line 346, in loads return _default_decoder.decode(s) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/jiajingc/anaconda3/lib/python3.11/json/decoder.py", line 337, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/jiajingc/anaconda3/lib/python3.11/json/decoder.py", line 355, in raw_decode raise JSONDecodeError("Expecting value", s, err.value) from None json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

So I am just wondering whether I also need to change other parts so that their lengths match the length of the retrieval context?

Thanks, Jiajing Chen

penguine-ip commented 7 months ago

@Jiajing-Chen I see, it seems like the problem is that your LLM isn't outputting valid JSON, which is causing an error during decoding. Which evaluation model are you using, and what is the value of n?

I would also advise trying shorter contexts to see how it goes.

You can definitely make it ["string1", ..., "string_n"].
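
As a debugging aid, a minimal sketch (assuming FaithfulnessMetric with the standard measure() call; the chunk strings are placeholders) that grows the context list one chunk at a time and catches the decode error:

import json

from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

chunks = ["context1", "context2"]  # placeholder retrieved chunks
metric = FaithfulnessMetric(model="gpt-4-1106-preview")

# Grow retrieval_context one chunk at a time to find the failing length.
for n in range(1, len(chunks) + 1):
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=chunks[:n],
    )
    try:
        metric.measure(test_case)
        print(f"n={n}: score={metric.score}")
    except json.JSONDecodeError as e:
        print(f"n={n}: evaluation model returned invalid JSON: {e}")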

Jiajing-Chen commented 7 months ago

Hi @penguine-ip, thanks for your explanation. I am using "gpt-4-1106-preview", and for now n is 2. I am testing these two alternatives:

retrieval_context=["context1"], #works well

and

retrieval_context = ["context1", "context2"], #json.decoder.JSONDecodeError:

to make sure that the contexts are short enough to be handled.

penguine-ip commented 7 months ago

That's super weird, I'll reproduce it and then get back to you. Thanks for flagging, @Jiajing-Chen

penguine-ip commented 7 months ago

@Jiajing-Chen replied on discord

omarjaz commented 6 months ago

Hi! I am trying to compute the Contextual Relevancy metric by passing several retrieved contexts (retrieval_context = ["context1", "context2"]). I wonder why it returns only one score. Shouldn't it return two numbers (one score per context)? Thanks in advance!

penguine-ip commented 6 months ago

hey @omarjaz, we're looking at the overall relevancy of the retrieval context, so we don't score each retrieved context individually: https://docs.confident-ai.com/docs/metrics-contextual-relevancy#how-is-it-calculated

I guess to answer your question, it's because it fits the pattern of the other metrics. If you're looking for a score that considers each individual node, I would highly recommend contextual precision. It also takes note of the position of each node, and the score will be higher if relevant nodes are ranked nearer the top. Docs: https://docs.confident-ai.com/docs/metrics-contextual-precision#how-is-it-calculated
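
A minimal sketch of the suggested metric (assuming the documented ContextualPrecisionMetric import; note it also requires expected_output, and the example values are invented):

from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase

metric = ContextualPrecisionMetric(model="gpt-4-1106-preview")

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    expected_output="You can return the shoes within 30 days for a full refund.",
    # Node order matters: relevant chunks ranked nearer the top raise the score.
    retrieval_context=[
        "All customers are eligible for a 30 day full refund at no extra costs.",
        "Our shoes come in sizes 5 through 13.",
    ],
)

metric.measure(test_case)
print(metric.score, metric.reason)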

omarjaz commented 6 months ago

Many thanks! So, Contextual Precision was what I was looking for! However, I am trying to compute it on a JSON dataset and I can't, because a retrieval_context_key_name parameter is not available when loading the JSON file as an EvaluationDataset:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()

dataset.add_test_cases_from_json_file(
    # file_path is the absolute path to your .json file
    file_path="example.json",
    input_key_name="query",
    actual_output_key_name="actual_output",
    expected_output_key_name="expected_output",
    context_key_name="context",
)

How can I address this problem?

penguine-ip commented 6 months ago

https://docs.confident-ai.com/docs/evaluation-datasets#from-json

Updated in latest version @omarjaz
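
Per the linked docs, the loader now accepts a retrieval context key alongside the others. A hedged sketch (retrieval_context_key_name follows the naming pattern of the existing parameters; the JSON keys are placeholders):

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()

# example.json is assumed to contain a list of objects such as:
# [{"query": "...", "actual_output": "...", "expected_output": "...",
#   "retrieval_context": ["chunk 1", "chunk 2"]}]
dataset.add_test_cases_from_json_file(
    file_path="example.json",
    input_key_name="query",
    actual_output_key_name="actual_output",
    expected_output_key_name="expected_output",
    retrieval_context_key_name="retrieval_context",
)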

omarjaz commented 6 months ago

Thanks! I did not see that update. By the way, will all of these LLM-Eval metrics work fine for Spanish texts? If not, is it possible to change the prompt of the LLM evaluation model (like this one: https://github.com/confident-ai/deepeval/blob/main/deepeval/metrics/faithfulness/template.py)?

penguine-ip commented 6 months ago

@omarjaz hey, just saw this message. It will work with Spanish texts, but I'm not sure how good it will be (since I don't know Spanish, I can't do much on this front).
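
For anyone who wants to experiment with localized prompts, one hedged sketch is to patch the template class from the linked file before running the metric (generate_claims is an assumption about that file's contents; verify the real method name and signature first):

# Sketch only: patch the faithfulness prompt template to add a language hint.
# `generate_claims` is assumed to be a static method on FaithfulnessTemplate,
# based on the linked template.py; check the actual class before relying on this.
from deepeval.metrics.faithfulness.template import FaithfulnessTemplate

_original_generate_claims = FaithfulnessTemplate.generate_claims

def generate_claims_es(text: str) -> str:
    # Reuse the original English prompt, appending a note about Spanish input.
    prompt = _original_generate_claims(text)
    return prompt + "\n\nNote: the text above is in Spanish; analyze it as-is."

FaithfulnessTemplate.generate_claims = staticmethod(generate_claims_es)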