confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/
Apache License 2.0

Getting `JSONDecodeError` error when running simple `AnswerRelevancyMetric` example #499

Closed ymzayek closed 6 months ago

ymzayek commented 6 months ago

Describe the bug
I'm getting a `JSONDecodeError` when trying to run a simple example based on one of your examples, but using a custom model with the prometheus-13b LLM as the evaluator. It seems to come from https://github.com/confident-ai/deepeval/blob/68d1f59b3aa23d184e7a3b1ae731ce126ab89863/deepeval/utils.py#L98

Below I provide reproducible code, the print of the `jsonStr` variable, and the full traceback.

To Reproduce
Steps to reproduce the behavior:

Reproducible code:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from transformers import AutoModelForCausalLM, AutoTokenizer
from deepeval.models.base import DeepEvalBaseLLM
import json

# name of model to use as the LLM evaluator
# model_name = "kaist-ai/prometheus-7b-v1.0"
model_name = "kaist-ai/prometheus-13b-v1.0"


# Derived from https://docs.confident-ai.com/docs/metrics-introduction#mistral-7b-example
class CustomModel(DeepEvalBaseLLM):
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def _call(self, prompt: str) -> str:
        model = self.load_model()
        device = "cuda"  # the device to load the model onto
        model_inputs = self.tokenizer([prompt], return_tensors="pt").to(device)
        model.to(device)
        generated_ids = model.generate(
            **model_inputs, max_new_tokens=100, do_sample=True
        )
        return self.tokenizer.batch_decode(generated_ids)[0]

    def get_model_name(self):
        return "Custom model"


# create custom model from custom class
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
custom_model = CustomModel(model=model, tokenizer=tokenizer)

# Output of LLM application
actual_output = "We offer a 30-day full refund at no extra costs."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = [
    "All customers are eligible for a 30 day full refund at no extra costs."
]
input = "What if these shoes don't fit?"

metric = AnswerRelevancyMetric(
    threshold=0.5,
    model=custom_model,
    include_reason=True,
)
test_case = LLMTestCase(
    input=input,
    actual_output=actual_output,
    retrieval_context=retrieval_context,
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
```

Additional context

Print of the `jsonStr` variable:

```
{
    "statements": ["Shoes.", "Shoes can be refunded at no extra cost", "Thanks for asking the question!"]
}
===== END OF EXAMPLE ======

Text:
We offer a 30-day full refund at no extra costs.

**
IMPORTANT: Please make sure to only return in JSON format, with the "statements" key as a list of strings. No words or explaination is needed.
**

JSON:
{
    "statements": ["We", "offer", "a", "30-day", "full", "refund", "at", "no", "extra", "costs"]
}
```
Full traceback:

```bash
Traceback (most recent call last):
  File "/home/yasmin/micromamba/envs/deepeval/lib/python3.11/site-packages/deepeval/utils.py", line 98, in trimAndLoadJson
    return json.loads(jsonStr)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/yasmin/micromamba/envs/deepeval/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yasmin/micromamba/envs/deepeval/lib/python3.11/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 4 column 1 (char 110)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/yasmin/ia/llm-eval/deepeval/tests/evaluate.py", line 81, in <module>
    metric.measure(test_case)
  File "/home/yasmin/micromamba/envs/deepeval/lib/python3.11/site-packages/deepeval/metrics/answer_relevancy.py", line 46, in measure
    self.statements: List[str] = self._generate_statements(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yasmin/micromamba/envs/deepeval/lib/python3.11/site-packages/deepeval/metrics/answer_relevancy.py", line 119, in _generate_statements
    data = trimAndLoadJson(res)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/yasmin/micromamba/envs/deepeval/lib/python3.11/site-packages/deepeval/utils.py", line 100, in trimAndLoadJson
    raise ValueError(
ValueError: Error: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.
```
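For context on the "Extra data" part (a minimal sketch, not the actual deepeval code): `json.loads` parses the first JSON object in the string and errors out as soon as anything follows it, which is exactly what happens when the model emits more than just the JSON.

```python
import json

# Two things in one string: a JSON object followed by trailing text.
output = '{"statements": ["Shoes."]}\n===== END OF EXAMPLE ======'
try:
    json.loads(output)
except json.JSONDecodeError as exc:
    print(exc)  # Extra data: line 2 column 1 (char 27)
```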
AndresPrez commented 6 months ago

@ymzayek I'm getting a similar error... Notice your statements...

In my case I was using Google's Text Bison model with temperature 0.0. I noticed that the statements generated from deepeval's prompt don't always come out as expected: the answer sometimes gets split into a very long list of "statements". When that happens, the model tries to generate an equally long output JSON, but "cheaper" models tend to have a max output token limit, so the generated JSON gets truncated and deepeval's code fails to parse it.
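A minimal sketch of that truncation failure mode (illustrative only; the strings below are made up, not taken from actual model output):

```python
import json

# A complete JSON object parses fine...
complete = '{"statements": ["Shoes can be refunded at no extra cost."]}'
print(json.loads(complete))

# ...but output cut off at the model's max-token limit does not,
# which surfaces as a JSONDecodeError inside deepeval.
truncated = '{"statements": ["Shoes can be refunded at no extra'
try:
    json.loads(truncated)
except json.JSONDecodeError as exc:
    print(exc)  # e.g. "Unterminated string starting at: line 1 column 17 (char 16)"
```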

penguine-ip commented 6 months ago

hey @ymzayek thanks for bringing this up. As cautioned in the docs, using a custom evaluation model is extremely risky: for example, your Prometheus model isn't able to reason about what a "statement" is despite being given a very similar example. JSON error or not, the fact that it split the sentence into individual words means the model isn't capable enough for evaluation.

I'm also curious about the choice of Prometheus. Prometheus was trained for evaluation, but for a form-filling paradigm with clear scoring rubrics, which isn't the technique used in the answer relevancy metric. Here's a good read if you want to learn more about it: https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation

Take this more sophisticated and realistic example:

answer = """
Meditation offers a rich tapestry of benefits that touch upon various aspects of well-being. On a mental level, 
it greatly reduces stress and anxiety, fostering enhanced emotional health. This translates to better emotional 
regulation and a heightened sense of overall well-being. Interestingly, the practice of meditation has been around 
for centuries, evolving through various cultures and traditions, which underscores its timeless relevance.

Physically, it contributes to lowering blood pressure and alleviating chronic pain, which is pivotal for long-term health. 
Improved sleep quality is another significant benefit, aiding in overall physical restoration. Cognitively, meditation is a 
boon for enhancing attention span, improving memory, and slowing down age-related cognitive decline. Amidst these benefits, 
meditation's role in cultural and historical contexts is a fascinating side note, though not directly related to its health benefits.

Such a comprehensive set of advantages makes meditation a valuable practice for individuals seeking holistic improvement i
n both mental and physical health, transcending its historical and cultural origins.
"""

one = """
Meditation is an ancient practice, rooted in various cultural traditions, where individuals 
engage in mental exercises like mindfulness or concentration to promote mental clarity, emotional 
calmness, and physical relaxation. This practice can range from techniques focusing on breath, visual 
imagery, to movement-based forms like yoga. The goal is to bring about a sense of peace and self-awareness, 
enabling individuals to deal with everyday stress more effectively.
"""

two = """
One of the key benefits of meditation is its impact on mental health. It's widely used as a tool to 
reduce stress and anxiety. Meditation helps in managing emotions, leading to enhanced emotional health. 
It can improve symptoms of anxiety and depression, fostering a general sense of well-being. Regular practice 
is known to increase self-awareness, helping individuals understand their thoughts and emotions more clearly 
and reduce negative reactions to challenging situations.
"""

three = """
Meditation has shown positive effects on various aspects of physical health. It can lower blood pressure, 
reduce chronic pain, and improve sleep. From a cognitive perspective, meditation can sharpen the mind, increase 
attention span, and improve memory. It's particularly beneficial in slowing down age-related cognitive decline and 
enhancing brain functions related to concentration and attention.
"""

def test_answer_relevancy():
    metric = AnswerRelevancyMetric(threshold=0.5)
    test_case = LLMTestCase(
        input="What are the primary benefits of meditation?",
        actual_output=answer,
        retrieval_context=[one, two, three],
    )
    assert_test(test_case, [metric])

The GPT models are able to extract statements like these:

Meditation offers a rich tapestry of benefits that touch upon various aspects of well-being.

On a mental level, it greatly reduces stress and anxiety, fostering enhanced emotional health.

This translates to better emotional regulation and a heightened sense of overall well-being.

Interestingly, the practice of meditation has been around for centuries, evolving through various cultures and traditions, which underscores its timeless relevance.

Such a comprehensive set of advantages makes meditation a valuable practice for individuals seeking holistic improvement in both mental and physical health, transcending its historical and cultural origins.

Custom models often find this challenging to carry out.
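A quick way to sanity-check a candidate evaluation model before wiring it into a metric (a rough sketch; `statement_extraction_prompt` is a placeholder for whatever prompt you send, not a deepeval API):

```python
import json

def looks_like_sentences(statements: list[str]) -> bool:
    # Rough heuristic: real statements should be multi-word sentences,
    # not single tokens like "We", "offer", "a", ...
    return all(len(s.split()) >= 3 for s in statements)

raw = custom_model._call(statement_extraction_prompt)
data = json.loads(raw)  # raises if the model didn't return valid JSON
if not looks_like_sentences(data["statements"]):
    print("The evaluator split the text into words; it's probably not capable enough.")
```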

ymzayek commented 6 months ago

Ok, thanks for the explanation. I was coming to the same conclusion, but had misunderstood Prometheus a bit and also wanted to run a quick test of an evaluation pipeline. Do you have any recommendations for models outside of GPT if I want to specifically focus on evaluating a chatbot that uses RAG?

penguine-ip commented 6 months ago

@ymzayek I used to recommend Mistral 7B, until more and more users started reporting problems with generating valid JSON outputs. I can't guarantee Llama 2 either; I've heard you need some fine-tuning to get it to work better. What's the main concern with OpenAI? If it's data security, there's the option of Azure OpenAI.
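If Azure is an option, a custom model wrapper along these lines should work (a minimal sketch assuming the `langchain_openai` package; the API version, endpoint, deployment, and key values below are placeholders):

```python
from langchain_openai import AzureChatOpenAI
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models.base import DeepEvalBaseLLM


class AzureOpenAIModel(DeepEvalBaseLLM):
    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def _call(self, prompt: str) -> str:
        # AzureChatOpenAI returns a message object; the text lives in .content
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    def get_model_name(self):
        return "Azure OpenAI model"


# Placeholder configuration -- fill in your own Azure resource details.
azure_llm = AzureChatOpenAI(
    openai_api_version="2023-07-01-preview",
    azure_deployment="your-deployment-name",
    azure_endpoint="https://your-resource.openai.azure.com/",
    openai_api_key="your-api-key",
)
custom_model = AzureOpenAIModel(model=azure_llm)

metric = AnswerRelevancyMetric(threshold=0.5, model=custom_model)
```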

Here's an example of how someone else managed to get their open-source model to output JSON. I'm not so familiar with it, but I'm going to link it in case you find it helpful: https://christophergs.com/blog/ai-engineering-evaluation-with-deepeval-and-open-source-models#simple (scroll down a little to "A more complex example")

We have a discord by the way, come join us: https://discord.com/invite/a3K9c8GRGt

ymzayek commented 6 months ago

I have made it work with the blog example you gave, @penguine-ip, thanks! The GEval metric also seems to work well with a custom model (tested with Mixtral-8x7B).

penguine-ip commented 6 months ago

@ymzayek I'm glad it worked out. We'll be pushing more improvements to GEval by the end of the week as well :)

penguine-ip commented 5 months ago

Hey @ymzayek, I was wondering if you could show me your Mistral example? I've had a few folks asking how to use grammar files with Mistral 7B, but I only have the example for Llama.

ymzayek commented 5 months ago

@penguine-ip my code is pretty much the same as the last code block in the link you sent me. I'm using "TheBloke/Mistral-7B-Instruct-v0.2-GGUF" (file "mistral-7b-instruct-v0.2.Q5_K_M.gguf") for the custom model, and this JSON grammar file from llama.cpp to constrain the output to JSON.

See snippet:

Python script. Replace the input and actual_output given to `LLMTestCase` with your own data, put `json.gbnf` in the path, and this should run as is:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.models.base import DeepEvalBaseLLM
from llama_cpp import Llama, LlamaGrammar


class CustomModel(DeepEvalBaseLLM):
    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def load_grammar(self):
        file_path = "./json.gbnf"
        with open(file_path, "r") as handler:
            content = handler.read()
        return LlamaGrammar.from_string(content)

    def _call(self, prompt: str) -> str:
        model = self.load_model()
        response = model.create_completion(
            prompt, max_tokens=-1, grammar=self.load_grammar()
        )
        return response["choices"][0]["text"]

    def get_model_name(self):
        return "Custom model"


model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
filename = "mistral-7b-instruct-v0.2.Q5_K_M.gguf"
llm = Llama.from_pretrained(
    repo_id=model_name,
    filename=filename,
    n_ctx=4096,
    n_batch=8,
    chat_format="mistral-instruct",
)
custom_model = CustomModel(model=llm)

coherence_metric = GEval(
    name="Coherence",
    model=custom_model,
    criteria="Coherence - determine if the actual output is coherent with the input.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)
test_case = LLMTestCase(
    input=data["input"].iloc[0],
    actual_output=data["output"].iloc[0],
)
coherence_metric.measure(test_case)
coherence = coherence_metric.score
```

I haven't found a way to make it work reliably without using llama.cpp, and even with the llama.cpp setup it doesn't work reliably with every model. Mistral 7B Instruct and Mistral 7B should both work fine, but with bigger models it tends to fail sometimes.
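One thing that helps with the occasional failures (a rough sketch on my end, not anything built into deepeval) is to retry and only accept output that actually parses as JSON:

```python
import json


def call_with_retries(model, prompt: str, max_attempts: int = 3) -> dict:
    # Hypothetical helper: call the evaluator up to max_attempts times and
    # return the first response that parses as JSON, instead of failing on
    # the first bad generation.
    last_error = None
    for _ in range(max_attempts):
        raw = model._call(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
    raise ValueError(f"Evaluator never returned valid JSON: {last_error}")
```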

penguine-ip commented 5 months ago

@ymzayek thank you so much! It's very helpful. GEval also got improved, and we now use async instead of threading, so your segmentation fault shouldn't be there anymore. Have a nice day.