confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/
Apache License 2.0

Error while using the most popular open-source chat models #872

Open adkakne opened 1 month ago

adkakne commented 1 month ago

Describe the bug
I encounter 'ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.' while using the most popular open-source chat models in the DeepEval framework.

To Reproduce

"""
This script attempts to use popular open source LLMs in DeepEval framework. 

1. Custom LLM class - navigate to section 'Mistral 7b example' 
    in https://docs.confident-ai.com/docs/metrics-introduction.
    NOTE - performed minor modifications to make it work for any open-source LLM. 
    NOTE - Opensource LLMs tried are - 
        "Qwen/Qwen2-72B-Instruct"  (#1 on HF leaderboard for chat models)
        "meta-llama/Meta-Llama-3-70B-Instruct" (#2 on HF leaderboard for chat models)
        "mistralai/Mixtral-8x7B-Instruct-v0.1" 
        "mistralai/Mistral-7B-v0.1" (recommended by DeepEval docs)
        "microsoft/Phi-3-medium-4k-instruct" (#5 on HF leaderboard for chat models)

2. Unit test is taken from -
    https://github.com/confident-ai/deepeval?tab=readme-ov-file#writing-your-first-test-case
    NOTE - set model param in test_case function to the open-source LLM of user's choice. 
"""

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase
from huggingface_hub import login
import os 
from transformers import AutoModelForCausalLM, AutoTokenizer

# Wrapper class exposing a Hugging Face transformers model as a DeepEval evaluation LLM
class CustomLLM(DeepEvalBaseLLM):
    def __init__(self, name, model, tokenizer):
        self.name = name
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        model = self.load_model()
        device = "cuda"  # device to move the tokenized prompt onto
        model_inputs = self.tokenizer([prompt], return_tensors="pt").to(device)
        generated_ids = model.generate(**model_inputs, max_new_tokens=100)
        # NOTE - batch_decode returns the full sequence (prompt + completion) including special tokens
        return self.tokenizer.batch_decode(generated_ids)[0]

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return self.name 

# step 0 - user input
cuda_available_devices = "<add your visible devices>"
cache_dir="<add your cache dir>"
hf_token = "<add your HF token>"

# step 1 - set environment variables
os.environ['CUDA_VISIBLE_DEVICES'] = cuda_available_devices
os.environ['HF_HOME'] = cache_dir
login(token=hf_token, add_to_git_credential=True)

# step 2 - run the unit test case with the open-source LLM of your choice
def wrap_hf_api(name, cache_dir):
    model = AutoModelForCausalLM.from_pretrained(name,
                                                 device_map="auto",
                                                 cache_dir=cache_dir)
    tokenizer = AutoTokenizer.from_pretrained(name, cache_dir=cache_dir)
    llm = CustomLLM(name=name, model=model, tokenizer=tokenizer)
    return llm

def test_case():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    llm = wrap_hf_api("mistralai/Mistral-7B-v0.1", cache_dir)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5, model=llm)
    assert_test(test_case, [answer_relevancy_metric])

Expected behavior
The test should run and the metric should produce a score. Instead, after running deepeval test run test_chatbot.py, I get ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model. for all five open-source models I tested.
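A quick way to see what the metric is actually being asked to parse is to call the wrapper directly and print the raw output. This is a hypothetical debugging snippet, not part of the reproduction above, and the JSON-style prompt string is made up for illustration:

# Hypothetical debugging snippet: inspect the raw text the metric would have to parse.
llm = wrap_hf_api("mistralai/Mistral-7B-v0.1", cache_dir)
raw = llm.generate('Reply only with a JSON object of the form {"statements": ["..."]}.')
print(raw)  # base (non-instruct) checkpoints often echo the prompt and add free-form text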


Additional context
DeepEval version is 0.21.64.

Arietemag commented 1 month ago

This issue is caused by expecting JSON output from the LLM. The prompt template instructs the LLM to return JSON, but you cannot expect the model to return valid JSON every time; even stronger models like GPT-4 sometimes make mistakes. Contributors could try to solve this by integrating, for example, Instructor (https://github.com/jxnl/instructor/) into DeepEval.
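For illustration, a custom wrapper can at least reduce the failure rate before any library integration: apply the tokenizer's chat template (for instruction-tuned checkpoints), decode only the newly generated tokens, and extract the first JSON object from the completion. This is a minimal sketch under those assumptions, not DeepEval's or Instructor's actual mechanism; the class name JSONFriendlyLLM is hypothetical, it assumes a transformers version that provides apply_chat_template, and note that max_new_tokens=100 in the reproduction above can also truncate the JSON mid-object.

import json
import re

from deepeval.models.base_model import DeepEvalBaseLLM

class JSONFriendlyLLM(DeepEvalBaseLLM):
    def __init__(self, name, model, tokenizer):
        self.name = name
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        model = self.load_model()
        # Wrap the metric prompt in the model's chat template (instruction-tuned checkpoints only).
        input_ids = self.tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(model.device)
        generated_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
        # Decode only the newly generated tokens, dropping the echoed prompt and special tokens.
        completion = self.tokenizer.decode(
            generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
        )
        # Best-effort extraction of the first JSON object; fall back to the raw completion.
        match = re.search(r"\{.*\}", completion, re.DOTALL)
        if match:
            try:
                json.loads(match.group(0))
                return match.group(0)
            except json.JSONDecodeError:
                pass
        return completion

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return self.name

Even with this, valid JSON is not guaranteed, which is why a retry loop or a structured-output library such as Instructor remains the more robust fix.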

engineerbharath12 commented 1 month ago

@penguine-ip Could you please take a look at this issue?