confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/
Apache License 2.0

Custom LLM Example does not Work #954

Open Otavio-Parraga opened 1 month ago

Otavio-Parraga commented 1 month ago

Describe the bug After setting up the code, I always get an error with the following description:

AttributeError: 'list' object has no attribute 'find'

I tried to run the code on my local server with some modifications, and then on Google Colab, just copying and pasting the website example (https://docs.confident-ai.com/docs/guides-using-custom-llms#creating-a-custom-llm).

Running it on the server leads to the same error. With other metrics (such as SummarizationMetric) I get the same AttributeError: 'list' object has no attribute 'find', but with find replaced by claim.

To Reproduce Google Colab with code: https://colab.research.google.com/drive/1JbzWggqaxMKakQSzuVSYXN1J-96ttjOk?usp=sharing

Expected behavior The library should evaluate the metric.

Screenshots [Screenshot attached: "Captura de Tela 2024-08-14 às 16 58 46"]

Additional context The full error stack trace:


Event loop is already running. Applying nest_asyncio patch to allow async execution...
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/deepeval/metrics/answer_relevancy/answer_relevancy.py](https://localhost:8080/#) in _a_generate_statements(self, actual_output)
    231             try:
--> 232                 res: Statements = await self.model.a_generate(
    233                     prompt, schema=Statements

TypeError: CustomLlama3_8B.a_generate() got an unexpected keyword argument 'schema'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
7 frames
[<ipython-input-7-453899681174>](https://localhost:8080/#) in <cell line: 8>()
      6 
      7 metric = AnswerRelevancyMetric(model=custom_llm)
----> 8 metric.measure(test_case)

[/usr/local/lib/python3.10/dist-packages/deepeval/metrics/answer_relevancy/answer_relevancy.py](https://localhost:8080/#) in measure(self, test_case)
     50             if self.async_mode:
     51                 loop = get_or_create_event_loop()
---> 52                 loop.run_until_complete(
     53                     self.a_measure(test_case, _show_indicator=False)
     54                 )

[/usr/local/lib/python3.10/dist-packages/nest_asyncio.py](https://localhost:8080/#) in run_until_complete(self, future)
     96                 raise RuntimeError(
     97                     'Event loop stopped before Future completed.')
---> 98             return f.result()
     99 
    100     def _run_once(self):

[/usr/lib/python3.10/asyncio/futures.py](https://localhost:8080/#) in result(self)
    199         self.__log_traceback = False
    200         if self._exception is not None:
--> 201             raise self._exception.with_traceback(self._exception_tb)
    202         return self._result
    203 

[/usr/lib/python3.10/asyncio/tasks.py](https://localhost:8080/#) in __step(***failed resolving arguments***)
    230                 # We use the `send` method directly, because coroutines
    231                 # don't have `__iter__` and `__next__` methods.
--> 232                 result = coro.send(None)
    233             else:
    234                 result = coro.throw(exc)

[/usr/local/lib/python3.10/dist-packages/deepeval/metrics/answer_relevancy/answer_relevancy.py](https://localhost:8080/#) in a_measure(self, test_case, _show_indicator)
     85             self, async_mode=True, _show_indicator=_show_indicator
     86         ):
---> 87             self.statements: List[str] = await self._a_generate_statements(
     88                 test_case.actual_output
     89             )

[/usr/local/lib/python3.10/dist-packages/deepeval/metrics/answer_relevancy/answer_relevancy.py](https://localhost:8080/#) in _a_generate_statements(self, actual_output)
    236             except TypeError:
    237                 res = await self.model.a_generate(prompt)
--> 238                 data = trimAndLoadJson(res, self)
    239                 return data["statements"]
    240 

[/usr/local/lib/python3.10/dist-packages/deepeval/metrics/utils.py](https://localhost:8080/#) in trimAndLoadJson(input_string, metric)
    141     input_string: str, metric: Optional[BaseMetric] = None
    142 ) -> Any:
--> 143     start = input_string.find("{")
    144     end = input_string.rfind("}") + 1
    145 

AttributeError: 'list' object has no attribute 'find'
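
The trace shows the failure chain: the metric first calls a_generate(prompt, schema=Statements); since the custom model does not accept a schema keyword, that raises a TypeError, the metric falls back to a_generate(prompt), and the result is handed to trimAndLoadJson, which calls .find("{") and therefore requires a string. If the custom model wraps a transformers text-generation pipeline (as the linked guide's example appears to), the pipeline returns a list of dicts, which explains the AttributeError. A minimal sketch of a custom LLM that returns a plain string instead (assuming a Hugging Face model and tokenizer are already loaded; the class name just mirrors the docs example):

import transformers
from deepeval.models.base_model import DeepEvalBaseLLM

class CustomLlama3_8B(DeepEvalBaseLLM):
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        # build the pipeline once instead of on every call
        self.pipeline = transformers.pipeline(
            "text-generation", model=model, tokenizer=tokenizer
        )

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        # the pipeline returns a list of dicts, e.g. [{"generated_text": "..."}];
        # return the string itself so trimAndLoadJson can call .find("{") on it
        output = self.pipeline(prompt, max_new_tokens=256, do_sample=True)
        return output[0]["generated_text"]

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return "Llama-3 8B"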
EjbejaranosAI commented 1 month ago

Hi @Otavio-Parraga,

I identified some errors in the implementation and made a few modifications to the script. However, after running the code, I noticed that all test results are returning 0. I'm unsure if this issue is due to a configuration problem or something else. Any guidance would be appreciated.

from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer
from deepeval.benchmarks.mmlu.mmlu import MMLU
from deepeval.models.base_model import DeepEvalBaseLLM
login(token="your_hf_token")

class Mistral7B(DeepEvalBaseLLM):

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        # Move the model to the GPU when loading
        return self.model.to("cuda")

    def generate(self, prompt: str) -> str:
        model = self.load_model()

        device = "cuda"
        model_inputs = self.tokenizer([prompt], return_tensors="pt", padding=True, truncation=True).to(device)

        generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
        return self.tokenizer.batch_decode(generated_ids)[0]

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def batch_generate(self, prompts):
        model = self.load_model()
        device = "cuda"

        model_inputs = self.tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to(device)

        generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
        return self.tokenizer.batch_decode(generated_ids)

    def get_model_name(self):
        return "Mistral 7B"

    def __call__(self, prompt: str) -> str:
        return self.generate(prompt)

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

#model = AutoModelForCausalLM.from_pretrained("oshizo/japanese-e5-mistral-1.9b")
#tokenizer = AutoTokenizer.from_pretrained("oshizo/japanese-e5-mistral-1.9b")

# Assign the pad_token as the eos_token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

mistral_7b = Mistral7B(model=model, tokenizer=tokenizer)
print(mistral_7b("Write me a joke"))

benchmark = MMLU()

results = benchmark.evaluate(model=mistral_7b, batch_size=5)
print("Overall Score: ", results)

print("Overall Score:", benchmark.overall_score)

print("Task-specific Scores: ", benchmark.task_scores)
svnv-svsv-jm commented 17 hours ago

Same here. This is still the case. Weird that their own tutorial code does not work...
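
The TypeError path in the trace above can also be avoided by accepting the optional schema keyword that deepeval metrics pass before falling back to the plain call. A hedged sketch (SchemaAwareLLM is a hypothetical name, it subclasses the string-returning sketch earlier in this thread, and pydantic v2 plus a JSON-formatted model response are assumed):

from typing import Optional, Type
from pydantic import BaseModel

class SchemaAwareLLM(CustomLlama3_8B):  # hypothetical subclass of the sketch above
    def generate(self, prompt: str, schema: Optional[Type[BaseModel]] = None):
        raw = super().generate(prompt)
        if schema is None:
            return raw
        # trim to the outermost JSON object (mirroring trimAndLoadJson) and
        # validate it into the requested pydantic schema (pydantic v2 API)
        json_str = raw[raw.find("{"): raw.rfind("}") + 1]
        return schema.model_validate_json(json_str)

    async def a_generate(self, prompt: str, schema: Optional[Type[BaseModel]] = None):
        return self.generate(prompt, schema=schema)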