explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

Error "Runner in Executor raised an exception" in ragas evaluate using Ollama, giving NaN values in the DataFrame #1099

Open · divrajput opened this issue 2 months ago

divrajput commented 2 months ago

I encountered an issue while evaluating a dataset with the ragas library, using a LangChain-wrapped Ollama LLM and Sentence Transformer embeddings. The process raises an exception during execution and the resulting DataFrame contains NaN values. Steps to reproduce:

# Read the test results and wrap each context string in a list,
# since ragas expects "contexts" to be a list of strings per row.
import pandas as pd
from datasets import Dataset

df = pd.read_csv("test_results_csv/test.csv")
df["contexts"] = df["contexts"].apply(lambda x: [x])
dataset = Dataset.from_pandas(df)
dataset[0]

Output:

{
    'question': 'What are the major sources of carbohydrates in the traditional Hawaiian diet?',
    'ground_truth': "The traditional Hawaiian diet was rich in carbohydrate sources primarily derived from 'uala (sweet potato), ulu (breadfruit), and kalo (taro). These foods were not only staple items but also provided a significant portion of the daily caloric intake. Sweet potatoes, breadfruit, and taro were cultivated extensively and formed the backbone of the Hawaiian nutritional intake, ensuring that the population had a steady and reliable source of energy. The high carbohydrate content of these foods supported the physical demands of daily activities and agricultural work.",
    'answer': ' The majority of the diet was made up of these fiber rich carbohydrate foods.',
    'contexts': ["['•  Describe the different types of simple and complex  carbohydrates  •  Describe the process of carbohydrate digestion and  absorption  •  Describe the functions of carbohydrates in the  body  •  Describe the body’s carbohydrate needs and how  personal choices can lead to health benefits or  consequences  Throughout history, carbohydrates have and continue to be a major  source of people’s diets worldwide. In ancient Hawai‘i the Hawaiians  obtained the majority of their calories from carbohydrate rich plants  like the ‘uala (sweet potato), ulu (breadfruit) and kalo (taro). For  example, mashed kalo or poi was a staple to meals for Hawaiians.  Research suggests that almost 78 percent of the diet was made up  of these fiber rich carbohydrate foods.1  Carbohydrates are the perfect nutrient to meet your body’s  nutritional needs. They nourish your brain and nervous system,  provide energy to all of your cells when within proper caloric limits,  and help keep your body fit and lean.', 'body fit and lean. Specifically, digestible  carbohydrates provide bulk in foods, vitamins, and minerals, while  1.\\xa0Fujita R, Braun KL, Hughes CK. (2004). The traditional  Hawaiian diet: a review of the literature. Pacific Health  Dialogue, 11(2). http:/ [/pacifichealthdialog.org.fj/](https://file+.vscode-resource.vscode-cdn.net/pacifichealthdialog.org.fj/) Volume2011/no2/ PHD1120220p2162022120Yamada20orig.pdf. Accessed  October 19, 2017.  230  |  Introduction']"]
}

Setting Up the Environment:

from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

from langchain_community.llms import Ollama

llm = Ollama(model="mistrallite_Q2_K:latest", temperature=0)
embeddings = SentenceTransformerEmbeddings(model_name='all-MiniLM-L6-v2')
langchain_llm = LangchainLLMWrapper(llm)
langchain_embeddings = LangchainEmbeddingsWrapper(embeddings)

import nest_asyncio
nest_asyncio.apply()
from ragas import evaluate
from ragas.run_config import RunConfig

from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)

result = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
    raise_exceptions=False,
    llm=langchain_llm,
    embeddings=langchain_embeddings
)

res_df = result.to_pandas()

Error

Runner in Executor raised an exception
Traceback (most recent call last):
  File "/Users/xxx/miniconda3/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/Users/xxx/miniconda3/lib/python3.12/site-packages/ragas/metrics/_context_recall.py", line 169, in _ascore
    results = await self.llm.generate(
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxx/miniconda3/lib/python3.12/site-packages/ragas/llms/base.py", line 93, in generate
    return await agenerate_text_with_retry(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxx/miniconda3/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 189, in async_wrapped
    return await copy(fn, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxx/miniconda3/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 111, in __call__
    do = await self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxx/miniconda3/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 153, in iter
    result = await action(retry_state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxx/miniconda3/lib/python3.12/site-packages/tenacity/_utils.py", line 99, in inner
    return call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxx/miniconda3/lib/python3.12/site-packages/tenacity/__init__.py", line 398, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
dosubot[bot] commented 2 months ago

Hey there, @divrajput! I'm here to help you with the issue you've encountered. Let's work together to solve this bug and get things back on track!

To handle the exception and the resulting NaN values while evaluating with the LangChain LLM and Sentence Transformer embeddings, set the raise_exceptions parameter to False in the evaluate function. The evaluation will then return np.nan for any row that fails instead of raising an exception.

Here is the relevant part of the code where you can set this parameter:

result = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
    raise_exceptions=False,  # Set this to False to handle exceptions gracefully
    llm=langchain_llm,
    embeddings=langchain_embeddings
)

By setting raise_exceptions to False, the evaluation will handle exceptions gracefully and return np.nan for any rows that encounter issues [1].
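
If it is not obvious which rows failed, the NaN scores can be filtered out of the result DataFrame after the run. A minimal sketch; the column names follow the metric names used above:

res_df = result.to_pandas()

# Rows where any metric came back as NaN, i.e. samples that failed.
metric_cols = ["context_precision", "context_recall", "faithfulness", "answer_relevancy"]
failed_rows = res_df[res_df[metric_cols].isna().any(axis=1)]
print(failed_rows[["question"] + metric_cols])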

Additionally, ensure that your RunConfig is properly set up to handle retries and timeouts, which can help mitigate transient errors:

from ragas.run_config import RunConfig

run_config = RunConfig(
    max_retries=10,
    max_wait=60,
    thread_timeout=300,
    log_tenacity=True
)

result = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
    raise_exceptions=False,
    llm=langchain_llm,
    embeddings=langchain_embeddings,
    run_config=run_config  # Pass the run_config here
)

This configuration will help manage retries and timeouts more effectively [2][1].

To continue talking to Dosu, mention @dosu.

divrajput commented 2 months ago

While evaluating my dataset, I am encountering NaN values for the metrics faithfulness and answer_relevancy. However, I am successfully getting values for context_precision and context_recall. I have increased the run configuration values to ensure the process has sufficient retries and wait times:

run_config = RunConfig(
    max_retries=15,
    max_wait=120,
    thread_timeout=500,
    log_tenacity=True
)
Output:

| question | ground_truth | answer | contexts | context_precision | context_recall | faithfulness | answer_relevancy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| What are the major sources of carbohydrates in... | The traditional Hawaiian diet was rich in carb... | The majority of the diet was made up of these... | [['• Describe the different types of simple a... | 1.0 | 1.0 | NaN | 0.552834 |
| Why are carbohydrates considered essential for... | Carbohydrates are essential for the human body... | Carbohydrates provide a ready source of energ... | [['Protein Necessary for tissue formation, ce... | 1.0 | 1.0 | NaN | NaN |

Request: Please advise on how to resolve the NaN values for the faithfulness and answer_relevancy metrics.

dosubot[bot] commented 2 months ago

The NaN values for the faithfulness and answer_relevancy metrics are likely due to issues in the statement generation or parsing process. Specifically, the _create_statements_prompt and _create_nli_prompt methods, or the LLM's response to these prompts, might be causing the problem.

Here are a few steps to debug and resolve the issue:

  1. Check the LLM's Response: Ensure that the LLM is generating valid responses for the prompts created in _create_statements_prompt and _create_nli_prompt (see the logging sketch after this list).

  2. Verify Statement Parsing: Ensure that the statements generated by the LLM are correctly parsed by _statements_output_parser and _faithfulness_output_parser.

  3. Inspect the Prompts: Verify that the prompts created in _create_statements_prompt and _create_nli_prompt are correctly formatted and contain the necessary information.
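
Before patching any metric code, the first two checks can often be done just by turning up logging. A minimal sketch, assuming ragas routes its warnings and parser errors through Python's standard logging module:

import logging

# Show ragas' own warnings (e.g. failed statement generation or JSON parsing)
# and, together with log_tenacity=True in the RunConfig, the retry attempts.
logging.basicConfig(level=logging.INFO)
logging.getLogger("ragas").setLevel(logging.DEBUG)

With this in place, the warnings these metrics emit when a parser returns None should show up in the console instead of the row silently becoming NaN.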

Here is a modified version of the _ascore method with additional logging to help debug the issue:

async def _ascore(
    self: t.Self, row: t.Dict, callbacks: Callbacks, is_async: bool
) -> float:
    """
    returns the NLI score for each (q, c, a) pair
    """
    assert self.llm is not None, "LLM is not set"

    # Create statements prompt
    p_value = self._create_statements_prompt(row)
    statements = await self.llm.generate(
        p_value,
        callbacks=callbacks,
        is_async=is_async,
    )
    statements = await _statements_output_parser.aparse(
        statements.generations[0][0].text, p_value, self.llm, self.max_retries
    )

    if statements is None:
        logger.warning("No statements generated by the LLM.")
        return np.nan

    statements = [item["simpler_statements"] for item in statements.dicts()]
    statements = [item for sublist in statements for item in sublist]

    assert isinstance(statements, t.List), "statements must be a list"

    # Create NLI prompt
    p_value = self._create_nli_prompt(row, statements)
    nli_result = await self.llm.generate(
        p_value,
        callbacks=callbacks,
        is_async=is_async,
        n=self._reproducibility,
    )

    nli_result_text = [
        nli_result.generations[0][i].text for i in range(self._reproducibility)
    ]
    faithfulness_list = [
        await _faithfulness_output_parser.aparse(
            text, p_value, self.llm, self.max_retries
        )
        for text in nli_result_text
    ]

    faithfulness_list = [
        faith.dicts() for faith in faithfulness_list if faith is not None
    ]

    if faithfulness_list:
        faithfulness_list = ensembler.from_discrete(
            faithfulness_list,
            "verdict",
        )

        faithfulness_list = StatementFaithfulnessAnswers.parse_obj(
            faithfulness_list
        )
    else:
        logger.warning("No faithfulness results generated by the LLM.")
        return np.nan

    return self._compute_score(faithfulness_list)
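
One way to try this modified coroutine without reinstalling the package is to bind it onto the faithfulness metric instance before calling evaluate. A rough monkey-patch sketch, assuming the function above is defined somewhere its helper names (logger, _statements_output_parser, ensembler, and so on) resolve; editing ragas/metrics/_faithfulness.py in place is the simpler alternative:

import types

from ragas.metrics import faithfulness

# Keep the original coroutine so it can be restored, then bind the logging
# version of _ascore (defined above) onto the metric instance that evaluate()
# will use.
_original_ascore = faithfulness._ascore
faithfulness._ascore = types.MethodType(_ascore, faithfulness)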

Additionally, for the answer_relevancy metric, ensure that the LLM is correctly generating questions and identifying noncommittal answers. Here is the relevant code snippet for answer_relevancy:

@dataclass
class AnswerRelevancy(MetricWithLLM, MetricWithEmbeddings):
    name: str = "answer_relevancy"
    evaluation_mode: EvaluationMode = EvaluationMode.qac
    question_generation: Prompt = field(default_factory=lambda: QUESTION_GEN)
    strictness: int = 3

    def calculate_similarity(
        self: t.Self, question: str, generated_questions: list[str]
    ):
        assert self.embeddings is not None
        question_vec = np.asarray(self.embeddings.embed_query(question)).reshape(1, -1)
        gen_question_vec = np.asarray(
            self.embeddings.embed_documents(generated_questions)
        ).reshape(len(generated_questions), -1)
        norm = np.linalg.norm(gen_question_vec, axis=1) * np.linalg.norm(
            question_vec, axis=1
        )
        return (
            np.dot(gen_question_vec, question_vec.T).reshape(
                -1,
            )
            / norm
        )

    def _calculate_score(
        self, answers: t.Sequence[AnswerRelevanceClassification], row: t.Dict
    ) -> float:
        question = row["question"]
        gen_questions = [answer.question for answer in answers]
        committal = np.any([answer.noncommittal for answer in answers])
        if all(q == "" for q in gen_questions):
            logger.warning(
                "Invalid JSON response. Expected dictionary with key 'question'"
            )
            score = np.nan
        else:
            cosine_sim = self.calculate_similarity(question, gen_questions)
            score = cosine_sim.mean() * int(not committal)

        return score

    def _create_question_gen_prompt(self, row: t.Dict) -> PromptValue:
        ans, ctx = row["answer"], row["contexts"]
        return self.question_generation.format(answer=ans, context="\n".join(ctx))

    async def _ascore(self, row: t.Dict, callbacks: Callbacks, is_async: bool) -> float:
        assert self.llm is not None, "LLM is not set"

        prompt = self._create_question_gen_prompt(row)
        result = await self.llm.generate(
            prompt,
            n=self.strictness,
            callbacks=callbacks,
            is_async=is_async,
        )

        answers = [
            await _output_parser.aparse(result.text, prompt, self.llm)
            for result in result.generations[0]
        ]
        if any(answer is None for answer in answers):
            return np.nan

        answers = [answer for answer in answers if answer is not None]
        return self._calculate_score(answers, row)

This code includes additional logging to help identify where the process might be failing. If the LLM is not generating valid statements or NLI results, the logs will indicate this, allowing you to further investigate the issue [1][2].

To continue talking to Dosu, mention @dosu.

jjmachan commented 1 month ago

@divrajput it is most likely due to the JSON-following capabilities of the model. Do you have any tracing tools set up?
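
In the absence of tracing, one quick way to test that hypothesis is to ask the same Ollama model for JSON directly and check whether the output parses. A minimal sketch; the prompt text here is only illustrative, not one of ragas' actual prompts:

import json

from langchain_community.llms import Ollama

llm = Ollama(model="mistrallite_Q2_K:latest", temperature=0)
raw = llm.invoke(
    'Respond with only a JSON object of the form {"question": "...", "noncommittal": 0} '
    "for this answer: Carbohydrates provide a ready source of energy."
)
try:
    print(json.loads(raw))  # model followed the JSON instruction
except json.JSONDecodeError:
    print("Model did not return valid JSON:\n", raw)

If the model cannot reliably return valid JSON (heavily quantized builds such as Q2_K often struggle with this), ragas' output parsers return None and the affected metrics come back as NaN; a larger or less aggressively quantized model is worth trying.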