explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

Failed to parse output. #1228

Open g-hano opened 2 weeks ago

g-hano commented 2 weeks ago

Describe the bug

Local LLMs either raise a timeout error or fail to parse the output.

Ragas version: 0.1.15
Python version: 3.11.3

Code to Reproduce

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
import pandas as pd
df = pd.read_csv("output.csv", sep=";")

data_samples = {
    'question': df['question'].tolist(),
    'answer': df['answer'].tolist(),
    'contexts': df['contexts'].apply(lambda x: [x] if isinstance(x, str) else x).tolist(),
    'ground_truth': df['ground_truth'].tolist()
}

from datasets import Dataset 
dataset = Dataset.from_dict(data_samples)

from ragas import evaluate
from ragas.metrics import (faithfulness, 
                           answer_correctness,    
                           answer_relevancy,
                           context_recall,
                           context_precision)

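from langchain_community.llms.huggingface_endpoint import HuggingFaceEndpoint
# Imports for the chat wrapper and embeddings used below (missing from the original snippet)
from langchain_community.chat_models.huggingface import ChatHuggingFace
from langchain_community.embeddings import HuggingFaceEmbeddings

end = HuggingFaceEndpoint(repo_id="mistralai/Mistral-7B-Instruct-v0.3", max_new_tokens=512)
huggingface_llm = ChatHuggingFace(llm=end, tokenizer=tokenizer)
huggingface_embeddings = HuggingFaceEmbeddings(model_name="nomic-ai/nomic-embed-text-v1.5")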

metrics=[faithfulness, 
        answer_correctness,    
        answer_relevancy,
        context_recall,
        context_precision]

score = evaluate(dataset=dataset,
        metrics=metrics,
        llm=huggingface_llm,
        embeddings=huggingface_embeddings,
        raise_exceptions=False
)

Error trace

Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Exception raised in Job[304]: ValidationError(2 validation errors for ContextPrecisionVerifications
__root__ -> 0 -> reason
  field required (type=value_error.missing)
__root__ -> 0 -> verdict
  field required (type=value_error.missing))
Failed to parse output. Returning None.
Exception raised in Job[444]: ValidationError(2 validation errors for ContextPrecisionVerifications
__root__ -> 0 -> reason
  field required (type=value_error.missing)
__root__ -> 0 -> verdict
  field required (type=value_error.missing))
Exception raised in Job[169]: ValidationError(2 validation errors for ContextPrecisionVerifications
__root__ -> 0 -> reason
  field required (type=value_error.missing)
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Exception raised in Job[309]: ValidationError(2 validation errors for ContextPrecisionVerifications
__root__ -> 0 -> reason
  field required (type=value_error.missing)
__root__ -> 0 -> verdict
  field required (type=value_error.missing))
Failed to parse output. Returning None.
Exception raised in Job[174]: ValidationError(2 validation errors for ContextPrecisionVerifications
__root__ -> 0 -> reason
  field required (type=value_error.missing)
__root__ -> 0 -> verdict
  field required (type=value_error.missing))
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Exception raised in Job[449]: ValidationError(2 validation errors for ContextPrecisionVerifications
__root__ -> 0 -> reason
  field required (type=value_error.missing)
__root__ -> 0 -> verdict
  field required (type=value_error.missing))
Failed to parse output. Returning None.
Exception raised in Job[179]: ValidationError(2 validation errors for ContextPrecisionVerifications
__root__ -> 0 -> reason
  field required (type=value_error.missing)
__root__ -> 0 -> verdict
  field required (type=value_error.missing))
Failed to parse output. Returning None.
Exception raised in Job[314]: ValidationError(2 validation errors for ContextPrecisionVerifications
__root__ -> 0 -> reason
  field required (type=value_error.missing)
__root__ -> 0 -> verdict
  field required (type=value_error.missing))
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Exception raised in Job[184]: ValidationError(2 validation errors for ContextPrecisionVerifications
__root__ -> 0 -> reason
  field required (type=value_error.missing)
__root__ -> 0 -> verdict
  field required (type=value_error.missing))
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Exception raised in Job[454]: ValidationError(2 validation errors for ContextPrecisionVerifications
__root__ -> 0 -> reason
  field required (type=value_error.missing)
__root__ -> 0 -> verdict
  field required (type=value_error.missing))
Failed to parse output. Returning None.
Exception raised in Job[461]: ClientResponseError(429, message='Too Many Requests', url=URL('https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3'))
Exception raised in Job[196]: ClientResponseError(429, message='Too Many Requests', url=URL('https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3'))
Exception raised in Job[462]: ClientResponseError(429, message='Too Many Requests', url=URL('https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3'))

Additional context

Only answer_correctness is evaluated; the scores for all other metrics are NaN.

dosubot[bot] commented 2 weeks ago

To resolve the "Failed to parse output" and "ValidationError" issues when using the evaluate function from the ragas library with the HuggingFaceEndpoint and Mistral-7B-Instruct-v0.3 model, you can follow these steps:

  1. Ensure JSON Schema Compliance: Make sure that the output from the model strictly adheres to the JSON schema expected by the PydanticOutputParser. The get_json_format_instructions function provides a template for the expected JSON format. Ensure that the model's output matches this format.

  2. Use the FIX_OUTPUT_FORMAT Prompt: The RagasoutputParser class has a mechanism to retry parsing the output if it fails initially. It uses the FIX_OUTPUT_FORMAT prompt to instruct the model to correct its output format. Ensure that this mechanism is properly utilized.

  3. Increase max_retries: If the output often fails to parse correctly, consider increasing the max_retries parameter in the aparse method to give the model more chances to correct its output.

Here is an example of how you might adjust the RagasoutputParser class to handle these issues:

class RagasoutputParser(PydanticOutputParser):
    async def aparse(  # type: ignore
        self, result: str, prompt: PromptValue, llm: BaseRagasLLM, max_retries: int = 3  # Increase retries
    ):
        try:
            output = super().parse(result)
        except OutputParserException:
            if max_retries != 0:
                p_value = FIX_OUTPUT_FORMAT.format(
                    prompt=prompt.to_string(), completion=result
                )
                output = await llm.generate(p_value)
                result = output.generations[0][0].text
                return await self.aparse(result, prompt, llm, max_retries - 1)
            else:
                logger.warning("Failed to parse output. Returning None.")
                return None
        return output

By increasing the max_retries parameter, you give the model more opportunities to correct its output format, which can help mitigate the "Failed to parse output" and "ValidationError" issues [1].

Additionally, the configuration and parameters passed to the evaluate function in the ragas library can affect output parsing and the model's responses. They influence how the evaluation is conducted and how the results are parsed and returned [2][3].

For the ContextPrecisionVerifications class, the expected JSON schema is:

{
  "__root__": [
    {
      "reason": "string",
      "verdict": 0 or 1
    }
  ]
}

Each ContextPrecisionVerification object contains the fields reason (a string) and verdict (an integer, 0 or 1) [4].
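To see why a given completion triggers the "field required" errors in the trace, it can help to check the raw model output against this shape before ragas parses it. The following is a minimal, standard-library sketch; `check_verifications` is a hypothetical helper, not part of ragas, and accepting both a bare list and a list wrapped under `"__root__"` is an assumption made for illustration.

```python
import json


def check_verifications(raw: str) -> list[str]:
    """Return a list of problems found in a raw model output (empty if it looks valid).

    Hypothetical helper for debugging; it is not part of ragas.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    # Assumption: accept either a bare list or a list wrapped under "__root__".
    if isinstance(data, dict) and "__root__" in data:
        data = data["__root__"]
    if not isinstance(data, list):
        return ["expected a list of verification objects"]

    problems = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"item {i}: expected an object")
            continue
        if not isinstance(item.get("reason"), str):
            problems.append(f"item {i}: 'reason' missing or not a string")
        verdict = item.get("verdict")
        # type(...) is int excludes booleans, which Python treats as ints.
        if type(verdict) is not int or verdict not in (0, 1):
            problems.append(f"item {i}: 'verdict' missing or not 0/1")
    return problems


if __name__ == "__main__":
    print(check_verifications('[{"reason": "supported by context", "verdict": 1}]'))
    print(check_verifications("[{}]"))
```

An empty object inside the list reproduces the situation in the trace, where item 0 lacks both `reason` and `verdict`.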

Lastly, regarding the ClientResponseError with a 429 status code, it indicates that you are hitting the rate limits of the HuggingFace API. You might need to implement rate limiting and retry logic to handle these errors gracefully. Unfortunately, I couldn't find specific details on how the HuggingFaceEndpoint class handles these errors within the repository.
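As a rough illustration of the retry logic mentioned above, here is a generic exponential-backoff sketch using only the standard library. `RateLimitError` and `call_with_backoff` are hypothetical names introduced for this example; in a real setup you would catch whatever exception your client raises for a 429 (the trace shows a `ClientResponseError` with status 429). This is not how HuggingFaceEndpoint or ragas handle rate limits internally.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 'Too Many Requests' error."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel jobs do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky_request():
        # Simulated endpoint: rate-limited twice, then succeeds.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise RateLimitError()
        return "ok"

    print(call_with_backoff(flaky_request, base_delay=0.01))
```

Backoff only spaces out retries; if many evaluation jobs run concurrently, lowering the number of parallel requests is likely needed as well to stay under the API's limit.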
