explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

Getting error "Failed to parse output. Returning None" on faithfulness metric #1150

Open ableiweiss opened 2 months ago

ableiweiss commented 2 months ago

I am getting the error "Failed to parse output. Returning None" on the faithfulness metric for some inputs. The behavior is inconsistent: the same input sometimes works and sometimes fails. I don't see this error on any of the other metrics, and I get it regardless of the LLM I use (Mixtral, Llama, etc.)

Any solutions to this issue?

dosubot[bot] commented 2 months ago

The error "Failed to parse output. Returning None" when using the faithfulness metric is likely due to the _faithfulness_output_parser failing to parse the output from the LLM. This can happen if the output format does not match the expected JSON structure.

To address this issue, you can try the following solutions:

  1. Increase max_retries: The max_retries parameter controls how many times the system will attempt to parse the output. Increasing this value might help in cases where the output is inconsistently formatted.

  2. Check Output Format: Ensure that the output from the LLM strictly follows the expected JSON format. You can add more robust error handling and logging to capture and analyze the outputs that fail to parse.

  3. Adjust Reproducibility: The _reproducibility parameter should be an odd number so that the ensembled verdicts cannot tie; make sure it is not set to an even value.

Here is the relevant part of the Faithfulness metric's _ascore method, showing where max_retries is used and where errors are logged when parsing fails:

class Faithfulness(MetricWithLLM):
    # ... other methods and properties ...

    async def _ascore(
        self: t.Self, row: t.Dict, callbacks: Callbacks, is_async: bool
    ) -> float:
        assert self.llm is not None, "LLM is not set"

        p_value = self._create_statements_prompt(row)
        statements = await self.llm.generate(
            p_value,
            callbacks=callbacks,
            is_async=is_async,
        )
        statements = await _statements_output_parser.aparse(
            statements.generations[0][0].text, p_value, self.llm, self.max_retries
        )

        if statements is None:
            logger.error("Failed to parse statements output. Returning NaN.")
            return np.nan

        statements = [item["simpler_statements"] for item in statements.dicts()]
        statements = [item for sublist in statements for item in sublist]

        assert isinstance(statements, t.List), "statements must be a list"

        p_value = self._create_nli_prompt(row, statements)
        nli_result = await self.llm.generate(
            p_value,
            callbacks=callbacks,
            is_async=is_async,
            n=self._reproducibility,
        )

        nli_result_text = [
            nli_result.generations[0][i].text for i in range(self._reproducibility)
        ]
        faithfulness_list = [
            await _faithfulness_output_parser.aparse(
                text, p_value, self.llm, self.max_retries
            )
            for text in nli_result_text
        ]

        faithfulness_list = [
            faith.dicts() for faith in faithfulness_list if faith is not None
        ]

        if faithfulness_list:
            faithfulness_list = ensembler.from_discrete(
                faithfulness_list,
                "verdict",
            )

            faithfulness_list = StatementFaithfulnessAnswers.parse_obj(
                faithfulness_list
            )
        else:
            logger.error("Failed to parse faithfulness output. Returning NaN.")
            return np.nan

        return self._compute_score(faithfulness_list)

By increasing max_retries and adding error logging, you can better understand and potentially mitigate the parsing issues [1].
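
If you want to try a higher retry count (and an odd reproducibility value) without modifying the library source, here is a minimal sketch of configuring the shared metric instance; it assumes both attributes can be set directly on the instance, which may vary by ragas version:

# Sketch: configure the shared `faithfulness` metric before calling evaluate().
# Setting the private `_reproducibility` attribute directly is an assumption
# based on the code above; keep it odd, as noted in point 3.
from ragas.metrics import faithfulness

faithfulness.max_retries = 3        # allow a few more re-parse attempts
faithfulness._reproducibility = 3   # number of NLI generations; must be odd

# Then pass the configured metric to evaluate(dataset, metrics=[faithfulness]).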

To continue talking to Dosu, mention @dosu.

ableiweiss commented 2 months ago

The issue is that faithfulness_list is empty, so it returns np.nan. The list nli_result_text looks fine; are there any particular requirements for the prompt that might cause this issue?

dosubot[bot] commented 2 months ago

The faithfulness_list being empty indicates that the _faithfulness_output_parser.aparse method is returning None for all items in nli_result_text. This can happen if the parsing fails or if the LLM does not generate any valid faithfulness outputs [1].

To address this, ensure the following:

  1. Output Format: Verify that the output from the LLM strictly follows the expected format. The nli_result_text should be a list of strings, each containing the text generated by the LLM [1].

  2. Prompt Requirements: Ensure that the prompt generated by _create_nli_prompt is correctly formatted. The prompt should include the context and statements in a way that the LLM can process and generate the expected output. The PromptValue object should be correctly formatted with the context and statements [1].

  3. Error Handling and Logging: Add more robust error handling and logging to capture and analyze the outputs that fail to parse. This can help identify if there are specific patterns or issues with the generated outputs.

The _ascore method shown in the previous comment already logs an error at both parsing steps; beyond that, it helps to capture the raw generations that fail to parse, as in the sketch below.

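A minimal sketch of such a logging wrapper (the helper name aparse_with_logging and the truncation length are illustrative; the aparse signature matches the calls shown earlier):

# Sketch: wrap the parser call so that any generation that fails to parse is
# logged for inspection. `logger` is the module-level logger already used in
# _faithfulness.py; the helper name is illustrative, not part of ragas.
async def aparse_with_logging(parser, text, p_value, llm, max_retries):
    result = await parser.aparse(text, p_value, llm, max_retries)
    if result is None:
        # The parser expects items with "statement", "reason" and "verdict"
        # fields; log the raw text so malformed outputs can be inspected.
        logger.error("Unparsable NLI output (truncated): %r", text[:500])
    return result

# Inside _ascore, the list comprehension would then call the wrapper:
# faithfulness_list = [
#     await aparse_with_logging(
#         _faithfulness_output_parser, text, p_value, self.llm, self.max_retries
#     )
#     for text in nli_result_text
# ]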

By ensuring the prompt is correctly formatted and adding error logging, you can better understand and potentially mitigate the parsing issues [1].

To continue talking to Dosu, mention @dosu.

huangxuyh commented 2 months ago

same question

ableiweiss commented 2 months ago

I see that the last statement in nli_result_text gets cut off and does not contain a reason or verdict field. The rest of the statements look good. Is there a way to just ignore the erroneous statements and still use the rest for the metric?

jjmachan commented 2 months ago

@ableiweiss what is the context window of the models you are using? and how big are the models you tested?

ableiweiss commented 2 months ago

The context window is 8k; I tried Llama-3-70B-Instruct and Mixtral-8x7B. I didn't have issues with the other metrics.

Is there a particular open source LLM or context window you recommend to alleviate this?

jjmachan commented 2 months ago

@ableiweiss maybe try Llama 3.1, which has a 128K context window?

The first thing is to figure out whether this is a context-overflow problem, i.e. whether the prompt actually fits in the context window. If you are using a tracing tool that should be easy to check (I'm guessing), or you can use this to help you out: https://docs.ragas.io/en/stable/howtos/applications/cost.html
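
If you are not using a tracing tool, a quick way to check for overflow is to count the tokens of a rendered prompt with the model's own tokenizer. A rough sketch, where the Hugging Face model id and the 8k limit are assumptions based on this thread:

# Sketch: check whether a rendered prompt fits in an 8k context window.
# The model id is illustrative; use the tokenizer matching your deployment.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")

def fits_in_context(prompt_text: str, limit: int = 8192) -> bool:
    n_tokens = len(tokenizer(prompt_text)["input_ids"])
    print(f"prompt tokens: {n_tokens} / {limit}")
    return n_tokens <= limit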

cpolcino commented 1 month ago

@dosu can you give the path to _faithfulness_output_parser.aparse?

dosubot[bot] commented 1 month ago

Yes, you can ignore the erroneous statements that do not contain a reason or verdict field and still use the rest for the faithfulness metric. You can filter out such statements before computing the score. Here is an example of how you can modify the _compute_score method to achieve this:

def _compute_score(self, answers: StatementFaithfulnessAnswers):
    # Filter out statements that do not contain a reason or verdict
    valid_answers = [
        answer for answer in answers.__root__
        if answer.reason and answer.verdict is not None
    ]

    # check the verdicts and compute the score
    faithful_statements = sum(
        1 if answer.verdict else 0 for answer in valid_answers
    )
    num_statements = len(valid_answers)
    if num_statements:
        score = faithful_statements / num_statements
    else:
        logger.warning("No valid statements were generated from the answer.")
        score = np.nan

    return score

This code filters out any StatementFaithfulnessAnswer objects that do not have a reason or verdict before calculating the faithfulness score [1].

As for the path: _faithfulness_output_parser is defined in src/ragas/metrics/_faithfulness.py:

_faithfulness_output_parser = RagasoutputParser(
    pydantic_object=StatementFaithfulnessAnswers
)

This parser is used to process the results generated by the language model (LLM) and convert them into a structured format that can be further analyzed [1].
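
For reference, here is a simplified sketch of the models this parser validates against, based on the fields referenced in _compute_score above (the real definitions in _faithfulness.py may differ slightly):

# Simplified sketch (pydantic v1-style, as used by ragas at the time) of the
# models the parser targets; field names follow the _compute_score code above.
import typing as t
from pydantic import BaseModel

class StatementFaithfulnessAnswer(BaseModel):
    statement: str  # the simplified statement being judged
    reason: str     # the model's justification for the verdict
    verdict: int    # 1 if the statement follows from the context, else 0

class StatementFaithfulnessAnswers(BaseModel):
    __root__: t.List[StatementFaithfulnessAnswer]

# If a generated item is missing "reason" or "verdict", validation fails,
# aparse returns None, and the metric ends up returning NaN.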

To continue talking to Dosu, mention @dosu.

shahfasal commented 1 week ago

I have encountered a similar issue (using the Mixtral model) where I get the error Failed to parse output. Returning None on the faithfulness metric for some inputs. This error occurs inconsistently: sometimes it works, sometimes it doesn't, even with the same input. Additionally, the issue appears regardless of the LLM used (Mixtral, Llama, etc.).

My findings and solution: After troubleshooting, I identified that the root cause lies in missing fields in the JSON output from the model, specifically the verdict field. When the model is unable to infer a judgment clearly, it might return incomplete output, causing the parser to fail.

To fix this, I found that modifying the prompt to explicitly ensure that both the reason and verdict fields are included in the model's output solves the problem. Here’s the change I implemented:

Updated prompt in _faithfulness.py (see NLI_STATEMENTS_MESSAGE):

instruction="Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context. Ensure that each statement includes both a reason and a verdict. Do not omit any fields."
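
If you would rather not patch the installed package, the same instruction can likely be applied at runtime by overriding the prompt on the metric instance. This is only a sketch; it assumes the Faithfulness metric exposes the NLI prompt object as nli_statements_message with an instruction attribute, so verify against your installed version:

# Sketch: apply the stricter instruction without editing _faithfulness.py.
# The attribute name `nli_statements_message` is an assumption; check the
# Faithfulness definition in your ragas version before relying on it.
from ragas.metrics import faithfulness

faithfulness.nli_statements_message.instruction = (
    "Your task is to judge the faithfulness of a series of statements based on "
    "a given context. For each statement you must return verdict as 1 if the "
    "statement can be directly inferred based on the context or 0 if the "
    "statement can not be directly inferred based on the context. Ensure that "
    "each statement includes both a reason and a verdict. Do not omit any fields."
)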

jjmachan commented 6 days ago

@shahfasal thanks a lot for sharing your solution 🙌🏽 @shahules786 do you think we should update the instruction to add this too?

shahfasal commented 6 days ago

@jjmachan @shahules786 if you're okay with it, I can create a PR. I'd love to contribute!