Open noweymik opened 3 months ago
Hi!
I have a question regarding the generation of question sets for the Summarization Score metric. I am working on creating a high-quality summary and need reliable evaluation metrics to assess it, and I have found the Summarization Score metric very useful for checking the quality of my summaries.
However, I am experiencing some volatility in the scores: even though the input is always the same, the generated questions differ each time. Is there a way to ensure that the set of questions remains consistent when the input is the same?
I have an idea related to this issue. If it is not possible to make an LLM generate a consistent set of questions, what do you think about scoring the summary multiple times (n times) and averaging the results to get a more stable score? If you have any other good suggestions, I would greatly appreciate them.
Hi @noweymik,
I'm here to assist you with any questions or issues you may have. Let's work together to address your concerns.
Whether the set of questions stays the same for identical input comes down to how deterministic the question-generation step is. The SummarizationScore
class uses a prompt-based approach to generate questions and answers, so if the input text and keyphrases are the same, and the underlying language model and prompt are unchanged, the generated questions should ideally be consistent.
However, if you observe variability in the generated questions, it might be due to the non-deterministic nature of the language model's generation process. In such cases, averaging the results of multiple runs can help achieve a more stable score.
Here is the relevant part of the code that generates questions:
async def _get_questions(
    self, text: str, keyphrases: list[str], callbacks: Callbacks, is_async: bool
) -> t.List[str]:
    assert self.llm is not None, "LLM is not initialized"
    # Build the question-generation prompt from the input text and keyphrases.
    p_value = self._get_question_generation_prompt(text, keyphrases)
    result = await self.llm.generate(
        prompt=p_value,
        callbacks=callbacks,
        is_async=is_async,
    )
    result_text = result.generations[0][0].text
    # Parse the raw LLM output into a structured list of questions (with retries).
    answer = await _output_parser_question_generation.aparse(
        result_text, p_value, self.llm, self.max_retries
    )
    if answer is None:
        return []
    return answer.questions
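One practical way to reduce this variability, if you construct the evaluator LLM yourself, is to pin it to deterministic generation settings (for example, temperature 0). A minimal sketch, assuming a typical LangChain-based setup; the wrapper, model name, and attachment point below are assumptions, not the only option:

# Sketch: construct the evaluator LLM with deterministic generation settings
# before computing the summarization score. The names used here
# (LangchainLLMWrapper, ChatOpenAI, the model id) are assumptions about a
# common setup; adapt them to whatever you already use.
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(
    ChatOpenAI(
        model="gpt-4o-mini",  # use whichever model you already evaluate with
        temperature=0,        # greedy decoding reduces, but may not eliminate, variability
    )
)
# Attach it to the metric (or pass it to evaluate()) before scoring.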
To summarize, if the input text and keyphrases are consistent, the generated questions should ideally be consistent as well. If you still encounter variability, averaging the results of multiple runs can help stabilize the score [1].
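For example, a rough sketch of the averaging approach; `score_once` here is a stand-in for however you compute a single summarization score in your setup, not the exact ragas API:

# Sketch: average the summarization score over n independent runs to damp
# run-to-run variability. `score_once` is a placeholder for whatever coroutine
# produces one SummarizationScore value in your pipeline (e.g. via evaluate()).
import asyncio
import statistics

async def averaged_summarization_score(score_once, n: int = 5) -> float:
    """Run the (possibly non-deterministic) scoring coroutine n times and average."""
    scores = [await score_once() for _ in range(n)]
    return statistics.mean(scores)

# Example usage with a hypothetical scoring coroutine:
# mean_score = asyncio.run(averaged_summarization_score(my_score_once, n=5))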
To continue talking to Dosu, mention @dosu.
Hey @noweymik, thanks for sharing your thoughts here. The idea of doing multiple runs is right; we have a reproducibility parameter on a few other metrics that does the same. Let me see if I can do something similar with the summarization metric as well. I am curating a list of new metrics to add to ragas in the coming weeks, so if you wish to share some ideas or pain points, please feel free to comment on that as well.
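For reference, this is roughly how the reproducibility idea looks on the metrics that already expose it; whether a given metric accepts this parameter depends on the installed ragas version, so treat the attribute name below as an assumption to verify:

# Sketch: on metrics that support it, `reproducibility` re-runs the LLM
# judgments N times and aggregates internally. The attribute name and its
# availability on `faithfulness` are assumptions; check your ragas version.
from ragas.metrics import faithfulness

faithfulness.reproducibility = 3  # assumed attribute: repeat judgments 3x and aggregate
# The request in this thread is to add the same option to the summarization metric.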
@shahules786 Thank you very much for your comments! I would really appreciate it if you could check whether the reproducibility parameter can be applied to the summarization metric as well, as you mentioned. I also have one more question: is there a way to see which questions were generated each time the summarization score metric is run? I saw that the code obtains keyphrases and questions from the LLM via prompts, but I'm curious whether there is a way to track the questions that get generated.
Hey @noweymik, I added this summarization metric. I will try to fix the reproducibility issue and also get back to you on tracking the generated questions.
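In the meantime, one possible workaround for tracking the generated questions is to attach a LangChain callback handler during evaluation and log every completion the metric requests from the LLM; the output of the question-generation prompt will be among them. A rough sketch, with the `callbacks` wiring assumed rather than taken from the ragas docs:

# Sketch: capture the raw LLM completions (including the generated questions)
# by passing a LangChain callback handler through the evaluation. Whether and
# where a `callbacks` argument is accepted depends on your ragas version.
from langchain_core.callbacks import BaseCallbackHandler

class LLMOutputLogger(BaseCallbackHandler):
    """Collects every completion produced during evaluation."""

    def __init__(self):
        self.completions = []

    def on_llm_end(self, response, **kwargs):
        # `response` is a LangChain LLMResult; flatten its nested generations.
        for gen_list in response.generations:
            for gen in gen_list:
                self.completions.append(gen.text)

logger = LLMOutputLogger()
# results = evaluate(dataset, metrics=[summarization_score], callbacks=[logger])
# print(logger.completions)  # inspect the question-generation outputs per run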
@sky-2002 were you able to look at it? Let us know if you need some resources or help from our side 🙂
Hey, I am keeping track of this. I was busy last week; I will take it up soon.