explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

[R-279] [R-280] Summarization Score Formula is unreasonable #1108

Closed ofooo closed 1 week ago

ofooo commented 1 month ago

I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug
I read the docs for the Summarization Score: https://docs.ragas.io/en/stable/concepts/metrics/summarization_score.html

QA score: higher is better; its range is (0, 1).

But the conciseness score: lower is better; its range is (0, infinity).

So summarization score = (QA score + conciseness score) / 2 is unreasonable.

I think the conciseness score should be converted into a value whose range is (0, 1), with larger being better.
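
For example (made-up numbers, just to illustrate the range mismatch):

qa_score = 0.9           # higher is better, bounded in (0, 1)
conciseness_score = 2.5  # lower is better, unbounded above per the docs

summarization_score = (qa_score + conciseness_score) / 2
print(summarization_score)  # 1.7 -- outside (0, 1), and a worse (larger) conciseness raises the combined score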

Ragas version: stable

R-280

dosubot[bot] commented 1 month ago

Hey @ofooo! I'm here to help you with your bug and answer any questions you have while we wait for a human maintainer. Let's solve this together!

To address the issue with the Summarization Score formula, you can adjust the conciseness score to fit within the 0 to 1 range and ensure that a higher score is better. The current implementation in the Ragas framework already does this. Here is the relevant code:

def _compute_conciseness_score(self, text, summary) -> float:
    """Returns the conciseness score of the summary. This is calculated as
    (1- relative_length_of_summary), where relative_length_of_summary is the
    ratio of the length of the summary to the length of the original text.
    This promotes shorter summaries.
    """
    return 1 - (len(summary) / len(text))

This method calculates the conciseness score as 1 - (len(summary) / len(text)), ensuring that the score is within the 0 to 1 range and that a higher score indicates a more concise summary [1].

Additionally, the Summarization Score is calculated as the average of the QA score and the conciseness score:

def _compute_score(self, scores) -> float:
    return sum(scores) / len(scores)

These scores are combined in the _ascore method:

async def _ascore(self, row: Dict, callbacks: Callbacks, is_async: bool) -> float:
    text: str = "\n".join(row["contexts"])
    summary: str = row["summary"]
    keyphrases = await self._extract_keyphrases(text, callbacks, is_async)
    questions = await self._get_questions(text, keyphrases, callbacks, is_async)
    answers = await self._get_answers(questions, summary, callbacks, is_async)

    scores = []
    qa_score = self._compute_qa_score(answers)
    scores.append(qa_score)
    if self.length_penalty:
        conciseness_score = self._compute_conciseness_score(text, summary)
        scores.append(conciseness_score)
    return self._compute_score(scores)

This ensures that both the QA score and the conciseness score are within the 0 to 1 range, making the Summarization Score meaningful and balanced [1].

To continue talking to Dosu, mention @dosu.

shahules786 commented 1 month ago

Hey @ofooo good point. I will take care of it this week.

sky-2002 commented 3 weeks ago

Hey @ofooo, you are absolutely correct. The code is correct; I missed this in the docs, and I have updated them. But yes, we need to handle the case where the summary length is greater than the text length; maybe we can straightaway assign a score of 0 there.
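
A minimal sketch of that edge-case handling, assuming the _compute_conciseness_score method shown above (the clamp is illustrative, not the merged fix):

def _compute_conciseness_score(self, text, summary) -> float:
    """Returns the conciseness score of the summary, clamped to [0, 1].
    A summary longer than the original text gets a score of 0."""
    # max(..., 0.0) covers the edge case where len(summary) > len(text),
    # which would otherwise make the score negative.
    return max(0.0, 1 - (len(summary) / len(text)))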

shahules786 commented 3 weeks ago

Thanks for pitching in @sky-2002, and thanks for pointing out the issue @ofooo. @sky-2002, can you fix that edge case in the summarisation metric? Another point I would like to make: summaries can be hard to score in general - even humans struggle with it. In coming versions we will add support for ranking-based metrics (it's easier to rank such tasks).

sky-2002 commented 3 weeks ago

@shahules786, interesting points. We can discuss these in the PR I created above or on Discord; there are more points to discuss on summarization. @ofooo, you can also join the discussion in the PR; suggestions are welcome.

shahules786 commented 3 weeks ago

Hey @ofooo @sky-2002, an easy and intuitive fix for this is to modify the conciseness score as:

conciseness_score = min(length of summary, length of context) / (length of context + 1), thereby mapping it into the range (0, 1).

Then, since a lower conciseness score is better, we take (1 - conciseness_score) when combining it with the QA score.

So the final score will be [QA score + (1 - conciseness_score)] / 2.

One more suggestion: here we have assumed the QA and conciseness scores carry equal weight, but ideally the user should be able to control that. Adding an extra argument coeff (in the range 0-1), the score would be:

score = coeff * QA score + (1 - coeff) * (1 - conciseness_score)

How does that look, guys?
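
A sketch of what that proposal could look like (the function name, argument names, and the default coeff = 0.5 are assumptions for illustration, not the merged implementation):

def summarization_score(qa_score: float, summary_len: int, context_len: int, coeff: float = 0.5) -> float:
    """Combine the QA score with a bounded conciseness score.

    conciseness = min(summary_len, context_len) / (context_len + 1) lies in (0, 1);
    since lower conciseness is better, (1 - conciseness) enters the combination.
    coeff weights the QA score; coeff = 0.5 reproduces the plain average above.
    """
    conciseness = min(summary_len, context_len) / (context_len + 1)
    return coeff * qa_score + (1 - coeff) * (1 - conciseness)

# e.g. summarization_score(0.8, 120, 1000) == 0.5 * 0.8 + 0.5 * (1 - 120/1001) ≈ 0.84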

sky-2002 commented 3 weeks ago

Great idea @shahules786. I had proposed this weighting initially, but we didn't want to put too much work on the user side. Anyway, it's good to let the user control it; fixing it right away.