explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

Can anyone suggest how we can improve the answer correctness metric score on a context-based dataset with question, context, ground truth, and model response? #1233

Closed: Anjalisoni99-byte closed this 2 months ago

Anjalisoni99-byte commented 2 months ago

[ ] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question: What is unclear to you? What would you like to know?

Code Examples: This community speaks code. Share your code snippets to help us understand your question better.

Additional context: Anything else you want to share with us?

dosubot[bot] commented 2 months ago

To improve the answer correctness metric score on a context-based dataset, you can focus on enhancing both the factuality and semantic similarity aspects of the model's responses. Here are some specific steps you can take:

  1. Improve Factuality:

    • Ensure that the model's responses are factually accurate by refining the training data and incorporating more reliable sources.
    • Use the AnswerCorrectnessClassification to identify and reduce false positives (FP) and false negatives (FN) in the model's responses. Factual correctness quantifies the factual overlap between the generated answer and the ground truth answer using concepts like True Positive (TP), False Positive (FP), and False Negative (FN).
  2. Enhance Semantic Similarity:

    • Improve the semantic similarity by fine-tuning the AnswerSimilarity component. This can involve using better embeddings or more sophisticated similarity measures. Semantic similarity measures how closely the generated answer aligns with the ground truth in meaning.
  3. Optimize Weights:

    • Adjust the weights for factuality and semantic similarity to better reflect their importance in your specific use case. The default weights are [0.75, 0.25], but you can experiment with different values. The final answer correctness score is a weighted average of factual similarity and semantic similarity (a small worked example follows this list).
  4. Refine Prompts:

    • Improve the prompts used for generating and evaluating responses. Ensure that the CORRECTNESS_PROMPT and LONG_FORM_ANSWER_PROMPT are clear and comprehensive.
  5. Segment Sentences Effectively:

    • Ensure that the sentence segmentation is accurate by using a reliable sentence_segmenter. This helps in breaking down the text into meaningful units for evaluation.
  6. Context Precision:

    • Implement the ContextPrecision metric to evaluate whether the context provided was useful in arriving at the given answer. This involves verifying the context's relevance and utility in generating the correct answer.
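
To make the weighting in step 3 concrete, here is a small worked example. It assumes the factual component is an F1-style score computed from TP, FP, and FN (the usual description of this metric); the numbers themselves are made up for illustration:

# Hypothetical counts from comparing a generated answer with the ground truth
tp, fp, fn = 4, 1, 2

# F1-style factual score (assumption about how the factual component is aggregated)
factual_score = tp / (tp + 0.5 * (fp + fn))   # 4 / 5.5 ≈ 0.727

# Semantic similarity, e.g. cosine similarity between answer and ground-truth embeddings
semantic_similarity = 0.85

# Default weights: [factuality, semantic similarity]
weights = [0.75, 0.25]
answer_correctness_score = weights[0] * factual_score + weights[1] * semantic_similarity
print(round(answer_correctness_score, 3))  # ≈ 0.758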

Here is a code snippet that shows how you might adjust the weights and refine the prompts, along with the ContextPrecision metric implementation:

# NOTE: excerpted from the ragas metric implementations; the standard-library and
# ragas-internal imports (dataclass, field, typing as t, numpy as np, logging/logger,
# Prompt, MetricWithLLM, EvaluationMode, get_segmenter, etc.) are omitted for brevity.
@dataclass
class AnswerCorrectness(MetricWithLLM, MetricWithEmbeddings):
    name: str = "answer_correctness"
    evaluation_mode: EvaluationMode = EvaluationMode.qga
    correctness_prompt: Prompt = field(default_factory=lambda: CORRECTNESS_PROMPT)
    long_form_answer_prompt: Prompt = field(default_factory=lambda: LONG_FORM_ANSWER_PROMPT)
    weights: list[float] = field(default_factory=lambda: [0.8, 0.2])  # Adjusted weights
    answer_similarity: t.Optional[AnswerSimilarity] = None
    sentence_segmenter: t.Optional[HasSegmentMethod] = None
    max_retries: int = 1

    def __post_init__(self: t.Self):
        if len(self.weights) != 2:
            raise ValueError("Expects a list of two weights. First for factuality, second for semantic similarity")
        if all([w == 0 for w in self.weights]):
            raise ValueError("At least one weight must be non-zero")
        if not all([w >= 0 for w in self.weights]):
            raise ValueError("Weights must be non-negative")

        if self.sentence_segmenter is None:
            language = self.long_form_answer_prompt.language
            self.sentence_segmenter = get_segmenter(language=language, clean=False)

@dataclass
class ContextPrecision(MetricWithLLM):
    name: str = "context_precision"
    evaluation_mode: EvaluationMode = EvaluationMode.qcg
    context_precision_prompt: Prompt = field(default_factory=lambda: CONTEXT_PRECISION)
    max_retries: int = 1
    _reproducibility: int = 1

    @property
    def reproducibility(self):
        return self._reproducibility

    @reproducibility.setter
    def reproducibility(self, value):
        if value < 1:
            logger.warning("reproducibility cannot be less than 1, setting to 1")
            value = 1
        elif value % 2 == 0:
            logger.warning("reproducibility level cannot be set to even number, setting to odd")
            value += 1
        self._reproducibility = value

    def _get_row_attributes(self, row: t.Dict) -> t.Tuple[str, t.List[str], t.Any]:
        return row["question"], row["contexts"], row["ground_truth"]

    def _context_precision_prompt(self, row: t.Dict) -> t.List[PromptValue]:
        question, contexts, answer = self._get_row_attributes(row)
        return [
            self.context_precision_prompt.format(
                question=question, context=c, answer=answer
            )
            for c in contexts
        ]

    def _calculate_average_precision(
        self, verifications: t.List[ContextPrecisionVerification]
    ) -> float:
        score = np.nan
        verdict_list = [1 if ver.verdict else 0 for ver in verifications]
        denominator = sum(verdict_list) + 1e-10
        numerator = sum(
            [
                (sum(verdict_list[: i + 1]) / (i + 1)) * verdict_list[i]
                for i in range(len(verdict_list))
            ]
        )
        score = numerator / denominator
        if np.isnan(score):
            logger.warning(
                "Invalid response format. Expected a list of dictionaries with keys 'verdict'"
            )
        return score

    async def _ascore(
        self: t.Self,
        row: t.Dict,
        callbacks: Callbacks,
    ) -> float:
        assert self.llm is not None, "LLM is not set"

        human_prompts = self._context_precision_prompt(row)
        responses = []
        for hp in human_prompts:
            results = await self.llm.generate(
                hp,
                callbacks=callbacks,
                n=self.reproducibility,
            )
            results = [
                await _output_parser.aparse(item.text, hp, self.llm, self.max_retries)
                for item in results.generations[0]
            ]

            responses.append(
                [result.dict() for result in results if result is not None]
            )

        answers = []
        for response in responses:
            agg_answer = ensembler.from_discrete([response], "verdict")
            if agg_answer:
                agg_answer = ContextPrecisionVerification.parse_obj(agg_answer[0])
            answers.append(agg_answer)

        answers = ContextPrecisionVerifications(__root__=answers)
        score = self._calculate_average_precision(answers.__root__)
        return score

    def adapt(self, language: str, cache_dir: str | None = None) -> None:
        assert self.llm is not None, "LLM is not set"

        logging.info(f"Adapting Context Precision to {language}")
        self.context_precision_prompt = self.context_precision_prompt.adapt(
            language, self.llm, cache_dir
        )

    def save(self, cache_dir: str | None = None) -> None:
        self.context_precision_prompt.save(cache_dir)
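
For completeness, here is a minimal sketch of how these knobs are typically exercised through the public ragas API rather than by editing the metric classes. The import paths, metric names, and column names below follow the 0.1.x-style API and are assumptions that may differ in your version:

from datasets import Dataset

from ragas import evaluate
from ragas.metrics import AnswerCorrectness, context_precision

# Custom weights: 80% factuality, 20% semantic similarity
answer_correctness = AnswerCorrectness(weights=[0.8, 0.2])

# These metrics expect question, answer, contexts, and ground_truth columns
dataset = Dataset.from_dict({
    "question": ["What powers the sun?"],
    "answer": ["The sun is powered by nuclear fusion of hydrogen into helium."],
    "contexts": [["Nuclear fusion in the sun's core fuses hydrogen atoms into helium."]],
    "ground_truth": ["The sun is powered by nuclear fusion, where hydrogen fuses to form helium."],
})

result = evaluate(dataset, metrics=[answer_correctness, context_precision])
print(result)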

By focusing on these areas, you can systematically improve the answer correctness metric score for your context-based dataset [1][2][3].

To continue talking to Dosu, mention @dosu.

Anjalisoni99-byte commented 2 months ago

If we use domain-specific evaluation metrics that assign a score of 1-5, with 1 being least relevant and 5 being most relevant to the query, and then count the queries scoring in the range 1-3, how can we improve answer correctness using this approach?

dosubot[bot] commented 2 months ago

To improve answer correctness for queries that score between 1 and 3 using domain-specific evaluation metrics, you can focus on enhancing the factuality and semantic similarity of the answers. The AnswerCorrectness class in the provided code already measures correctness based on these two aspects. Here’s how you can specifically target and improve the lower-scoring queries:

  1. Identify Low-Scoring Queries: Filter out the queries that have scores between 1 and 3 (a small filtering sketch follows this list).
  2. Analyze Misclassifications: Use the AnswerCorrectnessClassification to identify True Positives (TP), False Positives (FP), and False Negatives (FN) in the answers.
  3. Refine Answer Generation: Adjust the answer generation process to reduce FPs and FNs by ensuring that the answers are more aligned with the ground truth.
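
As a starting point for step 1, here is a short sketch of selecting the queries whose domain-specific rubric score falls in the 1-3 range; the record structure is hypothetical and just stands in for however you store your scores:

# Hypothetical evaluation records: each holds a query and its 1-5 rubric score
records = [
    {"query": "What powers the sun?", "rubric_score": 2},
    {"query": "What is the boiling point of water?", "rubric_score": 5},
]

# Queries to prioritise for answer-correctness improvements
low_scoring = [r for r in records if 1 <= r["rubric_score"] <= 3]
print(low_scoring)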

Here is a code snippet to illustrate how you might implement this:

import asyncio

# Assumes `answer_correctness` is an initialised AnswerCorrectness metric (with its
# LLM and embeddings configured) and that AnswerCorrectnessClassification is
# available from the ragas internals.
def improve_low_scoring_answers(queries, answers, ground_truths, threshold_low=1, threshold_high=3):
    improved_answers = []
    for query, answer, ground_truth in zip(queries, answers, ground_truths):
        # Compute the initial score (_ascore is a coroutine, so it has to be awaited;
        # asyncio.run is used here purely for illustration)
        initial_score = asyncio.run(answer_correctness._ascore({
            "question": query,
            "answer": answer,
            "ground_truth": ground_truth
        }, callbacks=None))

        # NOTE: the thresholds must be on the same scale as the score being compared
        # (answer correctness itself is reported on a 0-1 scale)
        if threshold_low <= initial_score <= threshold_high:
            # Illustrative only: in practice the formatted prompt has to be sent to the
            # LLM and its JSON output parsed into AnswerCorrectnessClassification;
            # parse_obj does not accept a raw prompt
            classification = AnswerCorrectnessClassification.parse_obj(
                answer_correctness.correctness_prompt.format(
                    question=query,
                    ground_truth=ground_truth,
                    answer=answer
                )
            )

            # Improve the answer by adding missed statements (FN) and dropping
            # unsupported ones (FP); copy first so the original answer is not mutated
            improved_answer = list(answer)
            for fn in classification.FN:
                improved_answer.append(fn['statement'])
            for fp in classification.FP:
                if fp['statement'] in improved_answer:
                    improved_answer.remove(fp['statement'])

            improved_answers.append(improved_answer)
        else:
            improved_answers.append(answer)

    return improved_answers

# Example usage
queries = ["What powers the sun and what is its primary function?", "What is the boiling point of water?"]
answers = [
    ["The sun is powered by nuclear fission, similar to nuclear reactors on Earth.", "The primary function of the sun is to provide light to the solar system."],
    ["The boiling point of water is 100 degrees Celsius at sea level"]
]
ground_truths = [
    ["The sun is powered by nuclear fusion, where hydrogen atoms fuse to form helium.", "This fusion process in the sun's core releases a tremendous amount of energy.", "The energy from the sun provides heat and light, which are essential for life on Earth.", "The sun's light plays a critical role in Earth's climate system.", "Sunlight helps to drive the weather and ocean currents."],
    ["The boiling point of water is 100 degrees Celsius (212 degrees Fahrenheit) at sea level.", "The boiling point of water can change with altitude."]
]

improved_answers = improve_low_scoring_answers(queries, answers, ground_truths)
print(improved_answers)

This approach ensures that the answers are iteratively improved by focusing on the specific areas where they fall short, thus enhancing the overall correctness for queries that initially score between 1 and 3. The evaluation of answer correctness involves gauging the accuracy of the generated answer when compared to the ground truth, using a combination of factual and semantic similarity. This is quantified using True Positives (TP), False Positives (FP), and False Negatives (FN), and the final score is a weighted average of these aspects.

Additionally, you can leverage domain-specific evaluation metrics, such as those provided by the ragas framework, to further refine the evaluation process. The ragas framework allows you to define custom rubrics for scoring, which can be tailored to the specific requirements of your domain. This can help in providing more accurate and relevant evaluations, thereby guiding the improvement process more effectively [1][2][3][4].

To continue talking to Dosu, mention @dosu.

Anjalisoni99-byte commented 2 months ago

If we set a threshold of 0.7 and count the queries having a correctness value lower than the threshold as an initial view into improving answer correctness, will that be a good approach?

dosubot[bot] commented 2 months ago

Yes, setting a threshold (e.g., 0.7) and focusing on queries with correctness values below this threshold is a good approach to systematically improve answer correctness. This method allows you to identify and prioritize the queries that need the most attention. Here’s how you can implement this approach:

  1. Set the Threshold: Define a threshold value for correctness (e.g., 0.7).
  2. Filter Low-Scoring Queries: Identify queries with correctness scores below the threshold (see the filtering sketch after this list).
  3. Analyze and Improve: Use the AnswerCorrectnessClassification to analyze True Positives (TP), False Positives (FP), and False Negatives (FN) and refine the answers accordingly.
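
To get that initial view, you can run the evaluation once and filter the per-query scores against the threshold. The sketch below assumes a 0.1.x-style ragas result object with a to_pandas() export and a dataset already prepared with question, answer, contexts, and ground_truth columns; adjust names to your setup:

from ragas import evaluate
from ragas.metrics import answer_correctness

THRESHOLD = 0.7

# `dataset` is assumed to be a datasets.Dataset with the columns listed above
result = evaluate(dataset, metrics=[answer_correctness])
df = result.to_pandas()

# Queries that need attention first
low_scoring = df[df["answer_correctness"] < THRESHOLD]
print(f"{len(low_scoring)} of {len(df)} queries fall below {THRESHOLD}")
print(low_scoring[["question", "answer_correctness"]])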

Here is a code snippet to illustrate this process:

import asyncio

# Same assumptions as in the earlier snippet: `answer_correctness` is an initialised
# AnswerCorrectness metric and AnswerCorrectnessClassification comes from the ragas
# internals.
def improve_low_scoring_answers(queries, answers, ground_truths, threshold=0.7):
    improved_answers = []
    for query, answer, ground_truth in zip(queries, answers, ground_truths):
        # Compute the initial score (_ascore is a coroutine, so it has to be awaited)
        initial_score = asyncio.run(answer_correctness._ascore({
            "question": query,
            "answer": answer,
            "ground_truth": ground_truth
        }, callbacks=None))

        if initial_score < threshold:
            # Illustrative only: the formatted prompt has to be sent to the LLM and its
            # JSON output parsed into AnswerCorrectnessClassification
            classification = AnswerCorrectnessClassification.parse_obj(
                answer_correctness.correctness_prompt.format(
                    question=query,
                    ground_truth=ground_truth,
                    answer=answer
                )
            )

            # Add missed statements (FN) and drop unsupported ones (FP); copy first so
            # the original answer is not mutated
            improved_answer = list(answer)
            for fn in classification.FN:
                improved_answer.append(fn['statement'])
            for fp in classification.FP:
                if fp['statement'] in improved_answer:
                    improved_answer.remove(fp['statement'])

            improved_answers.append(improved_answer)
        else:
            improved_answers.append(answer)

    return improved_answers

# Example usage
queries = ["What powers the sun and what is its primary function?", "What is the boiling point of water?"]
answers = [
    ["The sun is powered by nuclear fission, similar to nuclear reactors on Earth.", "The primary function of the sun is to provide light to the solar system."],
    ["The boiling point of water is 100 degrees Celsius at sea level"]
]
ground_truths = [
    ["The sun is powered by nuclear fusion, where hydrogen atoms fuse to form helium.", "This fusion process in the sun's core releases a tremendous amount of energy.", "The energy from the sun provides heat and light, which are essential for life on Earth.", "The sun's light plays a critical role in Earth's climate system.", "Sunlight helps to drive the weather and ocean currents."],
    ["The boiling point of water is 100 degrees Celsius (212 degrees Fahrenheit) at sea level.", "The boiling point of water can change with altitude."]
]

improved_answers = improve_low_scoring_answers(queries, answers, ground_truths, threshold=0.7)
print(improved_answers)

This approach ensures that you systematically improve the answers by focusing on the specific areas where they fall short, thus enhancing the overall correctness for queries that initially score below the threshold [1][2].

To continue talking to Dosu, mention @dosu.