explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

Can Custom Metrics be added? #1205

Open LDelPinoNT opened 3 weeks ago

LDelPinoNT commented 3 weeks ago

[x] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question: Can I add custom metrics? I didn't find any information related to this.

Additional context: I'm thinking of creating a metric that reflects how many times the input and the response are in the same language.

Thank you!

dosubot[bot] commented 3 weeks ago

Hello @LDelPinoNT! I'm here to help you with any questions or issues you have. Let's get started!

Yes, you can add custom metrics. For your use case, you can create a new metric class that detects the language of both the input and the response and counts how often they match. Here is an example of how you could implement this:

import logging
import typing as t
from dataclasses import dataclass

import numpy as np
from langchain_core.callbacks.base import Callbacks
from langchain_core.pydantic_v1 import BaseModel, Field
from langdetect import detect

from ragas.metrics.base import MetricWithLLM, EvaluationMode

logger = logging.getLogger(__name__)

class LanguageMatchAnswer(BaseModel):
    input_language: str = Field(..., description="Language of the input")
    response_language: str = Field(..., description="Language of the response")
    match: int = Field(..., description="1 if languages match, 0 otherwise")

class LanguageMatchAnswers(BaseModel):
    __root__: t.List[LanguageMatchAnswer]

    def dicts(self) -> t.List[t.Dict]:
        return self.dict()["__root__"]

@dataclass
class LanguageMatch(MetricWithLLM):
    name: str = "language_match"
    evaluation_mode: EvaluationMode = EvaluationMode.qac

    def _detect_language(self, text: str) -> str:
        return detect(text)

    def _compute_score(self, answers: LanguageMatchAnswers):
        matches = sum(answer.match for answer in answers.__root__)
        num_answers = len(answers.__root__)
        if num_answers:
            score = matches / num_answers
        else:
            logger.warning("No answers were provided.")
            score = np.nan
        return score

    async def _ascore(self: t.Self, row: t.Dict, callbacks: Callbacks) -> float:
        input_text = row["question"]
        response_text = row["answer"]

        input_language = self._detect_language(input_text)
        response_language = self._detect_language(response_text)

        match = 1 if input_language == response_language else 0

        answers = LanguageMatchAnswers.parse_obj(
            [{"input_language": input_language, "response_language": response_language, "match": match}]
        )

        return self._compute_score(answers)

language_match = LanguageMatch()

This code defines a new metric LanguageMatch that detects the language of the input and the response and checks if they match. The _compute_score method calculates the proportion of matches.
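
If you want to sanity-check the detection step on its own, langdetect's detect returns ISO 639-1 language codes, so a quick experiment (plain langdetect, independent of ragas) looks like this:

from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is non-deterministic by default; fixing the seed makes runs repeatable

print(detect("What is the capital of France?"))   # expected: "en"
print(detect("¿Cuál es la capital de Francia?"))  # expected: "es"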

Additionally, you can customize which LLMs and embeddings to use for this metric by leveraging the Ragas framework. You can pass any instance of BaseRagasLLM or BaseRagasEmbeddings directly to the metric or the evaluate function. For example:

from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from ragas import evaluate

openai_key = "<your-key-here>"

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=openai_key)
openai_completion = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0.7,
    max_tokens=4000,
    openai_api_key=openai_key
)

# Attach the LLM and embeddings to the metric
language_match.llm = openai_completion
language_match.embeddings = embeddings

# Evaluate using the custom metric; evaluate() also needs the dataset to score (see the sketch below)
results = evaluate(dataset, metrics=[language_match], llm=openai_completion, embeddings=embeddings)

This allows you to use custom LLMs and embeddings for your LanguageMatch metric, ensuring that the evaluation is tailored to your specific needs.
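
If it helps, here is a minimal sketch of what that evaluation dataset could look like. It assumes a Hugging Face Dataset with question, answer and contexts columns (the columns implied by EvaluationMode.qac); the rows are made-up examples, and openai_completion, embeddings and language_match are the objects defined above:

from datasets import Dataset

# Hypothetical toy data; replace with your pipeline's real questions, answers and retrieved contexts
dataset = Dataset.from_dict({
    "question": ["What is the capital of France?", "¿Cuál es la capital de España?"],
    "answer": ["The capital of France is Paris.", "La capital de España es Madrid."],
    "contexts": [["Paris is the capital of France."], ["Madrid es la capital de España."]],
})

results = evaluate(dataset, metrics=[language_match], llm=openai_completion, embeddings=embeddings)
print(results)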

To continue talking to Dosu, mention @dosu.

LDelPinoNT commented 3 weeks ago

@dosu can you explain to me what "t" is in the LanguageMatchAnswers class?

dosubot[bot] commented 3 weeks ago

In the LanguageMatchAnswers class, t is an alias for the typing module, which provides support for type hints in Python. It is commonly used to specify the types of variables, function arguments, and return values.

Here's a breakdown of how t is used in the LanguageMatchAnswers class:

import typing as t  # t is simply a short alias for the standard typing module

class LanguageMatchAnswers(BaseModel):
    __root__: t.List[LanguageMatchAnswer]  # t.List specifies a list of LanguageMatchAnswer objects

    def dicts(self) -> t.List[t.Dict]:  # t.List and t.Dict annotate the return type
        return self.dict()["__root__"]

In this context, t.List and t.Dict are exactly the List and Dict you would otherwise import directly from typing. By using t as an alias for typing, the code stays concise and it is clear at a glance where each type hint comes from.
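
Put differently, the two import styles are interchangeable; the alias simply saves importing each name from typing individually:

import typing as t
from typing import List

numbers_a: t.List[int] = [1, 2, 3]
numbers_b: List[int] = [1, 2, 3]  # identical annotation, different import style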

To continue talking to Dosu, mention @dosu.

LDelPinoNT commented 3 weeks ago

Hey @dosu, can you tell me which modules I should import for the "logger" object and the "Callbacks" annotation in the "LanguageMatch" class?

dosubot[bot] commented 3 weeks ago

You should import the standard logging module for the logger object, and the Callbacks type alias from langchain_core.callbacks.base (it is also re-exported via langchain.callbacks.base) for the callbacks annotation in the LanguageMatch class.

Here are the relevant imports:

import logging
from langchain_core.callbacks.base import Callbacks
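
For reference, a short sketch of how those imports are typically wired up; the module-level logger name is just the usual Python convention, not something ragas requires:

import logging
from langchain_core.callbacks.base import Callbacks

# Module-level logger, used by _compute_score's warning branch
logger = logging.getLogger(__name__)

# Callbacks is only used as a type annotation; ragas passes the actual callbacks
# into _ascore for you at evaluation time:
#     async def _ascore(self, row: t.Dict, callbacks: Callbacks) -> float: ...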


To continue talking to Dosu, mention @dosu.

LDelPinoNT commented 3 weeks ago

For the record, here is the final, fixed code:

from langchain_core.pydantic_v1 import BaseModel, Field
from ragas.metrics.base import MetricWithLLM, EvaluationMode
from ragas.evaluation import RunConfig
from ragas import evaluate
from langchain.callbacks.base import Callbacks
from langdetect import detect
import typing as t
from dataclasses import dataclass
import logging
import numpy as np
from datasets import load_dataset
import nest_asyncio

class LanguageMatchAnswer(BaseModel):
    input_language: str = Field(..., description="Language of the input")
    response_language: str = Field(..., description="Language of the response")
    match: int = Field(..., description="1 if languages match, 0 otherwise")

class LanguageMatchAnswers(BaseModel):
    __root__: t.List[LanguageMatchAnswer]

    def dicts(self) -> t.List[t.Dict]:
        return self.dict()["__root__"]

@dataclass
class LanguageMatch(MetricWithLLM):
    name: str = "language_match"
    evaluation_mode: EvaluationMode = EvaluationMode.qac

    def _detect_language(self, text: str) -> str:
        return detect(text)

    def _compute_score(self, answers: LanguageMatchAnswers):
        matches = sum(answer.match for answer in answers.__root__)
        num_answers = len(answers.__root__)
        if num_answers:
            score = matches / num_answers
        else:
            logger = logging.getLogger(__name__)
            logger.warning("No answers were provided.")
            score = np.nan
        return score

    async def _ascore(self: t.Self, row: t.Dict, callbacks: Callbacks) -> float:
        input_text = row["question"]
        response_text = row["answer"]

        input_language = self._detect_language(input_text)
        response_language = self._detect_language(response_text)

        match = 1 if input_language == response_language else 0

        answers = LanguageMatchAnswers.parse_obj(
            [{"input_language": input_language, "response_language": response_language, "match": match}]
        )

        return self._compute_score(answers)

language_match = LanguageMatch()

nest_asyncio.apply()

# [insert your code for llm, embedding funcs and runconfig]

result = evaluate(
    ds,
    metrics=[language_match],
    llm=llm,
    embeddings=emb_fun,
    run_config=runconfig_cofig,
)
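
For anyone copying this, one possible way to fill in the placeholder above, reusing the OpenAI models mentioned earlier in the thread. The model names, the default RunConfig and the toy rows are illustrative assumptions only; the variable names simply match the evaluate call above:

from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from datasets import Dataset

openai_key = "<your-key-here>"

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0, openai_api_key=openai_key)
emb_fun = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=openai_key)
runconfig_cofig = RunConfig()  # defaults; parameters such as timeout or max_workers can be tuned

# Toy dataset with the columns the metric reads ("question", "answer") plus "contexts"
ds = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["La capitale de la France est Paris."],
    "contexts": [["Paris is the capital of France."]],
})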