explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

using the embeddings #1509

Open amin-kh96 opened 1 week ago

amin-kh96 commented 1 week ago

I have been working with RAGAS for a while, and I have a special case I'd like an opinion on. I already have a dataset that contains the embeddings for the context, answer, and question, but I noticed that the evaluate function runs on textual data. Is there any way I can use the embeddings instead of the textual data? If I customize the evaluate function, I would also have to customize the metrics I want to use, which, if I'm correct, would be too time-consuming. I also don't want to pay for an API key to regenerate embeddings I already have.

dosubot[bot] commented 1 week ago

You can use precomputed embeddings with the RAGAS evaluate function without extensive customization: pass them directly to evaluate using the embeddings parameter. This bypasses the default embedding generation, so no API key is needed for that step.

Here's how you can do it:

  1. Prepare Your Precomputed Embeddings: Ensure your embeddings are structured in a way that the evaluate function can use. This typically involves implementing a class that mimics the expected interface, such as having methods like embed_documents and embed_query.

  2. Pass Precomputed Embeddings to Evaluate: When calling the evaluate function, pass your precomputed embeddings directly:

    # Assuming `precomputed_embeddings` is an instance of your custom embeddings class
    evaluation_report = evaluate(ragas_data, metrics=metrics, embeddings=precomputed_embeddings)

By doing this, the evaluate function will use your provided embeddings and will not invoke the default embedding generation process, thus avoiding recomputation of embeddings [1][2].
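
For illustration, step 1 could be as small as this (a sketch only: the class name and the text-to-vector lookup dict are hypothetical, and the exact abstract methods to implement are listed in the BaseRagasEmbeddings API reference):

    from ragas.embeddings import BaseRagasEmbeddings

    class PrecomputedEmbeddings(BaseRagasEmbeddings):
        """Serve embeddings from a precomputed text -> vector lookup."""

        def __init__(self, lookup: dict):
            self.lookup = lookup  # maps each exact text to its vector (list of floats)

        def embed_query(self, text: str) -> list:
            return self.lookup[text]

        def embed_documents(self, texts: list) -> list:
            return [self.lookup[t] for t in texts]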

To continue talking to Dosu, mention @dosu.

amin-kh96 commented 1 week ago

@dosu no, you are wrong; do not give me a wrong answer.

dosubot[bot] commented 1 week ago

It looks like this issue needs your specialized knowledge, @jjmachan, could you please step in and provide your guidance?

To continue talking to Dosu, mention @dosu.

jjmachan commented 6 days ago

hey @amin-kh96 that is a very interesting use case!

while you don't have to customize the metrics, you will have to write your own BaseRagasEmbeddings: https://docs.ragas.io/en/stable/references/embeddings/#ragas.embeddings.BaseRagasEmbeddings

this means each time ragas tries to compute the embedding for something it will hit your vector store instead of the cloud. You can also write it so that it computes the embedding if none is found.

I can help you implement this too, but this should work. let me know if you have any doubts. cheers 🙂
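
for example, a rough sketch (the dict-based vector_store and the optional fallback embedder are placeholders for however your embeddings are actually stored):

    from ragas.embeddings import BaseRagasEmbeddings

    class VectorStoreBackedEmbeddings(BaseRagasEmbeddings):
        def __init__(self, vector_store: dict, fallback=None):
            self.vector_store = vector_store  # text -> precomputed vector
            self.fallback = fallback          # optional embedder used on a cache miss

        def embed_query(self, text: str) -> list:
            vector = self.vector_store.get(text)
            if vector is None and self.fallback is not None:
                vector = self.fallback.embed_query(text)  # compute only when not found
                self.vector_store[text] = vector          # memoize for next time
            return vector

        def embed_documents(self, texts: list) -> list:
            return [self.embed_query(t) for t in texts]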

amin-kh96 commented 5 days ago

@jjmachan Thanks, but what about the evaluate function, which takes textual data only? What can I do about that? I already created BaseRagasEmbeddings and BaseRagasLLM subclasses to bypass the API key error.

amin-kh96 commented 5 days ago

@jjmachan This is my code. I created subclasses of BaseRagasEmbeddings and BaseRagasLLM, but the problem now is that I already paid for the embeddings to be generated, yet the evaluate function forces me to use textual data. Please take a look at my code.

import json
import numpy as np
from datasets import Dataset
import typing as t
import torch

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import euclidean

from langchain_core.outputs import LLMResult, Generation
from transformers import AutoModel, AutoTokenizer

from ragas import evaluate
from ragas.embeddings import BaseRagasEmbeddings
from ragas.llms import BaseRagasLLM
from ragas.llms.prompt import PromptValue
from ragas.metrics import context_utilization

# Load the ground truth data
file_path = 'C:\\Users\\Amin\\git\\ragas-prototype\\src\\assets\\GT.json'
with open(file_path) as f:
    ground_truth_data = json.load(f)

# Load the question and the answer and the chunks
file_path = 'C:\\Users\\Amin\\git\\ragas-prototype\\src\\assets\\user_llm_interaction_embeddings_c1521dd5_b819_4241_b3a4_3e5c1388037c.json'
with open(file_path) as f:
    llm = json.load(f)

# Initialize an empty list to hold the new dataset
data_set = []

# Iterate through the list and combine every two dictionaries
for i in range(0, len(llm) - 1, 2):  # stop before i + 1 runs past the end
    combined_dict = {
        "text_vector_1": llm[i].get("text_vector", []),
        "text_vector_2": llm[i + 1].get("text_vector", []),
        'chunks': llm[i + 1].get('chunks', [])
    }
    data_set.append(combined_dict)

def map_chunks(data_set, ground_truth_data):
    for item in data_set:  # Iterate over each dictionary in data_set
        c = []  # Reset c for each item
        for chunk_id in item['chunks']:  # Loop through 'chunks' in the current dictionary
            for element in ground_truth_data:  # Loop through ground_truth_data
                if element['id'] == chunk_id:  # Match chunk_id with element's id
                    c.append(element['text_vector'])  # Append the matching text_vector to c
        item['chunks'] = c  # Replace the original 'chunks' (ids) with the mapped text_vector values

    return data_set  # Return the updated data_set

data_set = map_chunks(data_set, ground_truth_data)

# data_set is a list of dictionaries whose values are embedding vectors
ragas_data = [
    {
        "question": entry["text_vector_1"],  # embedding vector (list of floats)
        "answer": entry["text_vector_2"],    # embedding vector (list of floats)
        "contexts": entry["chunks"]          # list of embedding vectors
    }
    for entry in data_set
]

# Structure the data for the Hugging Face Dataset creation
formatted_data = {
    "question": [entry["question"] for entry in ragas_data],
    "contexts": [entry["contexts"] for entry in ragas_data],
    "answer": [entry["answer"] for entry in ragas_data]
}

# Define the column_map to match custom columns to expected ones
column_map = {
    "question": "question",  # Match to the correct column
    "answer": "answer",      # Match to the correct column
    "contexts": "contexts"   # Match to the correct column
}

# Create a Dataset using the Hugging Face `datasets` library
ragas_dataset = Dataset.from_dict(formatted_data)

model_name = 'distilbert-base-uncased'

class CustomHuggingFaceRagasEmbeddings(BaseRagasEmbeddings):
    def __init__(self, model_name: str, custom_embeddings: list = None):
        """
        Initialize the Custom Hugging Face Ragas Embeddings with the specified model and custom embeddings.

        Parameters:
            model_name (str): The name of the Hugging Face model to use (e.g., 'distilbert-base-uncased').
            custom_embeddings (list): A list of pre-computed custom embeddings (optional).
        """
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.custom_embeddings = custom_embeddings  # Store the custom embeddings

    def embed_documents(self, texts: list) -> np.ndarray:
        """
        Generate embeddings for a list of documents.

        Parameters:
            texts (list): A list of documents to embed.

        Returns:
            np.ndarray: An array of embeddings for the documents.
        """
        if self.custom_embeddings is not None:
            # If custom embeddings are provided, return those instead
            return np.array(self.custom_embeddings)

        # Generate new embeddings using the model if no custom embeddings are available
        inputs = self.tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

        with torch.no_grad():
            outputs = self.model(**inputs)

        # Use the pooled output or the CLS token as the embedding
        embeddings = outputs.last_hidden_state[:, 0, :]  # CLS token for sentence embedding
        return embeddings.numpy()  # Convert to NumPy array

    def embed_query(self, query: str) -> np.ndarray:
        """
        Generate an embedding for a single query.

        Parameters:
            query (str): The query to embed.

        Returns:
            np.ndarray: The embedding for the query.
        """
        # If custom embeddings are provided, generate embedding based on those
        if self.custom_embeddings is not None:
            # You might want to handle how to relate the query to your custom embeddings
            raise NotImplementedError("Custom query embeddings are not supported with provided custom embeddings.")

        # Generate a new embedding using the model
        inputs = self.tokenizer(query, return_tensors='pt', padding=True, truncation=True)

        with torch.no_grad():
            outputs = self.model(**inputs)

        # Use the pooled output or the CLS token as the embedding
        embedding = outputs.last_hidden_state[:, 0, :]  # CLS token for single query embedding
        return embedding.numpy()  # Convert to NumPy array

# Initialize the custom embeddings class once, handing it the precomputed vectors
ragas_embeddings = CustomHuggingFaceRagasEmbeddings(model_name=model_name, custom_embeddings=ragas_data)

# Define the custom LLM class
class CustomRagasLLM(BaseRagasLLM):
    def __init__(self, api_key: str = None):
        """
        Initialize the custom LLM, optionally using an API key if necessary.
        """
        self.api_key = api_key

    def _call(self, prompt: str) -> str:
        """
        Process the prompt and return a result. This can be customized to
        use a local model or perform any required logic.
        """
        if not self.api_key:
            return f"Processed: {prompt} (without API key)"
        else:
            # Handle LLM response if using an API
            return f"Processed: {prompt} (with API key: {self.api_key})"

    def generate_text(
        self,
        prompt: PromptValue,
        n: int = 1,
        temperature: float = 1e-8,
        stop: t.Optional[t.List[str]] = None,
        callbacks: t.List = []
    ) -> LLMResult:
        # Synchronous generation logic
        text = self._call(prompt.to_string())  # PromptValue -> plain string
        return LLMResult(generations=[[Generation(text=text)]])

    async def agenerate_text(
        self,
        prompt: PromptValue,
        n: int = 1,
        temperature: float = 1e-8,
        stop: t.Optional[t.List[str]] = None,
        callbacks: t.List = []
    ) -> LLMResult:
        """
        Asynchronous method to generate text. This should allow for async processing.
        """
        # Simulate an asynchronous call, here we directly call the sync method for now
        text =await self._call(prompt)
        return LLMResult(generations=[Generation(text=text)])     

custom_llm = CustomRagasLLM(api_key=None)

please note that I needed the results, so I temporarily used Euclidean distance and cosine similarity for evaluation; just ignore that part. I want to use the evaluate function with the dataset that consists of embeddings for the question, answer, and contexts.

jjmachan commented 5 days ago

@amin-kh96 I don't understand what the problem is - please make it clear and short

Thanks, but what about the evaluate function, which takes textual data only? What can I do about that?

we only support textual data, so I don't understand what the question is. Please put a bit more effort into writing your comments here so that we can be more effective in helping you. I hope you understand 🙂

amin-kh96 commented 5 days ago

@jjmachan I already have the embeddings (for the question, contexts, and answer). I want to use RAGAS with the embeddings that I have, but my problem is that the evaluate function forces me to use str format (i.e., textual data). Do you have any solution that lets the evaluate function support embeddings rather than textual data? If that is not possible, what approach do you suggest?

jjmachan commented 4 days ago

@amin-kh96 that approach will not work

what will work is this:

hey @amin-kh96 that is a very interesting use case!

while you don't have to customize the metrics, you will have to write your own BaseRagasEmbeddings: https://docs.ragas.io/en/stable/references/embeddings/#ragas.embeddings.BaseRagasEmbeddings

this means each time ragas tries to compute the embedding for something it will hit your vector store instead of the cloud. You can also write it so that it computes the embedding if none is found.

I can help you implement this too, but this should work. let me know if you have any doubts. cheers 🙂

implement a custom BaseRagasEmbeddings so that when evaluate() calls embed_text, it checks for the embedding of that text wherever you have it stored and returns it; if it is not found, it can fetch from OpenAI or any other service.
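
concretely, the wiring could look like this (sketch only: `build_text_to_vector_map` and `textual_dataset` are hypothetical stand-ins for your own data loading, and VectorStoreBackedEmbeddings is the lookup-with-fallback class sketched above):

    from ragas import evaluate
    from ragas.metrics import context_utilization

    # keys are the exact question/answer/context strings in the dataset,
    # values are the vectors you already paid for
    cache = build_text_to_vector_map()  # hypothetical helper

    embeddings = VectorStoreBackedEmbeddings(vector_store=cache)

    # evaluate() still receives the *textual* dataset; only the embedding
    # lookups are answered from your cache instead of an embeddings API
    results = evaluate(textual_dataset, metrics=[context_utilization], embeddings=embeddings)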