explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

subclass of LLM #1439

Open amin-kh96 opened 1 week ago

amin-kh96 commented 1 week ago

I wrote the code below using subclasses of BaseRagasLLM and BaseRagasEmbeddings. I already have the embeddings, but I hit an error about setting the API key. To work around that, since I already have everything the metrics need, I created a subclass of BaseRagasLLM. Now I face the error below, even though I defined the required methods in my subclass. Please take a look at my code and the error and help me fix it.

the code:

import json
from transformers import AutoModel, AutoTokenizer
import torch
import numpy as np
from datasets import Dataset
from ragas.embeddings import BaseRagasEmbeddings
from ragas.metrics import context_utilization, ContextUtilization
from ragas.llms import BaseRagasLLM
import asyncio
from ragas import evaluate

# Load the ground truth data
file_path = r'C:\Users\Amin\OneDrive - unige.it\Documenti\projectss\ragas-prototype\src\assets\GT.json'
with open(file_path) as f:
    ground_truth_data = json.load(f)

# Load the questions, answers, and chunks
file_path = r'C:\Users\Amin\OneDrive - unige.it\Documenti\projectss\ragas-prototype\src\assets\user_llm_interaction_embeddings_c1521dd5_b819_4241_b3a4_3e5c1388037c.json'
with open(file_path) as f:
    llm = json.load(f)

# Initialize an empty list to hold the new dataset
data_set = []

# Iterate through the list and combine every two dictionaries
for i in range(0, len(llm), 2):
    combined_dict = {
        "text_vector_1": llm[i].get("text_vector", []),
        "text_vector_2": llm[i + 1].get("text_vector", []),
        'chunks': llm[i + 1].get('chunks', [])
    }
    data_set.append(combined_dict)

def map_chunks(data_set, ground_truth_data):
    for item in data_set:  # Iterate over each dictionary in data_set
        c = []  # Reset c for each item
        for chunk_id in item['chunks']:  # Loop through 'chunks' in the current dictionary
            for element in ground_truth_data:  # Loop through ground_truth_data
                if element['id'] == chunk_id:  # Match chunk_id with element's id
                    c.append(element['text_vector'])  # Append the matching text_vector to c
        item['chunks'] = c  # Replace the original 'chunks' (ids) with the mapped text_vector values

    return data_set  # Return the updated data_set

data_set = map_chunks(data_set, ground_truth_data)

# Assuming data_set is a list of dictionaries
ragas_data = [
    {
        "question": entry["text_vector_1"],  # Assuming this is a list of strings
        "answer": entry["text_vector_2"],    # Assuming this is a list of strings
        "contexts": entry["chunks"]          # Assuming this is a list of lists of strings
    }
    for entry in data_set
]

# Create the required structure for Dataset
formatted_data = {
    "question": [entry["question"] for entry in ragas_data],
    "contexts": [entry["contexts"] for entry in ragas_data],
    "answer": [entry["answer"] for entry in ragas_data]
}

model_name = 'distilbert-base-uncased'

class CustomHuggingFaceRagasEmbeddings(BaseRagasEmbeddings):
    def __init__(self, model_name: str, custom_embeddings: list = None):
        """
        Initialize the custom Hugging Face Ragas embeddings with the specified model
        and optional pre-computed embeddings.

        Parameters:
            model_name (str): The name of the Hugging Face model to use (e.g., 'distilbert-base-uncased').
            custom_embeddings (list): A list of pre-computed custom embeddings (optional).
        """
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.custom_embeddings = custom_embeddings  # Store the custom embeddings

    def embed_documents(self, texts: list) -> np.ndarray:
        """
        Generate embeddings for a list of documents.

        Parameters:
            texts (list): A list of documents to embed.

        Returns:
            np.ndarray: An array of embeddings for the documents.
        """
        if self.custom_embeddings is not None:
            # If custom embeddings are provided, return those instead
            return np.array(self.custom_embeddings)

        # Generate new embeddings using the model if no custom embeddings are available
        inputs = self.tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

        with torch.no_grad():
            outputs = self.model(**inputs)

        # Use the CLS token as the sentence embedding
        embeddings = outputs.last_hidden_state[:, 0, :]
        return embeddings.numpy()  # Convert to NumPy array

    def embed_query(self, query: str) -> np.ndarray:
        """
        Generate an embedding for a single query.

        Parameters:
            query (str): The query to embed.

        Returns:
            np.ndarray: The embedding for the query.
        """
        if self.custom_embeddings is not None:
            # Decide how to relate the query to the pre-computed embeddings
            raise NotImplementedError("Custom query embeddings are not supported with provided custom embeddings.")

        # Generate a new embedding using the model
        inputs = self.tokenizer(query, return_tensors='pt', padding=True, truncation=True)

        with torch.no_grad():
            outputs = self.model(**inputs)

        # Use the CLS token as the query embedding
        embedding = outputs.last_hidden_state[:, 0, :]
        return embedding.numpy()  # Convert to NumPy array

# Initialize the custom embeddings class with the pre-computed embeddings
ragas_embeddings = CustomHuggingFaceRagasEmbeddings(model_name=model_name, custom_embeddings=ragas_data)

# Define the custom LLM class
class CustomRagasLLM(BaseRagasLLM):
    def __init__(self, api_key: str = None):
        """
        Initialize the custom LLM, optionally using an API key if necessary.
        """
        self.api_key = api_key

    def _call(self, prompt: str) -> str:
        """
        Process the prompt and return a result. This can be customized to
        use a local model or perform any required logic.
        """
        if not self.api_key:
            return f"Processed: {prompt} (without API key)"
        else:
            # Handle the LLM response if using an API
            return f"Processed: {prompt} (with API key: {self.api_key})"

    # Implementation of abstract method 'generate_text' (synchronous)
    def generate_text(self, prompt: str) -> str:
        """
        Synchronously generate text for a given prompt.
        """
        return self._call(prompt)

    # Implementation of abstract method 'agenerate_text' (asynchronous)
    async def agenerate_text(self, prompt: str) -> str:
        """
        Asynchronously generate text for a given prompt.
        """
        return self._call(prompt)

# Initialize the custom LLM class (set api_key=None if you don't need it)
custom_llm = CustomRagasLLM(api_key=None)

# Define the evaluation metrics
metrics = [context_utilization]

# Let's define a custom version of the evaluate function
def custom_evaluate(ragas_data, metrics, embeddings: BaseRagasEmbeddings):
    """
    Custom evaluation function that avoids using the OpenAI API.

    Parameters:
        ragas_data: The dataset to evaluate.
        metrics: A list of metrics to evaluate.
        embeddings: A custom embedding model (subclass of BaseRagasEmbeddings).

    Returns:
        A dictionary of evaluation results.
    """
    results = {}

    # Iterate over the metrics and evaluate
    # for metric in metrics:
    #     try:
    #         # Make sure the metric is compatible with the custom embeddings
    #         if isinstance(metric, ContextUtilization):  # Check for specific metric type
    #             result = evaluate(ragas_data, metric, embeddings=embeddings)
    #             results[metric.name] = result
    #     except Exception as e:
    #         print(f"Error while evaluating metric {metric.name}: {e}")

    return results

# Run the evaluation
evaluation_report = evaluate(ragas_data, metrics=metrics, embeddings=ragas_embeddings, llm=custom_llm)  # Pass your custom LLM here

# Print the evaluation results
print("RAGAS Evaluation Report:")
print(evaluation_report)

the error:

Exception has occurred: TypeError
Can't instantiate abstract class CustomRagasLLM without an implementation for abstract methods 'agenerate_text', 'generate_text'
  File "C:\Users\Amin\OneDrive - unige.it\Documenti\projectss\ragas-prototype\src\rag_eva.py", line 158, in <module>
    custom_llm = CustomRagasLLM(api_key=None)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Can't instantiate abstract class CustomRagasLLM without an implementation for abstract methods 'agenerate_text', 'generate_text'

dosubot[bot] commented 1 week ago

The error you're encountering indicates that the CustomRagasLLM class is missing implementations for the abstract methods generate_text and agenerate_text from the BaseRagasLLM class. However, your code does include these methods, so the issue might be related to the method signatures not matching the expected ones in the BaseRagasLLM class.

Here's what you need to do:

  1. Ensure Method Signatures Match: The method signatures in your subclass must match those defined in the BaseRagasLLM class. Here are the correct signatures based on the BaseRagasLLM class:

    def generate_text(
        self,
        prompt: PromptValue,
        n: int = 1,
        temperature: float = 1e-8,
        stop: t.Optional[t.List[str]] = None,
        callbacks: Callbacks = None,
    ) -> LLMResult:
        # Your implementation here
        ...

    async def agenerate_text(
        self,
        prompt: PromptValue,
        n: int = 1,
        temperature: t.Optional[float] = None,
        stop: t.Optional[t.List[str]] = None,
        callbacks: Callbacks = None,
    ) -> LLMResult:
        # Your implementation here
        ...
  2. Adjust Your Implementations: Update your generate_text and agenerate_text methods to match these signatures. You will need to adjust the logic inside these methods to handle the additional parameters and return an LLMResult object.

Here's an example of how you might adjust your generate_text method:

def generate_text(
    self,
    prompt: PromptValue,
    n: int = 1,
    temperature: float = 1e-8,
    stop: t.Optional[t.List[str]] = None,
    callbacks: Callbacks = None,
) -> LLMResult:
    # Example implementation
    text = self._call(prompt)
    return LLMResult(generations=[[Generation(text=text)]])  # generations is a list of lists (one per prompt)

And for agenerate_text:

async def agenerate_text(
    self,
    prompt: PromptValue,
    n: int = 1,
    temperature: t.Optional[float] = None,
    stop: t.Optional[t.List[str]] = None,
    callbacks: Callbacks = None,
) -> LLMResult:
    # Example implementation
    text = self._call(prompt)
    return LLMResult(generations=[[Generation(text=text)]])

Make sure to replace PromptValue, LLMResult, and Generation with the appropriate types or classes used in your project. This should resolve the error and allow you to instantiate your CustomRagasLLM class.
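For reference, here is a hedged sketch of where these types typically live in recent ragas (0.1.x) and langchain-core releases; exact module paths vary by version, so treat these as assumptions to verify against your install:

import typing as t
from ragas.llms import BaseRagasLLM
from ragas.llms.prompt import PromptValue                  # ragas prompt wrapper (0.1.x layout)
from langchain_core.outputs import LLMResult, Generation   # result types ragas consumes
from langchain_core.callbacks import Callbacks             # type alias used in the signatures above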

To continue talking to Dosu, mention @dosu.

amin-kh96 commented 1 week ago

Let me tell you one thing: I'm doing this to bypass the LLM requirement, because I already have the embeddings of the context, query, and answer. I created this class only to bypass the requirement for an API key, specifically the OpenAI API key. Is your answer still correct for this situation? @dosu.


dosubot[bot] commented 1 week ago

Yes, the solution I provided is still applicable to your situation. Even if you are bypassing the API key requirements, the CustomRagasLLM class must still implement the abstract methods generate_text and agenerate_text with the correct method signatures as defined in the BaseRagasLLM class. This is necessary to satisfy the requirements of the abstract base class and avoid the instantiation error.

By implementing these methods with the correct signatures, you ensure that your subclass conforms to the expected interface, regardless of whether you are using an API key or not. You can then customize the logic within these methods to suit your needs, such as using precomputed embeddings or other local resources instead of making API calls.
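To make that concrete, here is a minimal, hedged sketch of a "bypass" subclass that satisfies the interface without any API key. The class name NoApiKeyLLM is made up for illustration, and the imports assume the 0.1.x module layout shown elsewhere in this thread:

import typing as t
from langchain_core.outputs import Generation, LLMResult
from ragas.llms import BaseRagasLLM
from ragas.llms.prompt import PromptValue

class NoApiKeyLLM(BaseRagasLLM):
    """Placeholder LLM that never calls an external API."""

    def generate_text(
        self,
        prompt: PromptValue,
        n: int = 1,
        temperature: float = 1e-8,
        stop: t.Optional[t.List[str]] = None,
        callbacks=None,
    ) -> LLMResult:
        # Return a fixed answer wrapped in the structure ragas expects:
        # one inner list of Generations per prompt.
        return LLMResult(generations=[[Generation(text="no-op")]])

    async def agenerate_text(
        self,
        prompt: PromptValue,
        n: int = 1,
        temperature: float = 1e-8,
        stop: t.Optional[t.List[str]] = None,
        callbacks=None,
    ) -> LLMResult:
        # Async variant required by the abstract base class; it can delegate
        # to the synchronous path since nothing here actually needs awaiting.
        return self.generate_text(prompt, n, temperature, stop, callbacks)

custom_llm = NoApiKeyLLM()

Both abstract methods are implemented, so instantiation no longer raises a TypeError, and no network call is ever made.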

To continue talking to Dosu, mention @dosu.

amin-kh96 commented 1 week ago

@dosu. What do you mean by correct signatures? Also, I don't understand this part: "Make sure to replace PromptValue, LLMResult, and Generation with the appropriate types or classes used in your project. This should resolve the error and allow you to instantiate your CustomRagasLLM class."

dosubot[bot] commented 1 week ago

By "correct signatures," I mean that the method definitions in your subclass should exactly match the method definitions in the BaseRagasLLM class. This includes the method name, parameters, and return type. Here are the correct method signatures for generate_text and agenerate_text in the BaseRagasLLM class:

def generate_text(
    self,
    prompt: PromptValue,
    n: int = 1,
    temperature: float = 1e-8,
    stop: t.Optional[t.List[str]] = None,
    callbacks: t.List = [],
) -> LLMResult

async def agenerate_text(
    self,
    prompt: PromptValue,
    n: int = 1,
    temperature: float = 1e-8,
    stop: t.Optional[t.List[str]] = None,
    callbacks: t.List = [],
) -> LLMResult

Regarding the part about replacing PromptValue, LLMResult, and Generation with the appropriate types or classes:

In your implementation, ensure that you are using the correct types or classes that are defined in your project or the libraries you are using. This will help you avoid type-related errors and ensure that your subclass is correctly implementing the abstract methods.

To continue talking to Dosu, mention @dosu.

amin-kh96 commented 1 week ago

@dosu. I do not know why, when I create a subclass, I face the error "Can't instantiate abstract class CustomRagasLLM without an implementation for abstract method 'agenerate_text'", even though I used it in my function. Can you explain why I face this instantiation error?

dosubot[bot] commented 1 week ago

The error occurs because your subclass CustomRagasLLM must implement the abstract method agenerate_text from the BaseRagasLLM class. Even if you've defined a method with the same name, its signature must exactly match the one in the base class. Here's the correct signature you need to use:

async def agenerate_text(
    self,
    prompt: PromptValue,
    n: int = 1,
    temperature: float = 1e-8,
    stop: t.Optional[t.List[str]] = None,
    callbacks: t.List = [],
) -> LLMResult:
    # Your implementation here

Ensure that your method matches this signature, including all parameters and return type. This will resolve the instantiation error [1].

To continue talking to Dosu, mention @dosu.

amin-kh96 commented 1 week ago

@dosu. My Python skills are not at an expert level. I think I need to use 'await' in the async function you gave me above. Can you explain it to me, please?

dosubot[bot] commented 1 week ago

To use await in an async function, you need to call asynchronous functions with it. This allows the function to pause and wait for the result of the asynchronous operation without blocking the entire program. Here's how you can use await in the agenerate_text method:

async def agenerate_text(
    self,
    prompt: PromptValue,
    n: int = 1,
    temperature: float = 1e-8,
    stop: t.Optional[t.List[str]] = None,
    callbacks: t.List = [],
) -> LLMResult:
    # Example of using await with an asynchronous call
    result = await some_async_function(prompt)
    return LLMResult(generations=[[Generation(text=result)]])

In this example, some_async_function is a placeholder for any asynchronous function you might be calling. Using await ensures that the function waits for some_async_function to complete before proceeding. This is essential for handling asynchronous operations properly in Python [1][2].
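If it helps, here is a self-contained illustration of await outside ragas; fake_llm_call is a made-up coroutine used purely to demonstrate the mechanics:

import asyncio

async def fake_llm_call(prompt: str) -> str:
    # A made-up coroutine standing in for a real asynchronous LLM client
    await asyncio.sleep(0.1)  # simulate network latency without blocking the event loop
    return f"Processed: {prompt}"

async def main() -> None:
    # 'await' suspends main() here until fake_llm_call finishes, then resumes
    result = await fake_llm_call("hello")
    print(result)

asyncio.run(main())  # start an event loop and run the coroutine from synchronous code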

To continue talking to Dosu, mention @dosu.

amin-kh96 commented 1 week ago

@dosu. Can you give me these functions in an easier way, I mean without using an async function? My Python skills are not so good at the moment. These are the signatures:

def generate_text(
    self,
    prompt: PromptValue,
    n: int = 1,
    temperature: float = 1e-8,
    stop: t.Optional[t.List[str]] = None,
    callbacks: t.List = [],
) -> LLMResult

async def agenerate_text(
    self,
    prompt: PromptValue,
    n: int = 1,
    temperature: float = 1e-8,
    stop: t.Optional[t.List[str]] = None,
    callbacks: t.List = [],
) -> LLMResult

dosubot[bot] commented 1 week ago

To implement the generate_text function without using async, you can keep it synchronous. Here's how you can do it:

def generate_text(
    self,
    prompt: PromptValue,
    n: int = 1,
    temperature: float = 1e-8,
    stop: t.Optional[t.List[str]] = None,
    callbacks: t.List = [],
) -> LLMResult:
    # Example implementation
    text = self._call(prompt)
    return LLMResult(generations=[[Generation(text=text)]])

For the agenerate_text function, if you want to avoid async, you can simply call the synchronous generate_text method:

def agenerate_text(
    self,
    prompt: PromptValue,
    n: int = 1,
    temperature: float = 1e-8,
    stop: t.Optional[t.List[str]] = None,
    callbacks: t.List = [],
) -> LLMResult:
    # Call the synchronous method
    return self.generate_text(prompt, n, temperature, stop, callbacks)

This way, you avoid using async while still implementing the required methods [1].
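One caveat with the fully synchronous variant: if the calling code does await llm.agenerate_text(...), as async frameworks generally do, awaiting a plain def will fail at runtime. A hedged middle ground, should that happen, is to keep the async signature but leave the body synchronous:

async def agenerate_text(
    self,
    prompt: PromptValue,
    n: int = 1,
    temperature: float = 1e-8,
    stop: t.Optional[t.List[str]] = None,
    callbacks: t.List = [],
) -> LLMResult:
    # 'async' appears only in the signature; the body needs no 'await'
    return self.generate_text(prompt, n, temperature, stop, callbacks)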

To continue talking to Dosu, mention @dosu.

amin-kh96 commented 1 week ago

@dosu. Please check my function; if there is any error or mistake, help me solve it. My code:

# Define the custom LLM class
class CustomRagasLLM(BaseRagasLLM):
    def __init__(self, api_key: str = None):
        """
        Initialize the custom LLM, optionally using an API key if necessary.
        """
        self.api_key = api_key

    def _call(self, prompt: str) -> str:
        """
        Process the prompt and return a result. This can be customized to
        use a local model or perform any required logic.
        """
        if not self.api_key:
            return f"Processed: {prompt} (without API key)"
        else:
            # Handle the LLM response if using an API
            return f"Processed: {prompt} (with API key: {self.api_key})"

    def generate_text(
        self,
        prompt: PromptValue,
        n: int = 1,
        temperature: float = 1e-8,
        stop: t.Optional[t.List[str]] = None,
        callbacks: t.List = [],
    ) -> LLMResult:
        text = self._call(prompt)
        return LLMResult(generations=[[Generation(text=text)]])

    def agenerate_text(
        self,
        prompt: PromptValue,
        n: int = 1,
        temperature: float = 1e-8,
        stop: t.Optional[t.List[str]] = None,
        callbacks: t.List = [],
    ) -> LLMResult:
        # Call the synchronous method
        return self.generate_text(prompt, n, temperature, stop, callbacks)

custom_llm = CustomRagasLLM(api_key=None)

dosubot[bot] commented 1 week ago

Your CustomRagasLLM class implementation looks mostly correct, but there are a few adjustments needed:

  1. Async Method: If you want agenerate_text to remain asynchronous, define it with async def, even if the body only calls a synchronous method. If you prefer to avoid async entirely, you can keep it as a plain method as you have done, but then it must not be labeled as an async method.

  2. Return Types: Ensure that LLMResult and Generation are correctly imported and used. They should match the expected types in your project.

Here's a revised version of your class:

from ragas.llms import BaseRagasLLM
from ragas.llms.prompt import PromptValue
from langchain_core.outputs import LLMResult, Generation
import typing as t

class CustomRagasLLM(BaseRagasLLM):
    def __init__(self, api_key: str = None):
        self.api_key = api_key

    def _call(self, prompt: str) -> str:
        if not self.api_key:
            return f"Processed: {prompt} (without API key)"
        else:
            return f"Processed: {prompt} (with API key: {self.api_key})"

    def generate_text(
        self,
        prompt: PromptValue,
        n: int = 1,
        temperature: float = 1e-8,
        stop: t.Optional[t.List[str]] = None,
        callbacks: t.List = [],
    ) -> LLMResult:
        text = self._call(prompt)
        return LLMResult(generations=[[Generation(text=text)]])

    def agenerate_text(
        self,
        prompt: PromptValue,
        n: int = 1,
        temperature: float = 1e-8,
        stop: t.Optional[t.List[str]] = None,
        callbacks: t.List = [],
    ) -> LLMResult:
        return self.generate_text(prompt, n, temperature, stop, callbacks)

custom_llm = CustomRagasLLM(api_key=None)

Make sure you have the correct imports for PromptValue, LLMResult, and Generation [1].
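As a quick sanity check (hedged; it assumes the imports above resolve in your environment), you can confirm that instantiation now succeeds and that generate_text returns the expected nested structure:

llm = CustomRagasLLM(api_key=None)
result = llm.generate_text("test prompt")   # a plain str works here because _call just formats it
print(type(result).__name__)                # LLMResult
print(result.generations[0][0].text)        # Processed: test prompt (without API key)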

To continue talking to Dosu, mention @dosu.

amin-kh96 commented 1 week ago

@dosu. My dataset is in this format: [{'question': [...], 'answer': [...], 'contexts': [...]}, {'question': [...], 'answer': [...], 'contexts': [...]}, ...] where the value of each key is a list of floats (pre-computed embeddings). I already created subclasses of BaseRagasEmbeddings and BaseRagasLLM, but now I get this error:

Exception has occurred: AttributeError
'list' object has no attribute 'get_sample_type'
  File "C:\Users\Amin\OneDrive - unige.it\Documenti\projectss\ragas-prototype\src\rag_eva.py", line 215, in <module>
    evaluation_report = evaluate(ragas_data, metrics=metrics, embeddings=ragas_embeddings, llm=custom_llm)  # Pass your custom LLM here
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'get_sample_type'

dosubot[bot] commented 1 week ago

The error you're encountering is because the evaluate function expects the dataset to be in a specific format, but it seems like you're passing a list of dictionaries instead of the required format.

The evaluate function in the Ragas library expects the dataset to be a dictionary with the following structure:

dataset: dict[question: list[str], ground_truth: list[list[str]]]

Here's how you can adjust your dataset to match the expected format:

  1. Convert Your Dataset: Transform your list of dictionaries into the required dictionary format. You need to extract the question and ground_truth (which might be your contexts or answer) from your data.

  2. Example Conversion:

ragas_data = {
    "question": [entry["question"] for entry in your_data],
    "ground_truth": [entry["contexts"] for entry in your_data]  # or "answer" if that's your ground truth
}
  3. Pass the Correct Format: Use this transformed ragas_data when calling the evaluate function.

This should resolve the AttributeError by ensuring the dataset is in the correct format for the evaluate function [1][2].

To continue talking to Dosu, mention @dosu.

amin-kh96 commented 1 week ago

@dosu. But my dataset is in this format. I'll give you the code, take a look:

# Assuming data_set is a list of dictionaries
ragas_data = [
    {
        "question": entry["text_vector_1"],  # Assuming this is a list of strings
        "answer": entry["text_vector_2"],    # Assuming this is a list of strings
        "contexts": entry["chunks"]          # Assuming this is a list of lists of strings
    }
    for entry in data_set
]

# Create the required structure for Dataset
formatted_data = {
    "question": [entry["question"] for entry in ragas_data],
    "contexts": [entry["contexts"] for entry in ragas_data],
    "answer": [entry["answer"] for entry in ragas_data]
}

# Now convert this dictionary into a Hugging Face Dataset
ragas_dataset = Dataset.from_dict(formatted_data)

dosubot[bot] commented 1 week ago

To resolve the 'list' object has no attribute 'get_sample_type' error, you need to ensure that your dataset is an instance of EvaluationDataset, not just a list or a Hugging Face Dataset. Here's how you can convert your data into an EvaluationDataset:

  1. Use EvaluationDataset.from_list: Convert your list of dictionaries into an EvaluationDataset.

Here's how you can do it:

from your_module import EvaluationDataset  # Replace with the actual module name

# Assuming data_set is your list of dictionaries
ragas_data = [
    {
        "question": entry["text_vector_1"],
        "answer": entry["text_vector_2"],
        "contexts": entry["chunks"]
    }
    for entry in data_set
]

# Create an EvaluationDataset instance
evaluation_dataset = EvaluationDataset.from_list(ragas_data)

# Now you can use this dataset with the evaluate function
evaluation_report = evaluate(evaluation_dataset, metrics=metrics, embeddings=ragas_embeddings, llm=custom_llm)

Make sure to replace your_module with the actual module where EvaluationDataset is defined. This should resolve the error by ensuring the dataset is in the correct format for the evaluate function [1].
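As a hedged pointer: in recent ragas releases (the 0.2.x line, where the get_sample_type check originates), EvaluationDataset is usually importable from the package root, though the exact location can change between versions:

from ragas import EvaluationDataset  # typical location in ragas 0.2.x; verify against your installed version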

To continue talking to Dosu, mention @dosu.