explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

evaluate function #1421

Open amin-kh96 opened 1 month ago

amin-kh96 commented 1 month ago

I created a subclass of BaseRagasEmbeddings because I already have all the embeddings for the context, the question, and the answer. I did this to avoid using an OpenAI API key, since it is costly, and I also want to use other models such as Mistral. The model I used to create the embeddings is 'text-embedding-ada-002'. The problem is that I keep hitting this error:

    Exception has occurred: OpenAIError
    The api_key client option must be set either by passing api_key to the client
    or by setting the OPENAI_API_KEY environment variable
      File "C:\Users\Amin\OneDrive - unige.it\Desktop\tirocini\code\version01rageva.py", line 165
        evaluation_report = evaluate(ragas_data, metrics=metrics, embeddings=custom_embeddings)
    openai.OpenAIError: The api_key client option must be set either by passing
    api_key to the client or by setting the OPENAI_API_KEY environment variable.

As far as I know, I should not need an API key here, so please help me fix this error. I have also included my code below for you to take a look at. Note that I am using a model from Hugging Face, and the model name is: model_name = 'distilbert-base-uncased'.

    import json
    import numpy as np
    import torch
    from transformers import AutoModel, AutoTokenizer
    from datasets import Dataset
    from ragas.embeddings import BaseRagasEmbeddings
    from ragas.metrics import context_utilization, ContextUtilization
    from ragas import evaluate

    # Load the ground truth data
    file_path = 'assets/GT.json'
    with open(file_path) as f:
        ground_truth_data = json.load(f)

    # Load the question, the answer, and the chunks
    file_path = 'assets/user_llm_interaction_embeddings_c1521dd5_b819_4241_b3a4_3e5c1388037c.json'
    with open(file_path) as f:
        llm = json.load(f)

    # Initialize an empty list to hold the new dataset
    data_set = []

    # Iterate through the list and combine every two dictionaries
    for i in range(0, len(llm), 2):
        combined_dict = {
            "text_vector_1": llm[i].get("text_vector", []),
            "text_vector_2": llm[i + 1].get("text_vector", []),
            "chunks": llm[i + 1].get("chunks", [])
        }
        data_set.append(combined_dict)

    def map_chunks(data_set, ground_truth_data):
        for item in data_set:  # Iterate over each dictionary in data_set
            c = []  # Reset c for each item
            for chunk_id in item['chunks']:  # Loop through 'chunks' in the current dictionary
                for element in ground_truth_data:  # Loop through ground_truth_data
                    if element['id'] == chunk_id:  # Match chunk_id with element's id
                        c.append(element['text_vector'])  # Append the matching text_vector
            item['chunks'] = c  # Replace the chunk ids with the mapped text_vector values
        return data_set  # Return the updated data_set

    data_set = map_chunks(data_set, ground_truth_data)

    # Assuming data_set is a list of dictionaries
    ragas_data = [
        {
            "question": entry["text_vector_1"],  # Assuming this is a list of strings
            "answer": entry["text_vector_2"],    # Assuming this is a list of strings
            "contexts": entry["chunks"]          # Assuming this is a list of lists of strings
        }
        for entry in data_set
    ]

    # Create the required structure for Dataset
    formatted_data = {
        "question": [entry["question"] for entry in ragas_data],
        "contexts": [entry["contexts"] for entry in ragas_data],
        "answer": [entry["answer"] for entry in ragas_data]
    }

    model_name = 'distilbert-base-uncased'

    class CustomHuggingFaceRagasEmbeddings(BaseRagasEmbeddings):

        def __init__(self, model_name: str, custom_embeddings: list = None):
            """
            Initialize the custom Hugging Face Ragas embeddings with the
            specified model and optional pre-computed embeddings.

            Parameters:
                model_name (str): The name of the Hugging Face model to use
                    (e.g., 'distilbert-base-uncased').
                custom_embeddings (list): A list of pre-computed custom embeddings (optional).
            """
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            self.model = AutoModel.from_pretrained(model_name)
            self.custom_embeddings = custom_embeddings  # Store the pre-computed embeddings

        def embed_documents(self, texts: list) -> np.ndarray:
            """
            Generate embeddings for a list of documents.

            Parameters:
                texts (list): A list of documents to embed.

            Returns:
                np.ndarray: An array of embeddings for the documents.
            """
            if self.custom_embeddings is not None:
                # If custom embeddings are provided, return those instead
                return np.array(self.custom_embeddings)

            # Generate new embeddings using the model if no custom embeddings are available
            inputs = self.tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
            with torch.no_grad():
                outputs = self.model(**inputs)

            # Use the CLS token as the sentence embedding
            embeddings = outputs.last_hidden_state[:, 0, :]
            return embeddings.numpy()  # Convert to NumPy array

        def embed_query(self, query: str) -> np.ndarray:
            """
            Generate an embedding for a single query.

            Parameters:
                query (str): The query to embed.

            Returns:
                np.ndarray: The embedding for the query.
            """
            if self.custom_embeddings is not None:
                # Relating a query to the pre-computed embeddings is not handled yet
                raise NotImplementedError("Custom query embeddings are not supported with provided custom embeddings.")

            # Generate a new embedding using the model
            inputs = self.tokenizer(query, return_tensors='pt', padding=True, truncation=True)
            with torch.no_grad():
                outputs = self.model(**inputs)

            # Use the CLS token as the query embedding
            embedding = outputs.last_hidden_state[:, 0, :]
            return embedding.numpy()  # Convert to NumPy array

    # Initialize the custom embeddings class with the pre-computed embeddings
    custom_embeddings = CustomHuggingFaceRagasEmbeddings(model_name=model_name, custom_embeddings=ragas_data)

    # Define the evaluation metrics
    metrics = [context_utilization]

    # Let's define a custom version of the evaluate function
    def custom_evaluate(ragas_data, metrics, embeddings: BaseRagasEmbeddings):
        """
        Custom evaluation function that avoids using the OpenAI API.

        Parameters:
            ragas_data: The dataset to evaluate.
            metrics: A list of metrics to evaluate.
            embeddings: A custom embedding model (subclass of BaseRagasEmbeddings).

        Returns:
            A dictionary of evaluation results.
        """
        results = {}

        # Iterate over the metrics and evaluate
        # for metric in metrics:
        #     try:
        #         # Make sure the metric is compatible with the custom embeddings
        #         if isinstance(metric, ContextUtilization):  # Check for specific metric type
        #             result = evaluate(ragas_data, metric, embeddings=embeddings)
        #             results[metric.name] = result
        #     except Exception as e:
        #         print(f"Error while evaluating metric {metric.name}: {e}")

        return results

    # Run the evaluation
    evaluation_report = evaluate(ragas_data, metrics=metrics, embeddings=custom_embeddings)

    # Print the evaluation results
    print("RAGAS Evaluation Report:")
    print(evaluation_report)

dosubot[bot] commented 1 month ago

To resolve the OpenAIError related to the API key when using a custom subclass of BaseRagasEmbeddings with the 'distilbert-base-uncased' model from Hugging Face, you should ensure that your custom embeddings class does not inadvertently require an OpenAI API key. It seems like you are already using a Hugging Face model, which should not require an OpenAI API key. However, the error suggests that somewhere in your code, there might be a call or configuration that still expects an OpenAI API key.

Here are a few steps to troubleshoot and resolve the issue:

  1. Check the evaluate Function: Ensure that the evaluate function you are using does not internally require an OpenAI API key. If it does, you might need to modify it or use a different evaluation approach that does not depend on OpenAI services.

  2. Custom Embeddings Initialization: When you initialize your CustomHuggingFaceRagasEmbeddings, make sure you are not passing any parameters or configurations that might trigger the use of OpenAI services. Your current implementation seems correct in this regard, as it uses Hugging Face's AutoModel and AutoTokenizer.

  3. Remove OpenAI Dependencies: Double-check your imports and any other parts of your code to ensure there are no lingering dependencies or calls to OpenAI's API. This includes checking any utility functions or libraries that might be used within the evaluate function.

  4. Use HuggingfaceEmbeddings Directly: If the issue persists, consider using the HuggingfaceEmbeddings class provided by the Ragas framework, which is specifically designed to work with Hugging Face models without requiring an OpenAI API key. Here's a quick example of how you can use it:

    from ragas.embeddings.base import HuggingfaceEmbeddings
    
    # Create an instance of HuggingfaceEmbeddings with the desired model
    embeddings = HuggingfaceEmbeddings(model_name='distilbert-base-uncased')
    
    # Use the embed_query or embed_documents methods to get embeddings
    text_embedding = embeddings.embed_query("Your text here")
    document_embeddings = embeddings.embed_documents(["Document 1", "Document 2"])
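
    You can then pass that instance straight into evaluate. A minimal sketch, assuming the evaluate signature used elsewhere in this thread (the dataset rows below are placeholders, not your data):

        from datasets import Dataset
        from ragas import evaluate
        from ragas.metrics import context_utilization
        from ragas.embeddings.base import HuggingfaceEmbeddings

        # Placeholder rows; substitute your real question/contexts/answer strings
        dataset = Dataset.from_dict({
            "question": ["What is the capital of France?"],
            "contexts": [["Paris is the capital and largest city of France."]],
            "answer": ["The capital of France is Paris."],
        })

        embeddings = HuggingfaceEmbeddings(model_name='distilbert-base-uncased')

        # Passing embeddings explicitly stops ragas from building its OpenAI default
        evaluation_report = evaluate(dataset, metrics=[context_utilization], embeddings=embeddings)

    Note that an LLM-graded metric such as context_utilization also needs a judge LLM; evaluate accepts an llm argument for that, and if it is left unset ragas falls back to an OpenAI-based default there too, which can raise the same api_key error.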

By following these steps, you should be able to avoid the need for an OpenAI API key and resolve the error [1].


amin-kh96 commented 1 month ago

I checked the evaluate function, and deep inside there is a function called embedding_factory. By default it uses OpenAI. I have put the whole function here:

    def embedding_factory(
        model: str = "text-embedding-ada-002",
        run_config: t.Optional[RunConfig] = None,
    ) -> BaseRagasEmbeddings:
        openai_embeddings = OpenAIEmbeddings(model=model)
        if run_config is not None:
            openai_embeddings.request_timeout = run_config.timeout
        else:
            run_config = RunConfig()
        return LangchainEmbeddingsWrapper(openai_embeddings, run_config=run_config)

Is it possible to replace the default OpenAI embeddings in embedding_factory while it is being used inside the evaluate function?
dosubot[bot] commented 1 month ago

Yes, you can modify the behavior of the evaluate function to use your custom embeddings instead of the default OpenAI embeddings provided by the embedding_factory. You can do this by passing your custom embeddings directly to the evaluate function. This will bypass the use of the embedding_factory function, as the evaluate function checks if embeddings are provided and uses them instead of creating new ones.

Here's how you can do it:

  1. Ensure your custom embeddings class is correctly implemented and instantiated.
  2. Pass the instance of your custom embeddings to the evaluate function using the embeddings parameter.

For example:

    # Assuming `custom_embeddings` is an instance of your CustomHuggingFaceRagasEmbeddings
    evaluation_report = evaluate(ragas_data, metrics=metrics, embeddings=custom_embeddings)

By providing the embeddings parameter, the evaluate function will use your custom embeddings and will not call the embedding_factory function, thus avoiding the default OpenAI embeddings [1].
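
In other words, the logic inside evaluate behaves roughly like this (a paraphrased sketch inferred from the embedding_factory shown above, not the exact ragas source):

    # Paraphrased sketch of evaluate's fallback, not the exact ragas source
    def evaluate(dataset, metrics=None, llm=None, embeddings=None, **kwargs):
        if embeddings is None:
            # Only reached when no embeddings argument is supplied; this call
            # builds the OpenAI default and raises the api_key error when no
            # key is configured
            embeddings = embedding_factory()
        # ... the metrics then run against the resolved embeddings ...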


amin-kh96 commented 1 month ago

I want to substitute the output of the function called embedding_factory with my already computed embeddings, because I have the computed embeddings for the answer, the question, and the context in a Ragas data frame. I have every number (float) needed for the computation, but the ragas evaluate function performs the computation again.
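
One way to sketch that, assuming embedding_factory is importable from ragas.embeddings as shown above, is to replace the factory with one that returns a wrapper around the pre-computed vectors. PrecomputedEmbeddings and lookup below are hypothetical illustrations, not part of ragas:

    from ragas.embeddings import BaseRagasEmbeddings

    class PrecomputedEmbeddings(BaseRagasEmbeddings):
        """Hypothetical wrapper that replays vectors computed earlier."""

        def __init__(self, text_to_vector: dict):
            # Map from the original text to its pre-computed vector of floats
            self.text_to_vector = text_to_vector

        def embed_documents(self, texts: list) -> list:
            return [self.text_to_vector[text] for text in texts]

        def embed_query(self, query: str) -> list:
            return self.text_to_vector[query]

    # Hypothetical lookup built from the data frame of pre-computed floats
    lookup = {"example question text": [0.12, -0.03, 0.88]}

    # Replace the factory so any internal call returns the wrapper. If
    # evaluate's module imports embedding_factory directly, the name must be
    # patched in that module instead (e.g. ragas.evaluation.embedding_factory).
    import ragas.embeddings
    ragas.embeddings.embedding_factory = lambda *args, **kwargs: PrecomputedEmbeddings(lookup)

That said, the simpler route from the previous answer still applies: passing PrecomputedEmbeddings(lookup) through the embeddings parameter of evaluate skips embedding_factory entirely, so patching the factory only matters if some internal code path calls it regardless.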