explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

Ragas evaluation #1297

Senthselvi opened this issue 5 days ago · Status: Open

Senthselvi commented 5 days ago

Faithfulness is not callable

dosubot[bot] commented 5 days ago

To resolve the issue where "Faithfulness is not callable" in the Ragas evaluation, you need to ensure that you are instantiating the Faithfulness class correctly. The error suggests that you might be trying to call an instance of Faithfulness as if it were a function.

Here is the correct way to instantiate and use the Faithfulness class:

faithfulness = Faithfulness()

Make sure you are not trying to call faithfulness as a function. For example, if you have something like this:

score = faithfulness()

You should change it to:

# Assuming you have a sample and callbacks defined
score = await faithfulness._single_turn_ascore(sample, callbacks)

Ensure that you are using the methods provided by the Faithfulness class correctly [1].
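If you would rather not call the metric's internal methods directly, ragas 0.1.x also exposes a ready-made faithfulness instance that the evaluate helper drives for you. A minimal sketch (assuming the standard question/contexts/answer/ground_truth columns and a configured LLM, e.g. an OPENAI_API_KEY, for the metric to use):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness  # pre-built metric instance, not the class

# One-row dataset in the column format ragas expects
dataset = Dataset.from_dict({
    "question": ["When was the first super bowl?"],
    "answer": ["The first superbowl was held on Jan 15, 1967"],
    "contexts": [["The First AFL–NFL World Championship Game was played on January 15, 1967."]],
    "ground_truth": ["The first superbowl was held on January 15, 1967"],
})

# evaluate() invokes the metric internally, so faithfulness is never called like a function
result = evaluate(dataset, metrics=[faithfulness])
print(result)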

To continue talking to Dosu, mention @dosu.

Senthselvi commented 5 days ago

Ok got it

Senthselvi commented 4 days ago

Why is an empty DataFrame returned?

llm, embed_model = init_llm_ollama(base_url, bearer_token)

def evaluate(
    dataset: List[LLMTestCase],
    metrics: List[Callable],
    llm: llm,
    embeddings: embed_model,
    raise_exceptions: bool = True,
) -> pd.DataFrame:
    results = []

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)
    return results_df

Sample data

data = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts': [
        ['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'],
        ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference']
    ],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}

data_samples = Dataset.from_dict(data)

Example usage

test_cases = [
    LLMTestCase(
        question=question,
        answer=answer,
        contexts=context,
        ground_truth=ground_truth
    )
    for question, answer, ground_truth, context in zip(
        data_samples['question'],
        data_samples['answer'],
        data_samples['ground_truth'],
        data_samples['contexts']
    )
]

List of metrics functions

metrics = [
    answer_relevancy,
    context_precision,
    # faithfulness,
    # context_recall
]

Evaluate

async def main():
    results_df = evaluate(
        dataset=test_cases,
        metrics=[Faithfulness],
        llm=llm,
        embeddings=embed_model,
        raise_exceptions=False
    )

    print(results_df)
    results_df.to_csv(r'C:\TestingBot\Scripts\V1\myfile.csv', sep=',')

Run the async main function

asyncio.run(main())

dosubot[bot] commented 4 days ago

An Empty DataFrame is returned when evaluating the dataset using the evaluate function because the dataset specified by dataset_name does not exist in LangSmith. The function raises a ValueError if the dataset is not found, which indicates that the dataset must be present in LangSmith for the evaluation to proceed [1].

Additionally, the evaluate function in the Ragas framework requires the dataset to be in a specific format with columns such as question, contexts, answer, and ground_truth. If the dataset does not conform to this format or if the required columns are missing, the function will raise a ValueError, leading to an empty or non-existent result [2].

Ensure that your dataset is correctly formatted and exists in LangSmith. Also, make sure you have set the necessary environment variables, such as OPENAI_API_KEY, if you are using metrics that require access to external APIs like OpenAI.
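As a rough sketch of how the custom Ollama model and embeddings could be passed to ragas' own evaluate instead of a hand-rolled wrapper (assuming the llm and embed_model returned by init_llm_ollama are LangChain-compatible; exact keyword support can vary between ragas versions):

from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision

# data_samples is the Dataset built earlier with Dataset.from_dict(data)
result = evaluate(
    data_samples,
    metrics=[answer_relevancy, context_precision],
    llm=llm,                 # custom LLM instead of the default OpenAI model
    embeddings=embed_model,
    raise_exceptions=False,
)

results_df = result.to_pandas()   # convert the ragas Result into a pandas DataFrame
print(results_df)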

To continue talking to Dosu, mention @dosu.

Senthselvi commented 3 days ago

Does ragas support the llama3.1:8b model? I am getting NaN output.

dosubot[bot] commented 3 days ago

Common reasons for getting NaN outputs in the Ragas evaluation framework include:

  1. Exceptions during metric evaluation: If any of the metrics fail during evaluation and raise_exceptions is set to False, the evaluation will return np.nan for the row that failed (see the sketch after this list).
  2. Validation failures: If the required columns for the metrics are missing or if the columns are of the wrong format, it can lead to NaN outputs.
  3. Unsupported sample type: If the sample type is unsupported, it can cause issues in the evaluation process, potentially leading to NaN outputs.
  4. Initialization issues: If the language model (LLM) or embeddings are not properly initialized, it can result in NaN outputs for metrics that depend on them [1].
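To surface the underlying error instead of a silent NaN, one quick check (a sketch reusing the dataset, LLM, and embeddings from earlier in this thread) is to re-run with raise_exceptions=True so the first failing metric raises its real exception:

from ragas import evaluate
from ragas.metrics import faithfulness

# With raise_exceptions=True a failing metric raises instead of being recorded as NaN
result = evaluate(
    data_samples,
    metrics=[faithfulness],
    llm=llm,
    embeddings=embed_model,
    raise_exceptions=True,
)
print(result)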

Regarding support for the llama3.1:8b model, I couldn't find specific information in the repository. However, you can refer to the example code for integrating LlamaIndex with Ragas, which might help you set up and evaluate your model correctly [2].

To continue talking to Dosu, mention @dosu.

Senthselvi commented 3 days ago

Python version: 3.11, Ragas: 0.1.16

Senthselvi commented 3 days ago

Can you share the correct evaluation metrics?

dosubot[bot] commented 3 days ago

The correct evaluation metrics to use with Ragas version 0.1.16 in Python 3.11 are:

  1. context_precision
  2. context_recall
  3. faithfulness
  4. answer_relevancy
  5. context_utilization
  6. context_entity_recall
  7. noise_sensitivity
  8. summarization_score
  9. answer_semantic_similarity
  10. answer_correctness

You can import and use these metrics as follows:

from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_utilization,
    context_entity_recall,
    noise_sensitivity,
    summarization_score,
    answer_semantic_similarity,
    answer_correctness,
)

These metrics measure various aspects of your Retrieval Augmented Generation (RAG) systems:

  1. Faithfulness - Measures the factual consistency of the answer to the context based on the question.
  2. Context_precision - Measures how relevant the retrieved context is to the question, conveying the quality of the retrieval pipeline.
  3. Answer_relevancy - Measures how relevant the answer is to the question.
  4. Context_recall - Measures the retriever's ability to retrieve all necessary information required to answer the question.
  5. Context_utilization - Evaluates how effectively the retrieved context is used in generating the answer.
  6. Context_entity_recall - Measures the recall of entities in the retrieved context.
  7. Noise_sensitivity - Assesses the robustness of the system to noisy inputs.
  8. Summarization_score - Evaluates the quality of summaries generated by the system.
  9. Answer_semantic_similarity - Measures the semantic similarity between the generated answer and the expected answer.
  10. Answer_correctness - Evaluates the correctness of the generated answer.

To run the evaluation on your dataset with these metrics, you can use the evaluate function provided by Ragas:

from ragas import evaluate
from datasets import Dataset

# Example dataset
dataset = Dataset.from_dict({
    'question': ["What is the capital of France?"],
    'contexts': [["Paris is the capital of France."]],
    'answer': ["Paris"],
    'ground_truth': ["Paris"]
})

# Run evaluation
result = evaluate(dataset)
print(result)

This will give you a comprehensive view of your RAG system's performance based on the specified metrics [1][2][3].
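If you want the scores as a DataFrame (for example to save them to CSV as in the earlier snippet), the returned result object can be converted with to_pandas():

df = result.to_pandas()
df.to_csv('ragas_scores.csv', index=False)  # path is just an example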

To continue talking to Dosu, mention @dosu.

Senthselvi commented 3 days ago

Error while evaluating: "verify that the data type in your code aligns with the expected pyarrow data types."

dosubot[bot] commented 3 days ago

To ensure that your data types align with the expected pyarrow data types in the Ragas evaluation framework, verify that your dataset columns have the following types:

  1. question: string
  2. answer: string
  3. contexts: sequence of strings (one list of strings per row)
  4. ground_truth: string

Here's an example of how to prepare your dataset with the correct data types:

from datasets import Dataset

# Example dataset
data = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts': [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
                 ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}

# Convert to Hugging Face Dataset
dataset = Dataset.from_dict(data)

# Verify the data types
print(dataset)

# Example usage with Ragas evaluation
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_utilization,
    context_entity_recall,
    noise_sensitivity,
    summarization_score,
    answer_semantic_similarity,
    answer_correctness,
)

# List of metrics
metrics = [
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_utilization,
    context_entity_recall,
    noise_sensitivity,
    summarization_score,
    answer_semantic_similarity,
    answer_correctness,
]

# Run evaluation
result = evaluate(dataset, metrics=metrics)
print(result)

Ensure that your dataset conforms to these data types to avoid errors during evaluation [1][2][3].

To continue talking to Dosu, mention @dosu.

Senthselvi commented 3 days ago

'dict' object has no attribute 'rename_columns'

dosubot[bot] commented 3 days ago

You are getting the error "'dict' object has no attribute 'rename_columns'" because the rename_columns method is being called on a dictionary object instead of a Dataset object. The rename_columns method is a method of the Dataset class from the datasets library, not a dictionary.

In the provided code, the rename_columns method is used in the convert_v1_to_v2_dataset and convert_v2_to_v1_dataset functions, which expect a Dataset object as their input. If you pass a dictionary to these functions, you will encounter this error.

Here is the relevant code snippet:

def convert_v1_to_v2_dataset(dataset: Dataset) -> Dataset:
    columns_map = {v: k for k, v in REQUIRED_COLS_v1.items() if v in dataset.features}
    return dataset.rename_columns(columns_map)

def convert_v2_to_v1_dataset(dataset: Dataset) -> Dataset:
    columns_map = {k: v for k, v in REQUIRED_COLS_v1.items() if k in dataset.features}
    return dataset.rename_columns(columns_map)

Additionally, you can refer to the remap_column_names function from the ragas library, which also demonstrates how to remap column names in a Dataset object:

def remap_column_names(dataset: Dataset, column_map: dict[str, str]) -> Dataset:
    """
    Remap the column names in case dataset uses different column names
    """
    inverse_column_map = {v: k for k, v in column_map.items()}
    return dataset.rename_columns(inverse_column_map)

Make sure that you are passing a Dataset object to these functions, not a dictionary [1][2].
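If your data currently lives in a plain dict, build a Dataset from it first; a minimal sketch (the 'query' column name is just an example of a column that needs renaming):

from datasets import Dataset

# A plain dict has no rename_columns; convert it to a Dataset first.
raw = {
    'query': ['When was the first super bowl?'],
    'answer': ['The first superbowl was held on Jan 15, 1967'],
    'contexts': [['The first game was played on January 15, 1967.']],
    'ground_truth': ['The first superbowl was held on January 15, 1967'],
}

dataset = Dataset.from_dict(raw)
dataset = dataset.rename_columns({'query': 'question'})  # works on a Dataset, not on a dict
print(dataset.column_names)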

To continue talking to Dosu, mention @dosu.

Senthselvi commented 3 days ago

AttributeError: 'LLMTestCase' object has no attribute 'items'

data = LLMTestCase(
    question=query,
    answer=actual_output,
    ground_truth="Vector databases store and provide access to unstructured data",
    contexts=retrieval_context,
)

dosubot[bot] commented 3 days ago

The AttributeError: 'LLMTestCase' object has no attribute 'items' suggests that somewhere in your code, an LLMTestCase object is being treated as a dictionary. This typically happens when a method or function expects a dictionary but receives an LLMTestCase object instead.

To resolve this, ensure that you are correctly using the LLMTestCase class and not mistakenly treating it as a dictionary. Here is an example of how to correctly instantiate and use an LLMTestCase object:

# Assuming LLMTestCase is defined correctly
data = LLMTestCase(
    question=query,
    answer=actual_output,
    ground_truth="Vector databases store and provide access to unstructured data",
    contexts=retrieval_context,
)

# Example usage
print(data.question)  # Accessing attributes of LLMTestCase
print(data.answer)
print(data.ground_truth)
print(data.contexts)

If you need to convert an LLMTestCase object to a dictionary for some reason, you can manually create the dictionary:

data_dict = {
    'question': data.question,
    'answer': data.answer,
    'ground_truth': data.ground_truth,
    'contexts': data.contexts,
}

# Now you can use data_dict as a dictionary

Ensure that any function or method expecting a dictionary receives data_dict instead of the LLMTestCase object directly. This should resolve the AttributeError.
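If the end goal is to feed these test cases into ragas' evaluate, one hedged sketch (assuming LLMTestCase exposes the question, answer, ground_truth, and contexts attributes used in the constructor above) is to flatten a list of them into a column-oriented dict and build a Dataset:

from datasets import Dataset

test_cases = [data]  # or however many LLMTestCase objects you have

dataset = Dataset.from_dict({
    'question': [tc.question for tc in test_cases],
    'answer': [tc.answer for tc in test_cases],
    'contexts': [tc.contexts for tc in test_cases],      # each entry is a list of strings
    'ground_truth': [tc.ground_truth for tc in test_cases],
})
print(dataset)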

To continue talking to Dosu, mention @dosu.

Senthselvi commented 3 days ago

ValueError: Dataset feature "contexts" should be of type Sequence[string], got <class 'datasets.features.features.Sequence'>

data = {
    'question': [query],  # List with a single question
    'answer': [actual_output],  # List with a single answer
    'contexts': [retrieval_context],  # List with a single list of contexts (retrieval_context should be a list of strings)
    'ground_truth': [
        "Vector databases store and provide access to unstructured data, such as text or images, in the form of their vector embeddings. "
        "Because similar objects are close together in vector space, the similarity of data objects can be calculated based on the distance between the data object’s vector embeddings."
    ]  # List with a single ground truth
}

dosubot[bot] commented 3 days ago

To resolve the ValueError: "Dataset feature 'contexts' should be of type Sequence[string], got <class 'datasets.features.features.Sequence'>", you need to ensure that the contexts field is correctly specified as a sequence of strings. Here is how you can adjust your dataset creation to meet the expected format:

  1. Ensure that retrieval_context is a list of strings.
  2. Explicitly define the feature types when creating the dataset.

Here is the corrected code snippet:

from datasets import Dataset, Features, Sequence, Value

# Ensure retrieval_context is a list of strings
retrieval_context = [
    "Context 1",
    "Context 2",
    # Add more contexts as needed
]

# Define your data
data = {
    'question': [query],  # List with a single question
    'answer': [actual_output],  # List with a single answer
    'contexts': [retrieval_context],  # List with a single list of contexts
    'ground_truth': [
        "Vector databases store and provide access to unstructured data, such as text or images, in the form of their vector embeddings. "
        "Because similar objects are close together in vector space, the similarity of data objects can be calculated based on the distance between the data object’s vector embeddings."
    ]  # List with a single ground truth
}

# Define the features explicitly
features = Features({
    'question': Value('string'),
    'answer': Value('string'),
    'contexts': Sequence(Value('string')),
    'ground_truth': Value('string')
})

# Create the dataset
dataset = Dataset.from_dict(data, features=features)

# Verify the dataset
print(dataset)

This code ensures that the contexts field is recognized as a sequence of strings, which should resolve the ValueError [1][2][3].
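If retrieval_context is already a plain Python list of strings, explicit features are often unnecessary; letting datasets infer the schema and then inspecting it is a quick sanity check (a sketch reusing the data dict above):

dataset = Dataset.from_dict(data)
print(dataset.features['contexts'])  # should report a Sequence of string values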

To continue talking to Dosu, mention @dosu.

Senthselvi commented 3 days ago

KeyError('summary')
Exception raised in Job[2]: TypeError(CustomOllama.agenerate_text() got an unexpected keyword argument 'n')

dosubot[bot] commented 3 days ago

You are encountering a KeyError('summary') and a TypeError stating that CustomOllama.agenerate_text() got an unexpected keyword argument 'n'.

  1. KeyError('summary'):

    • Ensure that the data structure you are accessing contains the 'summary' key. The expected keys are summary and questions [1][2].
  2. TypeError(CustomOllama.agenerate_text() got an unexpected keyword argument 'n'):

    • The CustomOllama class's agenerate_text method does not support the n argument. Update the method signature to include the n argument:
async def agenerate_text(
    self,
    prompt: PromptValue,
    n: int = 1,
    temperature: t.Optional[float] = None,
    stop: t.Optional[t.List[str]] = None,
    callbacks: t.List = [],
) -> LLMResult:
    # Your implementation here

This should resolve the issues you are facing [3][4].
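As a hedged sketch of what the updated method body might look like (assuming CustomOllama has some internal helper, here called _call_ollama, that returns one completion string; that helper name is hypothetical):

import typing as t

from langchain_core.outputs import Generation, LLMResult
from langchain_core.prompt_values import PromptValue

class CustomOllama:
    async def agenerate_text(
        self,
        prompt: PromptValue,
        n: int = 1,
        temperature: t.Optional[float] = None,
        stop: t.Optional[t.List[str]] = None,
        callbacks: t.List = [],
    ) -> LLMResult:
        # Hypothetical helper: one completion request to the Ollama server per candidate
        texts = [await self._call_ollama(prompt.to_string(), temperature, stop) for _ in range(n)]
        # ragas expects an LLMResult whose first generations entry holds the n candidates
        return LLMResult(generations=[[Generation(text=text) for text in texts]])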

To continue talking to Dosu, mention @dosu.

Senthselvi commented 2 days ago

AttributeError('CustomOllama' object has no attribute 'model_name')

dosubot[bot] commented 2 days ago

The AttributeError: 'CustomOllama' object has no attribute 'model_name' indicates that the CustomOllama class is missing the model_name attribute. To resolve this, you need to ensure that the CustomOllama class includes the model_name attribute.

Here is an example of how you might define the CustomOllama class with the model_name attribute:

import typing as t

from langchain_core.outputs import LLMResult

class CustomOllama:
    def __init__(self, model_name: str, other_params: dict):
        self.model_name = model_name
        # Initialize other attributes as needed
        for key, value in other_params.items():
            setattr(self, key, value)

    async def agenerate_text(
        self,
        prompt: str,
        n: int = 1,
        temperature: float = 1e-8,
        stop: t.Optional[t.List[str]] = None,
        callbacks: t.List = [],
    ) -> LLMResult:
        # Your implementation here
        pass

    # Add other methods as needed

Ensure that when you instantiate the CustomOllama class, you provide the model_name:

ollama = CustomOllama(model_name="llama3.1:8b", other_params={})

This should resolve the AttributeError by ensuring that the model_name attribute is properly defined and accessible in the CustomOllama class.

To continue talking to Dosu, mention @dosu.