explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

[R-240] (docs): Document how to evaluate with a locally hosted LLM to help choose the ones that work best #859

Open Exploding-squid opened 2 months ago

Exploding-squid commented 2 months ago

[X] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug

I have a locally hosted LLM that I intend to use as a judge LLM. When evaluating a simple test dataset, the Faithfulness scores do not appear to be computed correctly.

I am using a self-hosted version of Zephyr-7B downloaded from Hugging Face, together with all-MiniLM-V2 as my Sentence Transformers embedding model.

Ragas version: 1.16
Python version: 3.8

Code to Reproduce

import time

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import torch

from datasets import load_dataset, Dataset
from transformers import pipeline, GenerationConfig
from langchain_core.embeddings import Embeddings
from langchain_community.llms import HuggingFacePipeline
from langchain_community.embeddings import HuggingFaceEmbeddings

from ragas import evaluate
from ragas.llms import BaseRagasLLM
from ragas.llms.base import LangchainLLMWrapper
from ragas.embeddings.base import LangchainEmbeddingsWrapper
from ragas.metrics import (
    answer_relevancy,
    answer_similarity,
    answer_correctness,
    faithfulness,
    context_recall,
    context_precision,
    context_relevancy,
)

# Judge LLM: locally hosted Zephyr-7B served through a transformers pipeline
pipe = pipeline(task="text-generation", model="path/to/LLM",
                torch_dtype=torch.bfloat16, device_map="auto", max_new_tokens=2000)
hf = HuggingFacePipeline(pipeline=pipe)
llm = LangchainLLMWrapper(hf)

# Embeddings: local Sentence Transformers model on CPU
model_name = "/path/to/sentencetransformers/model"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
hf_e = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
embedder = LangchainEmbeddingsWrapper(hf_e)
print("llm loaded")

# Minimal test dataset
test_data = {
    'question': ['What is the capital of France?', 'Who wrote "Romeo and Juliet"?'],
    'contexts': [['Bananas are an excellent source of potassium', 'France is known for its cuisine.', 'Data is more valuable than oil'],
                 ['William Shakespeare wrote "Romeo and Juliet".', 'The play is a tragedy.']],
    'answer': ['Paris', 'William Shakespeare'],
    'ground_truth': ['Paris', 'William Shakespeare']
}
dataset_eval = Dataset.from_dict(test_data)

start_time = time.time()

results = evaluate(dataset_eval,
                   metrics=[answer_similarity,
                            answer_relevancy,
                            answer_correctness,
                            faithfulness,
                            context_recall,
                            context_precision,
                            context_relevancy,
                            ],
                   llm=llm, embeddings=embedder)

print(results)

finish_time = time.time()

print(f"time elapsed (minutes) = {(finish_time - start_time) / 60}")

Error trace

Loading checkpoint shards: 100%|██████████| 8/8 [00:59<00:00, 7.40s/it]
llm loaded
Evaluating:   0%|          | 0/14 [00:00<?, ?it/s]
/home/venv/lib64/python3.8/site-packages/transformers/generation/utils.py:1421: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
/home/venv/lib64/python3.8/site-packages/transformers/pipelines/base.py:1101: UserWarning: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
  warnings.warn(
[the pipelines warning above repeats after most steps; omitted below]
Evaluating:   7%|▋         | 1/14 [00:00<00:10,  1.24it/s]
Evaluating:  21%|██▏       | 3/14 [01:57<07:56, 43.34s/it]
Evaluating:  29%|██▊       | 4/14 [20:32<1:09:01, 414.16s/it]
Evaluating:  36%|███▌      | 5/14 [20:33<41:36, 277.40s/it]
Evaluating:  43%|████▎     | 6/14 [20:34<25:10, 188.85s/it]
Evaluating:  50%|█████     | 7/14 [20:34<15:06, 129.52s/it]
Failed to parse output. Returning None.
Evaluating:  57%|█████▋    | 8/14 [23:57<15:14, 152.39s/it]
Failed to parse output. Returning None.
Evaluating:  64%|██████▍   | 9/14 [33:41<23:43, 284.74s/it]
Failed to parse output. Returning None.
Evaluating:  71%|███████▏  | 10/14 [33:50<13:22, 200.63s/it]
Failed to parse output. Returning None.
Evaluating:  79%|███████▊  | 11/14 [33:51<07:00, 140.07s/it]
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Evaluating: 100%|██████████| 14/14 [38:50<00:00, 166.48s/it]
/home/venv/lib64/python3.8/site-packages/ragas/evaluation.py:276: RuntimeWarning: Mean of empty slice
  value = np.nanmean(self.scores[cn])

{'answer_similarity': 1.0000, 'answer_relevancy': 0.5079, 'answer_correctness': 0.6500, 'faithfulness': nan, 'context_recall': 0.0000, 'context_precision': 0.0000, 'context_relevancy': 1.0000}
time elapsed (minutes) = 38.91461242834727

Expected behavior

Faithfulness should return a number.

I am also unsure whether Context Recall and Context Precision are evaluated correctly here.


R-240

bsl1997 commented 2 months ago

Perhaps you should check whether, when computing faithfulness, the model actually returned the result you need in the expected format, i.e. as in the example given in the prompt. nan occurs when the evaluation model does not return the expected result. The code where the problem arises is:


faithful_statements = sum(
    verdict_score_map.get(
        str(statement_with_validation.get("verdict", "")), np.nan
    )
    if isinstance(statement_with_validation, dict)
    else np.nan
    for statement_with_validation in output
)
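
For illustration, here is a minimal sketch of why that summation produces nan: assuming the parser hands it a list of dicts with a "verdict" key, as the snippet above implies, any element that is missing the key, has an unexpected value, or is not a dict at all contributes np.nan and poisons the whole sum. The verdict_score_map values below are illustrative, not copied from ragas internals.

import numpy as np

# Illustrative mapping, mirroring the snippet above (assumed, not ragas source)
verdict_score_map = {"1": 1, "0": 0}

def faithful_statement_count(output):
    # Any element that is not a dict with a recognised "verdict" becomes np.nan,
    # and a single nan makes the whole sum (and hence the metric) nan.
    return sum(
        verdict_score_map.get(str(item.get("verdict", "")), np.nan)
        if isinstance(item, dict)
        else np.nan
        for item in output
    )

print(faithful_statement_count([{"verdict": 1}, {"verdict": 0}]))      # 1
print(faithful_statement_count(["free-form text the judge LLM emitted"]))  # nan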
itech001 commented 2 months ago

I hit a similar error. How can it be fixed? How can we change the default ragas prompts?
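
One way to at least inspect the prompts a metric currently uses, without relying on a specific ragas API, is plain attribute introspection; the attribute names vary by ragas version, so the sketch below discovers them rather than hard-coding any.

from ragas.metrics import faithfulness

# Find prompt-like attributes on the metric object; the exact names depend on the
# installed ragas version, so discover them instead of assuming them.
prompt_attrs = [name for name in dir(faithfulness) if "prompt" in name.lower()]
print(prompt_attrs)

for name in prompt_attrs:
    print(f"\n=== {name} ===")
    print(getattr(faithfulness, name))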

Exploding-squid commented 2 months ago

Perhaps you should check whether, when computing faithfulness, the model actually returned the result you need in the expected format, i.e. as in the example given in the prompt. nan occurs when the evaluation model does not return the expected result. The code where the problem arises is:

faithful_statements = sum(
    verdict_score_map.get(
        str(statement_with_validation.get("verdict", "")), np.nan
    )
    if isinstance(statement_with_validation, dict)
    else np.nan
    for statement_with_validation in output
)

Thanks for the suggestion - I've looked in metrics.faithfulness and ran this slightly modified code to print the generated text:

def _compute_score(self, answers: StatementFaithfulnessAnswers):
    # check the verdicts and compute the score
    for answer in answers.__root__:
        print("\nreturned answer = ", answer.verdict)

    faithful_statements = sum(
        1 if answer.verdict else 0 for answer in answers.__root__
    )
    print("\nfaithful statements = ", faithful_statements)

    num_statements = len(answers.__root__)
    if num_statements:
        score = faithful_statements / num_statements
    else:
        logger.warning("No statements were generated from the answer.")
        score = np.nan

    return score

async def _ascore(
    self: t.Self, row: t.Dict, callbacks: Callbacks, is_async: bool
) -> float:
    """
    returns the NLI score for each (q, c, a) pair
    """
    assert self.llm is not None, "LLM is not set"
    p_value = self._create_answer_prompt(row)

    answer_result = await self.llm.generate(
        p_value, callbacks=callbacks, is_async=is_async
    )
    print("\n answer_result = ", answer_result)
    answer_result_text = answer_result.generations[0][0].text
    # print("\n answer_result text = ", answer_result_text)

    statements = await _statements_output_parser.aparse(
        answer_result_text, p_value, self.llm, self.max_retries
    )
    print("\n ascore statements = ", statements)
    if statements is None:
        return np.nan

    p_value = self._create_nli_prompt(row, statements.__root__)
    nli_result = await self.llm.generate(
        p_value, callbacks=callbacks, is_async=is_async
    )
    nli_result_text = nli_result.generations[0][0].text

    faithfulness = await _faithfulness_output_parser.aparse(
        nli_result_text, p_value, self.llm, self.max_retries
    )
    if faithfulness is None:
        return np.nan

    return self._compute_score(faithfulness)

This returns the following (when I only evaluate faithfulness):

Evaluating:   0%|          | 0/2 [00:00<?, ?it/s]
/home/venv/lib64/python3.8/site-packages/transformers/generation/utils.py:1421: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(

 answer_result =  generations=[[Generation(text="\nquestion: What is the largest city in the world?\nanswer: Tokyo\nstatements: \n\nquestion: What is the highest mountain in the world?\nanswer: Mount Everest\nstatements: \n\nquestion: What is the smallest country in the world?\nanswer: Vatican City\nstatements: \n\nquestion: What is the longest river in the world?\nanswer: Nile\nstatements: \n\nquestion: What is the largest desert in the world?\nanswer: Antarctica\nstatements: \n\nquestion: What is the largest ocean in the world?\nanswer: Pacific Ocean\nstatements: \n\nquestion: What is the largest animal in the world?\nanswer: Blue Whale\nstatements: \n\nquestion: What is the largest bird in the world?\nanswer: Ostrich\nstatements: \n\nquestion: What is the largest reptile in the world?\nanswer: Saltwater Crocodile\nstatements: \n\nquestion: What is the largest mammal in the world?\nanswer: Blue Whale\nstatements: \n\nquestion: What is the largest fish in the world?\nanswer: Whale Shark\nstatements: \n\nquestion: What is the largest primate in the world?\nanswer: Gorilla\nstatements: \n\nquestion: What is the largest carnivore in the world?\nanswer: Polar Bear\nstatements: \n\nquestion: What is the largest predator in the world?\nanswer: Killer Whale\nstatements: \n\nquestion: What is the largest cat in the world?\nanswer: Lion\nstatements: \n\nquestion: What is the largest dog in the world?\nanswer: English Mastiff\nstatements: \n\nquestion: What is the largest bird of prey in the world?\nanswer: Philippine Eagle\nstatements: \n\nquestion: What is the largest snake in the world?\nanswer: Green Anaconda\nstatements: \n\nquestion: What is the largest lizard in the world?\nanswer: Komodo Dragon\nstatements: \n\nquestion: What is the largest insect in the world?\nanswer: Giant Weta\nstatements: \n\nquestion: What is the largest ant in the world?\nanswer: Dinomus Megalopus\nstatements: \n\nquestion: What is the largest butterfly in the world?\nanswer: Queen Alexandra's Birdwing\nstatements: \n\nquestion: What is the largest bat in the world?\nanswer: Giant Golden-Capped Flying Fox\nstatements: \n\nquestion: What is the largest rodent in the world?\nanswer: Capybara\nstatements: \n\nquestion: What is the largest marsupial in the world?\nanswer: Red Kangaroo\nstatements: \n\nquestion: What is the largest bird by wingspan in the world?\nanswer: Wandering Albatross\nstatements: \n\nquestion: What is the largest bird by weight in the world?\nanswer: Ostrich\nstatements: \n\nquestion: What is the largest bird by height in the world?\nanswer: Ostrich\nstatements: \n\nquestion: What is the largest bird by length in the world?\nanswer: Ostrich\nstatements: \n\nquestion: What is the largest bird by beak in the world?\nanswer: Shoebill Stork\nstatements: \n\nquestion: What is the largest bird by eye in the world?\nanswer: Emu\nstatements: \n\nquestion: What is the largest bird by wing in the world?\nanswer: Wandering Albatross\nstatements: \n\nquestion: What is the largest bird by wing chord in the world?\nanswer: Andean Condor\nstatements: \n\nquestion: What is the largest bird by wing loading in the world?\nanswer: Stork-billed Kingfisher\nstatements: \n\nquestion: What is the largest bird by wing span to body weight ratio in the world?\nanswer: Swainson's Hawk\nstatements: \n\nquestion: What is the largest bird by number of feathers in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of eggs in the world?\nanswer: Emperor Penguin\nstatements: 
\n\nquestion: What is the largest bird by number of chicks in the world?\nanswer: Emperor Penguin\nstatements: \n\nquestion: What is the largest bird by number of feathers per gram in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square centimeter in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per cubic centimeter in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per liter in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per kilogram in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per tonne in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per metric tonne in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square meter in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per hectare in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per acre in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square kilometer in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square mile in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square yard in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square foot in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square inch in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square centimeter per gram in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square centimeter per kilogram in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square centimeter per metric tonne in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square centimeter per square meter in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square centimeter per hectare in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square centimeter per acre in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square centimeter per square kilometer in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square centimeter per square mile in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square centimeter per square yard in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square centimeter per square foot in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the 
largest bird by number of feathers per square centimeter per square inch in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square centimeter per gram per kilogram in the world?\nanswer: Common Pheasant\nstatements: \n\nquestion: What is the largest bird by number of feathers per square centimeter per gram per metric tonne in the world?\nanswer: Common Pheasant\nstatements: ")]] llm_output=None run=[RunInfo(run_id=UUID('395fe1f9-7b82-4cbb-9164-917547db817b'))]

 answer_result =  generations=[[Generation(text="\nI'm not able to run your code, but I can provide you with some tips:\n\n- Use the `nltk` library to perform NLP tasks.\n- Use the `spacy` library to perform NLP tasks.\n- Use the `spaCy` library to perform NLP tasks.\n- Use the `nltk` library to perform NLP tasks.\n- Use the `nltk` library to perform NLP tasks.

# repeat x 100

\n- Use the `nltk` library to")]] llm_output=None run=[RunInfo(run_id=UUID('b02aa21a-6f2e-4afd-a891-4428c4bfbc04'))]
Failed to parse output. Returning None.
Evaluating:  50%|█████     | 1/2 [06:21<06:21, 381.96s/it]
 ascore statements =  None
Failed to parse output. Returning None.
Evaluating: 100%|██████████| 2/2 [08:38<00:00, 259.08s/it]
/home/venv/lib64/python3.8/site-packages/ragas/evaluation.py:276: RuntimeWarning: Mean of empty slice
  value = np.nanmean(self.scores[cn])

 ascore statements =  None
{'faithfulness': nan}
time elapsed (minutes) = 8.703778608640034

Process finished with exit code 0

i.e. the text generation step isn't working correctly here. I'm unsure why the generation only misbehaves for faithfulness: the other metrics described in the original post, which also require an LLM, work as intended.
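
One thing that may be worth trying (a sketch, not a verified fix for Zephyr): since the trace shows the model continuing to invent new question/answer pairs after the prompt, constraining the transformers pipeline can keep the judge output shorter and closer to the expected format. return_full_text=False, a smaller max_new_tokens, and greedy decoding are standard transformers pipeline/generation options; whether they make the ragas parser succeed here is untested.

import torch
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline

# Same judge model as above, but with a tighter generation budget so it is
# less likely to keep generating unrelated Q/A pairs after its answer.
pipe = pipeline(
    task="text-generation",
    model="path/to/LLM",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_new_tokens=512,       # smaller budget than the original 2000
    return_full_text=False,   # return only the completion, not the prompt
    do_sample=False,          # greedy decoding for more predictable output
)
hf = HuggingFacePipeline(pipeline=pipe)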

bGh83 commented 2 months ago

Similar issue. (It's for evaluating a RAG chatbot; the LLM is mistral-openorca hosted locally in Ollama; Windows 10.)

Z:\bas-rag-chatbot> python .\rag-evaluate.py

 The button in the "Process Returns" page is called "Process."<|im_end|>

{'question': 'In "Process Returns" page of WMS1 application, there is a button placed below data fields, what is the name of the button ?', 'answer': ' The button in the "Process Returns" page is called "Process."<|im_end|>', 'contexts': ['In WMS1, a new page titled \'Process Returns\' should be available, accessible through the menu path Returns > Inbound > Process Returns. 2. The page has the following data fields (text box) with following labels aligned vertically in middle of the page:\n    - PPID\n    - Depot Id\n3. The page has a button labelled "Process" placed below data fields. 4. The page 
has a text area labelled "Request Data" placed below "Process" button. 4. The PPID text box can only accept a value of size 20 alphanumeric characters. 5. The Depot Id text box can only accept a value of size 5 numeric characters. 6. When user clicks "Process" button, then WMS1 will validate the data. 7. If the data is valid, then WMS1 should identify a storage location of type SHELF in the DAMAGED zone of the depot. (Refer ID: DAT002 for sample valid data)\n8. If a storage location is found, then WMS1 should display the following data at "Request Data" in json format (Refer ID: DAT001 for sample valid data):\n    - PPID\n    - DEPOT_ID\n    - SKU\n    - INVENTORY_OWNER\n    - LOCATION_ID\n    - LOCATION_TYPE\n    - LOCATION_ZONE\n    - TRANSACTION_TYPE    \n9. The following request data is always fixed:\n    - INVENTORY_OWNER: ABC\n    - LOCATION_TYPE: SHELF\n    - LOCATION_ZONE: DAMAGED   
 \n    - TRANSACTION_TYPE: RETURNS    \n10. The system derives "SKU" from the first 5 characters of "PPID"\n11. WMS1 will display an user-friendly error message in "Request Data" for the following reasons:\n    - invalid user 
data\n    - location is not available\n    - failure to generate the request\n\n[End of User Story]\n[Start of User Story]\n\n##User Story (ID: US0002)\n\nDescription:\nAs a warehouse clerk (user) of XYZ, I want to include customer name to product returns data, so that I can document the source of return in WMS1. Acceptance Criteria:\n1.', '[Start of User Story]\n\n##User Story (ID: US0001)\n\nDescription:\nAs a warehouse clerk (user) of XYZ, I want to be able to update the inventory records in WMS1 (a warehouse management system) by providing information from a return label in the product box, so that I can document the return details of a product in WMS1. Acceptance Criteria:\n1.', 'Based on User Story (ID: US0001), have one more text box with following label in "Process Returns" page:\n\t- Customer Name\n2. The Customer Name text box can accept a value of range 3 to 30 alphabetic characters including space. 3. The Customer Name text box is placed below PPID text box. 4. A checkbox is placed next to Customer Name text box. When user clicks the check box, then WMS1 will enable the Customer text box allowing user to enter customer name. 5. The request data must contain customer name, if user provides it during data submission. [End of User Story]\nList of sample request data generated by WMS1 for Process Returns (ID: DAT001)\n\n1. Default Data as per User Story (ID: US0001):\n\n{\n\t"PPID" : "A234567890123456789Z",\n\t"DEPOT_ID" : "56789",\n\t"SKU" : "A2345",\n\t"INVENTORY_OWNER" : "ABC",\n\t"LOCATION_ID" : "LO-001-A-01",\n\t"LOCATION_TYPE" : "AREA",\n\t"LOCATION_ZONE" : "DAMAGED",\n\t"TRANSACTION_TYPE" : "RETURNS"\n}\n\n2. Data With customer name  as per User Story (ID: US0002)::\n\n{\n\t"CUSTOMER_NAME" : "Mike Jr.",\n\t"PPID" : "A266567890123456666Z",\n\t"DEPOT_ID" : "56666",\n\t"SKU" : "A2665",\n\t"INVENTORY_OWNER" : "ABC",\n\t"LOCATION_ID" : "XO-100-Z-10",\n\t"LOCATION_TYPE" : "AREA",\n\t"LOCATION_ZONE" : "DAMAGED",\n\t"TRANSACTION_TYPE" : "RETURNS"\n}\nList of sample data entered by user for Process Returns (ID: DAT002)\n\n1. Default Data as per User Story (ID: US0001):\n\tPPID: A234567890123406789C\n\tDepot Id: 06789\n\n2. Data with customer name as per User Story (ID: US0002):\n\tPPID: A234567890123406789C\n\tDepot Id: 06789\n\tCustomer Name: Janet Minsky\n\n#XML template for customer status data with mandatory elements (ID: CR03-XML001)\n\n```xml\n<customerStatus>\n    <name></name>\n    <age></age>\n    <email></email>\n    <country></country>\n    <status></status>\n</customerStatus>\n```\nRules for generating customer status data (ID: CR03-DAT001)\n\nThe customer status data must be in xml format as mentioned in template ID: CR03-XML001. Following rules are applied on user data before generating the xml element values:\n- name: element value must be minimum 3 words. - age: element value must be between 1 and 99. - email: default element value is not-provided@somewhere.com. - country: element value must be a valid country code. - status: update value of status element based on the following rules (1 being the highest priority):\n\t1. if country is not UK, then the status is "Non Resident". 2. if age is above 99 or below 1, then the status is "Invalid Age". 3. if name contains word "Bond", then the status is "TBK".', '4. if name, age, country are not provided or have blank value, then the status is "Missing data". 5. if none of the above conditons are met, then the status is "OK".'], 'ground_truth': 'The name of the button is "Process"'}

Evaluating:   0%|                                                                                                                                                                                          | 0/3 [00:00<?, ?it/s]```xml
<customerStatus>
    <name>James Bond</name>
    <age>32</age>
    <email>james.bond@mi6.com</email>
    <country>UK</country>
    <status>OK</status>
</customerStatus>
```<|im_end|> {
    "reason": "The context provided detailed information about the 'Process Returns' page of WMS1 application, including data fields and their validation rules. It also mentioned the process that occurs when the user clicks on a button. The name of the button is explicitly stated as 'Process'.",
    "verdict": 1
}<|im_end|> Based on the context provided, the name of the button is "Process."<|im_end|>```xml
<customerStatus>
    <name></name>
    <age></age>
    <email>not-provided@somewhere.com</email>
    <country></country>
    <status></status>
</customerStatus>
```<|im_end|> {
    "reason": "The provided context does not provide any information about the 'Process Returns' page or the button within it, making it impossible to determine its name from the given context.",
    "verdict": 0
}<|im_end|> Based on the context provided, the name of the button is "Process."<|im_end|>Failed to parse output. Returning None.
Evaluating:  33%|███████████████████████████████████████████████████████████▎                                                                                                                      | 1/3 [00:24<00:48, 24.33s/it]
```xml
<customerStatus>
    <name></name>
    <age></age>
    <email>not-provided@somewhere.com</email>
    <country></country>
    <status></status>
</customerStatus>
```<|im_end|> Based on the provided context, the name of the button in the "Process Returns" page of WMS1 application is "Process."<|im_end|>```xml
<customerStatus>
    <name>James Bond</name>
    <age>32</age>
    <email>james.bond@mi6.com</email>
    <country>UK</country>
    <status>OK</status>
</customerStatus>
```<|im_end|>Failed to parse output. Returning None.
 {
    "reason": "The context provided does not contain any information related to the WMS1 application or its 'Process Returns' page. Therefore, it was not useful in arriving at the given answer.",
    "verdict": 0
}<|im_end|>```xml
<customerStatus>
    <name>James Bond</name>
    <age>35</age>
    <email>not-provided@somewhere.com</email>
    <country>GB</country>
    <status>OK</status>
</customerStatus>
```<|im_end|>Failed to parse output. Returning None.
 Based on the provided context, the name of the button in the "Process Returns" page of WMS1 application is "Process".<|im_end|>Failed to parse output. Returning None.
Evaluating:  67%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                                           | 2/3 [00:41<00:20, 20.28s/it]
```xml
<customerStatus>
    <name>James Bond</name>
    <age>35</age>
    <email>not-provided@somewhere.com</email>
    <country>GB</country>
    <status>OK</status>
</customerStatus>
```<|im_end|>Failed to parse output. Returning None.
Evaluating: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:45<00:00, 15.08s/it] 
PS Z:\bas-rag-chatbot> 

Code

from datasets import Dataset
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_community.embeddings import OllamaEmbeddings
import pandas as pd

from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)


def create_eval_dataset(llm, retriever):

    # `template` is the RAG prompt template defined elsewhere in the script
    prompt = ChatPromptTemplate.from_template(template)

    retrieval_augmented_qa_chain = (
        {"context": retriever,  "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

    # Ragas wants ['question', 'answer', 'contexts', 'ground_truth'] as:
    '''
    {
        "question": [], <-- question from faq doc
        "answer": [], <-- answer from generated result
        "contexts": [], <-- context
        "ground_truth": [] <-- actual answer
    }
    '''

    questions = [
        "In \"Process Returns\" page of WMS1 application, there is a button placed below data fields, what is the name of the button ?"]
    ground_truths = ["The name of the button is \"Process\""]
    answers = []
    contexts = []

    # Inference
    for query in questions:
        answers.append(retrieval_augmented_qa_chain.invoke(query))
        contexts.append(
            [docs.page_content for docs in retriever.get_relevant_documents(query)])

    # To dict
    data = {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths
    }

    # Convert dict to dataset
    dataset = Dataset.from_dict(data)

    return dataset


def evaluate_rag(llm, eval_dataset):
    # rag_df = pd.DataFrame(eval_dataset)
    # rag_eval_dataset = Dataset.from_pandas(rag_df)
    result = evaluate(
        llm=llm,
        embeddings=OllamaEmbeddings(model="nomic-embed-text"),
        dataset=eval_dataset,
        metrics=[
            context_precision,
            context_recall,
            faithfulness,
            answer_relevancy,
        ],
    )

    pd.set_option("display.max_colwidth", None)
    df = result.to_pandas()
    df.to_csv('results.csv')
helena75 commented 2 months ago

A custom LLM is asked to produce the evaluation score. When prompting an LLM, using the appropriate prompt template is crucial (example: https://huggingface.co/kaist-ai/prometheus-13b-v1.0). If the prompt is not in the proper format, the LLM's output can vary and may not be parsed correctly. I suspect this is the underlying problem.

As far as I know, ragas does not yet support different prompt templates for custom models. This would be crucial for making the use of custom models more reliable. Is anything planned on the ragas dev side?
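
For illustration, a minimal sketch of applying a model's own chat template with the transformers tokenizer before a prompt reaches the pipeline; apply_chat_template is a standard tokenizer method, but wiring this into the ragas LLM wrapper is not something ragas does automatically, so treat this as a manual workaround.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/LLM")  # e.g. the Zephyr checkpoint above

raw_prompt = "..."  # whatever instruction the evaluator would send to the judge LLM

# Wrap the raw prompt in the chat format the instruction-tuned model expects.
chat_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": raw_prompt}],
    tokenize=False,
    add_generation_prompt=True,
)
print(chat_prompt)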

jjmachan commented 2 months ago

hey - thanks for sharing your feedback here. This feels like a documentation issue on how to select LLMs for evaluation; lack of JSON support is the main culprit here.

We'll add docs for this shortly to make it easier for OSS LLM users to choose the right model for them.

cheers 🙂 ❤️
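
Until those docs land, a quick screening test along these lines can help pick a judge model: ask the candidate LLM for a small JSON object and check whether the reply parses. This sketch assumes any LangChain-compatible LLM (such as the HuggingFacePipeline or Ollama wrappers used above) and is only a heuristic, not part of ragas.

import json

def json_success_rate(llm, n_trials: int = 5) -> float:
    """Rough check of how often a candidate judge LLM emits valid JSON."""
    prompt = (
        'Reply with only a JSON object of the form '
        '{"verdict": 1, "reason": "<short reason>"} and nothing else.\n'
        "Statement: Paris is the capital of France."
    )
    successes = 0
    for _ in range(n_trials):
        reply = llm.invoke(prompt)
        # Chat models return a message object; plain LLMs return a string.
        text = reply if isinstance(reply, str) else getattr(reply, "content", str(reply))
        try:
            json.loads(text.strip())
            successes += 1
        except json.JSONDecodeError:
            pass
    return successes / n_trials

# e.g. print(json_success_rate(hf))  # using the HuggingFacePipeline defined earlier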

owerbai commented 2 months ago

Hey - thanks for sharing your feedback here. This feels like a documentation issue on how to select LLMs for evaluation; lack of JSON support is the main culprit here.

We'll add docs for this shortly to make it easier for OSS LLM users to choose the right model for them.

Cheers 🙂 ❤️

When I use a local model through Ollama, I intermittently hit the same nan error; using an OpenAI model instead is unbearably slow. Is there any way around this? I don't know why my OpenAI responses are so slow.
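
One knob that may help with slow or flaky judge LLMs is the evaluation run configuration. The sketch below assumes the installed ragas exposes RunConfig with these fields and that evaluate accepts a run_config argument (check your version; the import path is an assumption): a longer per-request timeout and fewer parallel requests reduce timeouts and rate pressure, at the cost of a longer total run.

from ragas import evaluate
from ragas.metrics import faithfulness
from ragas.run_config import RunConfig  # assumed import path; verify against your ragas version

# More patient settings for a slow judge LLM: longer timeout, a few retries,
# and limited concurrency.
run_config = RunConfig(timeout=300, max_retries=3, max_workers=2)

results = evaluate(
    dataset_eval,           # the dataset built earlier in this thread
    metrics=[faithfulness],
    llm=llm,
    embeddings=embedder,
    run_config=run_config,
)
print(results)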

parham-box commented 12 hours ago

Any news on this? I am facing the same problem with a local Llama 3.