confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/

Custom dataset evaluation giving ValueError: Input, actual output, and retrieval context cannot be None #568

piseabhijeet opened this issue 8 months ago (Open)

piseabhijeet commented 8 months ago

When I create an evaluation dataset using EvaluationDataset and use it with LLMTestCase, I get this:

  File "/Users/abhijeet.pise/miniconda3/envs/evaluation/lib/python3.12/site-packages/deepeval/evaluate.py", line 174, in evaluate
    test_results = execute_test(test_cases, metrics, True)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/abhijeet.pise/miniconda3/envs/evaluation/lib/python3.12/site-packages/deepeval/evaluate.py", line 76, in execute_test
    metric.measure(test_case)
  File "/Users/abhijeet.pise/miniconda3/envs/evaluation/lib/python3.12/site-packages/deepeval/metrics/faithfulness/faithfulness.py", line 43, in measure
    raise ValueError(
ValueError: Input, actual output, and retrieval context cannot be None

I am sure my dataset has all the input, actual output, and retrieval context values.

penguine-ip commented 8 months ago

Hey @piseabhijeet, if you're using it with LLMTestCase then it seems like the retrieval context is missing for some of the test cases in your evaluation dataset. Can you try with just one test case in your evaluation dataset and gradually add more to debug this issue?
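Something like this minimal, fully populated test case should get past that check (the strings are placeholders, and the metric/threshold are just examples):

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

# Placeholder values; the point is that input, actual_output and
# retrieval_context are all non-None, which is what faithfulness requires.
test_case = LLMTestCase(
    input="What is the refund window?",
    actual_output="You can request a refund within 30 days.",
    retrieval_context=["Refunds are accepted within 30 days of purchase."],
)

evaluate([test_case], [FaithfulnessMetric(threshold=0.5)])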

piseabhijeet commented 8 months ago

Hi @penguine-ip

Actually, I tried running the same thing with two metrics, correctness and hallucination, and it works perfectly. But when I try to include the other metrics, like faithfulness and contextual similarity, it fails. This is weird.

If it was a missing-value issue, the two metrics wouldn't have run successfully.

penguine-ip commented 8 months ago

Oh I see, correctness and hallucination don't require retrieval_context, but the other metrics you mentioned do.

I assume you only provided context but not retrieval_context. Are you building something RAG-based? If you are, drop hallucination and switch to faithfulness altogether.
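To make the difference concrete, here is a rough sketch (placeholder strings; hallucination scores against context, faithfulness scores against retrieval_context):

from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

test_case = LLMTestCase(
    input="user query",                                  # placeholder
    actual_output="model answer",                        # placeholder
    context=["ground-truth document text"],              # what hallucination reads
    retrieval_context=["text your retriever returned"],  # what faithfulness reads
)

# For a RAG pipeline, faithfulness (based on retrieval_context) is the better fit
metric = FaithfulnessMetric(threshold=0.5)
metric.measure(test_case)
print(metric.score)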

piseabhijeet commented 8 months ago

I see, got your point! Thanks for the quick response.

Also, if a dataset is created inside a module and metrics are passed as a list, do we need to explicitly specify the number of workers? If yes, where do I set it?

I see '-n' as a parameter on the command line, but I was wondering if a similar parameter exists for dataset.evaluate.

piseabhijeet commented 8 months ago

Where are evaluation results stored? I see a temporary JSON file in the same folder, but it's not getting updated when running a bulk eval.

penguine-ip commented 8 months ago

@piseabhijeet You should be able to store results locally using these instructions: https://docs.confident-ai.com/docs/getting-started#create-your-first-test-case

Let me know if it doesn't work. There's always the option to log in to Confident AI too with this one-liner in the CLI: deepeval login

You can also do it in Python code: https://docs.confident-ai.com/docs/confident-ai-introduction (scroll to the bottom)
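If it helps, as far as I remember the getting-started page also lets you keep results on disk via an environment variable, roughly like this sketch (treat the exact variable name as an assumption and double-check the linked docs):

import os

# Assumed from the getting-started docs: when this variable is set, deepeval
# writes test-run results to the given folder instead of a temporary file.
os.environ["DEEPEVAL_RESULTS_FOLDER"] = "./deepeval-results"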

Not yet. Right now we execute each test case individually, and then for each test case we execute each metric individually (so it's like a double for loop).

We're going to release a big update tomorrow that allows you to run metrics concurrently on a test case (so a single for loop).

You don't have to specify workers for this, we'll handle everything under the hood with coroutines. If you are using a custom model (e.g., https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm), let me know because I'll need to help you support async execution.

The -n flag in deepeval test run allows you to run multiple test cases at once using multiple processes in parallel. We're not supporting multiprocessing for the evaluate function for now.

penguine-ip commented 8 months ago

@piseabhijeet Also, come join our Discord. There will be major updates over the weekend and any breaking changes will be announced there: https://discord.com/invite/a3K9c8GRGt

piseabhijeet commented 8 months ago

Thanks for the prompt response, @penguine-ip!

penguine-ip commented 8 months ago

Hey @piseabhijeet, updates are on Discord, but in short the evaluate function now runs all metrics concurrently for each test case: https://docs.confident-ai.com/docs/evaluation-test-cases#evaluate-test-cases-in-bulk

It's a big change, so please let me know if you catch any bugs!
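For reference, a minimal sketch of the bulk flow after this update (dataset here is an EvaluationDataset like the one discussed above; the metric choices and thresholds are placeholders):

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

metrics = [
    AnswerRelevancyMetric(threshold=0.5),
    FaithfulnessMetric(threshold=0.5),
]

# All metrics for each test case now run concurrently under the hood,
# so there is no worker count to configure here.
evaluate(dataset.test_cases, metrics)

Calling dataset.evaluate(metrics) should go through the same path if you prefer calling it on the dataset object.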

piseabhijeet commented 8 months ago

This is amazing! Looking forward to trying it this week!

piseabhijeet commented 8 months ago

[Screenshot omitted: 2024-03-10 at 15:22]

@penguine-ip I was trying to run it on my dataset with around 1300 query-response pairs. I could see that the number of API requests to the custom LLM was more than 1300. Can you explain why?

The code FYI:

# Imports assumed by the snippet below
# (AzureChatOpenAI is assumed to come from the langchain_openai package)
import openai
import pandas as pd
import pytest
from azure.identity import AzureCliCredential
from langchain_openai import AzureChatOpenAI

import deepeval
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, ContextualRelevancyMetric
from deepeval.models import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase

file_path = "query-response-context.json"

class AzureManager:
    def __init__(self, context):
        self.context = context
        self.openai_api_version = self.context['OPENAI_API_VERSION']
        self.azure_deployment = self.context['DEPLOYMENT_NAME']
        # self.azure_endpoint = self.context['azure_endpoint']
        self.model_name = self.context['MODEL_NAME']
        self.openai_api_type = self.context['OPENAI_API_TYPE']
        self.tenant_id = self.context['TENANT_ID']
        self.openai_scope = self.context['OPEN_AI_SCOPE']
        self.openai_api_base = self.context['OPENAI_API_BASE']

    def get_model_for_evaluation(self):
        openai.api_type = self.openai_api_type
        token_credential = AzureCliCredential(tenant_id=self.tenant_id)
        token = token_credential.get_token(self.openai_scope)
        openai.api_key = token.token
        custom_model = AzureChatOpenAI(
            deployment_name=self.azure_deployment,
            model_name=self.model_name,
            openai_api_key=openai.api_key,
            azure_endpoint=self.openai_api_base,
            openai_api_version=self.openai_api_version
        )

        return AzureOpenAI(model=custom_model)

class AzureOpenAI(DeepEvalBaseLLM):
    def __init__(
        self,
        model
    ):
        self.model = model

    def load_model(self):
        return self.model

    # def _call(self, prompt: str) -> str:
    #     chat_model = self.load_model()
    #     return chat_model.invoke(prompt).content

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom Azure OpenAI Model"

def create_dataset_for_llm_metrics(df):
    from deepeval.test_case import LLMTestCase
    from deepeval.dataset import EvaluationDataset

    def create_llm_test_case(row):
        # Normalize context into a list of strings, as LLMTestCase expects
        if not isinstance(row['context'], list):
            if row['context'] is None or row['context'] == "\n":
                row['context'] = ["Not available"]
            else:
                row['context'] = [str(row['context'])]

        test_case = LLMTestCase(
            input=row['jsonPayload.UserQuery'],
            actual_output=row['jsonPayload.Response'],
            expected_output="Not Applicable",
            retrieval_context=row['context'],
            context=row['context']
        )
        return test_case
    df['test_cases'] = df.apply(create_llm_test_case, axis=1)
    test_cases_list = df['test_cases'].values.tolist()

    dataset = EvaluationDataset(test_cases=test_cases_list)
    return dataset

df = pd.read_json(file_path)
dataset = create_dataset_for_llm_metrics(df)

# `params` is the Azure configuration dict, loaded elsewhere in my setup
azure_openai = AzureManager(params).get_model_for_evaluation()

@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    # faith_metric = FaithfulnessMetric(threshold=0.3, model=azure_openai)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5, model=azure_openai)
    # Note: instantiated but not currently passed to assert_test below
    context_relevance_metric = ContextualRelevancyMetric(threshold=0.7, model=azure_openai)
    assert_test(test_case, [answer_relevancy_metric])

@deepeval.on_test_run_end
def function_to_be_called_after_test_run():
    print("Test finished!")

piseabhijeet commented 8 months ago

Also, I notice that some of the test cases are missing from the .json output file and the code never stops executing. Last terminal output:

[Screenshot omitted: last terminal output, 2024-03-10 at 15:44]

penguine-ip commented 8 months ago

Hey! I have limited web access right now but I do have Discord available. Do you mind copying and pasting it there so I can help out ASAP? (I can't view the code or the photo.)

— Reply to this email directly, view it on GitHub https://github.com/confident-ai/deepeval/issues/568#issuecomment-1987172163, or unsubscribe https://github.com/notifications/unsubscribe-auth/BCFQK6ZO35MLJFYIMQXAPFTYXQXBXAVCNFSM6AAAAABEI2KR52VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOBXGE3TEMJWGM . You are receiving this because you were mentioned.Message ID: @.***>