Azure-Samples / ai-rag-chat-evaluator

Tools for evaluation of RAG Chat Apps using Azure AI Evaluate SDK and OpenAI

KeyError during evaluation when using the local prompt-based metrics - values returning as None #77

Open guild-ttreece opened 5 months ago

guild-ttreece commented 5 months ago

Please provide us with the following information:

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Change from built-in metrics to custom metrics listed in the scripts\evaluate_metrics\prompts folder. Modify example_config.json to contain:
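Something along these lines (only a sketch of the switch, not my exact file - "relevance" here is assumed to match one of the prompt folders under scripts\evaluate_metrics\prompts, and the target URL is a placeholder for whatever chat endpoint you are evaluating):

{
    "testdata_path": "example_input/qa.jsonl",
    "results_dir": "example_results/experiment<TIMESTAMP>",
    "requested_metrics": ["relevance", "gpt_coherence", "gpt_groundedness", "answer_length", "latency"],
    "target_url": "<YOUR_CHAT_APP_URL>/chat",
    "target_parameters": {
        "overrides": {
            "semantic_ranker": false
        }
    }
}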

Any log messages given by the failure

raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['relevance_score'], dtype='object')] are in the [columns]"
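For what it's worth, that is the error shape pandas raises when a column selection names a column that is not in the dataframe. A minimal sketch that reproduces the same message (assuming the tool summarizes the per-question results from a dataframe - I have not traced the exact code path):

import pandas as pd

# Results where the custom metric came back as None for every question,
# so no "relevance_score" column ever gets populated.
df = pd.DataFrame([{"latency": 9.598966, "answer_length": 615}])

# Selecting the missing column raises the same error:
# KeyError: "None of [Index(['relevance_score'], dtype='object')] are in the [columns]"
df[["relevance_score"]]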

Expected/desired behavior

The Jinja prompt templates to be used during evaluation, producing scores for the custom metrics just like the built-in ones.

OS and Version?

Windows 11

Versions

v22H2

Mention any other details that might be useful

I am using the same test data path, and the only change I am making is to use the local prompts via the Jinja2 templates. My goal is to customize the metrics for my use case and add additional metrics as well, but I encounter this issue whenever I attempt to use a non-built-in metric.

Here is an example of the metric response received:

'latency': 9.598966, 'relevance': None, 'answer_length': 615, 'gpt_coherence': 5, 'gpt_groundedness': 5, 'gpt_groundedness_reason': '(Failed)'

"Relevance" is the metric changed to use the local prompt that where it seems to have an issue with the value, although this has happened when attempting to utilize any of the local prompts.


Thanks! We'll be in touch soon.

pamelafox commented 5 months ago

Hm, I haven't been able to replicate this issue yet.

Here's my config:

{
    "testdata_path": "example_input/qa.jsonl",
    "results_dir": "example_results/experiment<TIMESTAMP>",
    "requested_metrics": ["gpt_relevance", "answer_length", "latency"],
    "target_url": "https://app-backend-j25rgqsibtmlo.azurewebsites.net/chat",
    "target_parameters": {
        "overrides": {
            "semantic_ranker": false,
            "prompt_template": "<READFILE>example_input/prompt_refined.txt"
        }
    }
}

You can try that config out yourself, since that particular URL is publicly available, to see if it works for you.

Can you share the full logs (with anything private redacted)?

guild-ttreece commented 5 months ago

Hi @pamelafox! Thank you for the response! I ran with your config example and had no issues; the problems only happen when I try to use any of the custom metrics with the Jinja2 templates. With those, it does not seem to calculate the metrics correctly and ends up dropping all of the results for any of the custom metrics. I was able to get past the KeyError, but now it is basically just creating an empty column in the dataframe wherever I use a custom metric with a Jinja2 template.

I re-ran it a second time with the following config and hit the KeyError again:

{
    "testdata_path": "example_input/qa_test.jsonl",
    "results_dir": "example_results/experiment<TIMESTAMP>",
    "requested_metrics": ["gpt_relevance", "groundedness", "answer_length", "latency"],
    "target_url": "https://app-backend-j25rgqsibtmlo.azurewebsites.net/chat",
    "target_parameters": {
        "overrides": {
            "semantic_ranker": false,
            "prompt_template": "<READFILE>example_input/prompt_refined.txt"
        }
    }
}

Here is more of the traceback as well.

keyerror_logs.txt

Which logs would help you the most?

ks6088ts commented 5 months ago

Hi @pamelafox, first of all, thank you for providing this cool evaluation tool. Interestingly, my clients have also reported similar issues under the following environment:

Could it be related to differences in the evaluation metrics for supported scenarios, as mentioned in the following article? Ref: How to evaluate with Azure AI Studio and SDK

guild-ttreece commented 5 months ago

Hi @ks6088ts, good to know it isn't just me experiencing this! Have your clients found any solution or workaround? I have tried loads of things, but no luck yet.

Thanks!

pamelafox commented 5 months ago

I tried to replicate this again but still no luck. @guild-ttreece Are you definitely getting results for the non-custom metrics, like "gpt_relevance"? I'm wondering if all the GPT metrics are failing, but the custom metrics are failing more spectacularly. You can check the eval_results.jsonl file and CTRL+F for "gpt_relevance" to see the values that get recorded.

If it's all the GPT metrics, but the test call to GPT-4 works, then I guess it's possible that you're using a version of the model that doesn't support what's needed (like function calling), so it could be worth experimenting with different GPT-4 versions.
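If it helps, a quick way to check is to scan that file for rows where the GPT metrics came back empty. A rough sketch (adjust the path to your results_dir, and the field name to whichever metric you want to check):

import json

# Count rows in eval_results.jsonl whose gpt_relevance value is missing or None.
path = "example_results/experiment<TIMESTAMP>/eval_results.jsonl"
with open(path) as f:
    rows = [json.loads(line) for line in f if line.strip()]

missing = [r for r in rows if r.get("gpt_relevance") is None]
print(f"{len(missing)} of {len(rows)} rows have no gpt_relevance value")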

guild-ttreece commented 5 months ago

Hi @pamelafox - I am only experiencing the error when trying to use the custom local metrics. With the built-in GPT metrics there are no issues: values are recorded and calculated correctly. As soon as I switch to any of the local metrics, I encounter the issues.
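My rough working theory (just a sketch, assuming the local prompt metrics pull an integer score out of the model's reply and fall back to None when that fails - I have not confirmed this is what the tool actually does) is that the completions for the custom prompts are not coming back in a parseable form:

import re

def parse_score(completion: str) -> int | None:
    # Hypothetical helper: pull a 1-5 score out of a reply, None if nothing parseable.
    match = re.search(r"\b([1-5])\b", completion)
    return int(match.group(1)) if match else None

print(parse_score("Sorry, I can't rate that."))                     # None -> empty relevance column
print(parse_score("5 - the answer fully addresses the question"))   # 5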

ks6088ts commented 4 months ago

@guild-ttreece @pamelafox

Just for your info: according to my clients, the following settings worked properly.

guild-ttreece commented 4 months ago

@ks6088ts - thank you for the information! I will try this on my side and see whether we get better results.