guild-ttreece opened 5 months ago
Hm, I haven't been able to replicate this issue yet.
Here's my config:
```json
{
    "testdata_path": "example_input/qa.jsonl",
    "results_dir": "example_results/experiment<TIMESTAMP>",
    "requested_metrics": ["gpt_relevance", "answer_length", "latency"],
    "target_url": "https://app-backend-j25rgqsibtmlo.azurewebsites.net/chat",
    "target_parameters": {
        "overrides": {
            "semantic_ranker": false,
            "prompt_template": "<READFILE>example_input/prompt_refined.txt"
        }
    }
}
```
You can actually try that config out since that particular URL is publicly available, to see if it works for you.
Can you share the full logs (with anything private redacted)?
Hi @pamelafox! Thank you for the response! I ran with your config example and had no issues; the problems only happen when I try to use any of the custom metrics with the jinja2 templates. With those, it does not calculate the metrics correctly and ends up dropping all of the results for any of the custom metrics. I was able to get past the KeyError, but now it basically just creates an empty column in the dataframe wherever I use a custom metric with a jinja2 template. I re-ran it a second time with the following config and ran into the KeyError again:
```json
{
    "testdata_path": "example_input/qa_test.jsonl",
    "results_dir": "example_results/experiment<TIMESTAMP>",
    "requested_metrics": ["gpt_relevance", "groundedness", "answer_length", "latency"],
    "target_url": "https://app-backend-j25rgqsibtmlo.azurewebsites.net/chat",
    "target_parameters": {
        "overrides": {
            "semantic_ranker": false,
            "prompt_template": "<READFILE>example_input/prompt_refined.txt"
        }
    }
}
```
Here is more of the traceback as well.
Which logs would help you the most?
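For reference, here is a minimal sketch of the failure mode I think I'm hitting: rendering a jinja2 prompt template against a row that lacks a variable the template expects. The template text and row fields here are my own assumptions for illustration, not the real `prompt_refined.txt` or the evaluator's actual code:

```python
# Sketch: a jinja2 prompt render failing on a missing variable.
# Template text and row shape are assumptions, not the evaluator's real code.
from jinja2 import Environment, StrictUndefined
from jinja2.exceptions import UndefinedError

template_src = "Question: {{ question }}\nAnswer: {{ answer }}\nContext: {{ context }}"
env = Environment(undefined=StrictUndefined)
template = env.from_string(template_src)

row = {"question": "What is my deductible?", "answer": "..."}  # no "context" key

try:
    rendered = template.render(**row)
except UndefinedError as err:
    # With StrictUndefined, the missing variable surfaces as an explicit error;
    # with jinja2's default Undefined, it silently renders as an empty string,
    # which could also explain metrics quietly coming back empty.
    print(f"render failed: {err}")
```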
Hi @pamelafox, first of all, thank you for providing this cool evaluation tool. Interestingly, my clients have also reported similar issues in the following environment:
- Australia East
- gpt-4 1106-Preview
Could it be related to differences in the evaluation metrics for supported scenarios, as mentioned in the following article? Ref: How to evaluate with Azure AI Studio and SDK
Hi @ks6088ts, good to know it isn't just me experiencing this! Have your users found any solution or workaround? I have tried lots of things but no luck yet.
Thanks!
I tried to replicate this again but still no luck. @guild-ttreece Are you definitely getting results for the non-custom metrics, like "gpt_relevance"? I'm wondering if all the GPT metrics are failing, but the custom metrics are failing more spectacularly. You can check the eval_results.jsonl file and CTRL+F for "gpt_relevance" to see the values that get recorded.
If it's all the GPT metrics, but the test call to GPT-4 works, then I guess it's possible that you're using a version of the model that doesn't support what's needed (like function calling), so it could be worth experimenting with different GPT-4 versions.
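One way to do that check programmatically rather than with CTRL+F -- a quick sketch, assuming each line of eval_results.jsonl is a JSON object with the metric as a top-level key (the helper name and path are hypothetical):

```python
import json
from collections import Counter

def summarize_metric(jsonl_lines, metric="gpt_relevance"):
    """Tally whether a metric was recorded, null, or absent in each result row."""
    counts = Counter()
    for line in jsonl_lines:
        row = json.loads(line)
        if metric not in row:
            counts["missing"] += 1
        elif row[metric] is None:
            counts["null"] += 1
        else:
            counts["recorded"] += 1
    return counts

# Hypothetical usage; point this at your own results directory:
# with open("example_results/experiment<TIMESTAMP>/eval_results.jsonl") as f:
#     print(summarize_metric(f))
```

If "recorded" is zero even for the built-in metrics, the GPT calls themselves are failing; if only the custom metrics show "null" or "missing", the problem is in the custom-metric path.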
Hi @pamelafox - I am only experiencing the error when using custom local metrics; with the built-in GPT metrics, values are recorded and calculated with no issues. As soon as I switch to any of the local metrics, the problems appear.
@guild-ttreece @pamelafox
Just for your info: according to my clients, the following settings worked properly:
- Australia East
- gpt-4-32k 0613
@ks6088ts - thank you for the information! I will try this on my side and see if we get any better results.
This issue is for a: (mark with an `x`)

Minimal steps to reproduce
Any log messages given by the failure
Expected/desired behavior
OS and Version?
Mention any other details that might be useful
I am using the same test data path, and the only change I am making is to use the local prompt via the Jinja2 templates. My goal is to customize the metrics to my use case and add additional metrics as well, but I encounter this issue whenever I attempt to use a non-built-in metric.
Here is an example of the metric response received:
```
'latency': 9.598966, 'relevance': None, 'answer_length': 615, 'gpt_coherence': 5, 'gpt_groundedness': 5, 'gpt_groundedness_reason': '(Failed)'
```
"Relevance" is the metric I changed to use the local prompt, and that is where the value goes wrong, although this has happened when attempting to use any of the local prompts.
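Using the row above, here is a tiny sketch of how I'm identifying which metrics came back empty (treating `None` and the `'(Failed)'` marker as failures is my interpretation, not the tool's definition):

```python
# Example row copied from the metric response above.
result = {
    "latency": 9.598966,
    "relevance": None,
    "answer_length": 615,
    "gpt_coherence": 5,
    "gpt_groundedness": 5,
    "gpt_groundedness_reason": "(Failed)",
}

# Treat None and the "(Failed)" marker as failed metrics (an assumption).
failed = sorted(k for k, v in result.items() if v is None or v == "(Failed)")
print(failed)  # ['gpt_groundedness_reason', 'relevance']
```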