We're looking into this now, as I've replicated with English as well.
@cpatrickalves While the SDK team is investigating the all-1 scores, I've extended this tool to support custom metrics and made those the default, based on the prompts used by the SDK. That way you can easily customize them and even localize them. See this PR:
https://github.com/Azure-Samples/ai-rag-chat-evaluator/pull/50
It should now be easier to add your own custom metrics as well.
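In case it helps anyone reading along, the core of a GPT-graded metric is just a rating prompt plus score parsing. Here's a minimal sketch assuming the openai Python client; the prompt wording and the rate_fluency helper are illustrative, not the exact code from the PR:

import re
from openai import AzureOpenAI

# Illustrative rating prompt; translating this text is how you would localize the metric.
FLUENCY_PROMPT = (
    "Fluency measures how grammatically and linguistically correct the answer is.\n"
    "Rate the answer from 1 (poor) to 5 (excellent). Respond with only the integer.\n\n"
    "question: {question}\nanswer: {answer}\nrating:"
)

def rate_fluency(client: AzureOpenAI, deployment: str, question: str, answer: str) -> int:
    response = client.chat.completions.create(
        model=deployment,
        temperature=0,
        messages=[{"role": "user", "content": FLUENCY_PROMPT.format(question=question, answer=answer)}],
    )
    # Models occasionally wrap the rating in extra text, so grab the first digit 1-5.
    match = re.search(r"[1-5]", response.choices[0].message.content)
    return int(match.group()) if match else 0  # 0 flags an unparseable reply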
That's awesome @pamelafox, I will test and compare the results using translated prompts and let you know how it goes.
Hi @pamelafox, I tested it with 200 inputs across 5 metrics (fluency, coherence, groundedness, relevance, and similarity). For every metric the score is either 1 or 5, never 2, 3, or 4.
@bhaskarturkar Was that using the custom local versions of those metrics, or the built-in ones? I have a fix coming for the built-in metrics. They do tend to be fairly bimodal, but you should get at least some 2-4s. I'll run a full evaluation with the fix to check the range.
Hi, we are using the built-in metrics (gpt_coherence, gpt_similarity, gpt_fluency, gpt_relevance, gpt_groundedness). For all the metrics we got either 1 or 5, not a single score between 2 and 4. This is the code we are testing with:
result = evaluate(
    evaluation_name="my-qa-eval-with-data",
    data=jsonl_data,  # evaluation dataset, defined earlier
    task_type="qa",
    metrics_list=["gpt_groundedness", "gpt_relevance", "gpt_coherence", "gpt_fluency", "gpt_similarity"],
    model_config={
        "api_version": "",    # redacted
        "api_base": "",       # redacted
        "api_type": "",       # redacted
        "api_key": "",        # redacted
        "deployment_id": "",  # redacted
    },
    data_mapping={
        "questions": "question",
        "contexts": "context",
        "answer": "answer",
        "ground_truth": "groundtruth",
    },
    output_path="./sampleresults",
)
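A quick way to verify the score spread is to tally the per-metric distribution from the results file. A rough sketch, assuming the run wrote per-row results as JSONL under output_path (the eval_results.jsonl filename is a guess; adjust it to whatever your run produced):

import json
from collections import Counter

metrics = ["gpt_groundedness", "gpt_relevance", "gpt_coherence", "gpt_fluency", "gpt_similarity"]
counts = {m: Counter() for m in metrics}

with open("./sampleresults/eval_results.jsonl") as f:
    for line in f:
        row = json.loads(line)
        for m in metrics:
            counts[m][row.get(m)] += 1

for m in metrics:
    # A healthy run should show some mass on 2-4, not just 1s and 5s.
    print(m, dict(sorted(counts[m].items(), key=lambda kv: str(kv[0]))))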
I just merged a fix for the latest version of the SDK: https://github.com/Azure-Samples/ai-rag-chat-evaluator/pull/52/files#diff-72effa77bf8138803cfdb75cd98249445fa04006826cd01b5ce76dd1ebbdfacf
I'm not sure if you're using this repo, but if you're on the latest version of the SDK, "questions" should be "question" and "contexts" should be "context" in the data_mapping dict.
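For reference, the corrected mapping from the snippet above would be (key names per the latest SDK; everything else stays the same):

data_mapping = {
    "question": "question",      # was "questions"
    "context": "context",        # was "contexts"
    "answer": "answer",
    "ground_truth": "groundtruth",
}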
Hi @pamelafox, I followed your suggestion (replaced "questions" with "question" and "contexts" with "context") and am now getting scores in the range of 1 to 5.
Thanks!
After the updates from https://github.com/Azure-Samples/ai-rag-chat-evaluator/pull/45 I reran the evaluation and got the lowest score for all metrics:
Before update (context omitted): [results screenshot]
After update: [results screenshot]
@pamelafox any idea what might be happening?