Azure-Samples / ai-rag-chat-evaluator

Tools for evaluation of RAG Chat Apps using Azure AI Evaluate SDK and OpenAI
MIT License

Wrong computation of GPT metrics after Update #48

Closed: cpatrickalves closed this issue 4 months ago

cpatrickalves commented 4 months ago

After the updates from https://github.com/Azure-Samples/ai-rag-chat-evaluator/pull/45, I reran the evaluation and got the minimum score for all metrics:

Before update: (I've omitted the context)

        {
            "question": "Qual é o objetivo do Acordo de Cooperação Técnica mencionado na Resolução nº 19.423?",
            "answer": "O objetivo do Acordo de Cooperação Técnica mencionado na Resolução nº 19.423 é a melhoria da gestão das parcerias que envolvem colaboração mútua e interesse público recíproco, bem como promover e estimular ações de capacitação, comunicação e transparência.",
            "context": ...,
            "truth": "O objetivo do Acordo de Cooperação Técnica mencionado na Resolução é promover ações voltadas ao desenvolvimento do 'Projeto Sede de Aprender Nacional'.",
            "gpt_groundedness": 5,
            "gpt_relevance": 5,
            "gpt_coherence": 5
        },

After update:

        {
            "question": "Qual é o objetivo do Acordo de Cooperação Técnica mencionado na Resolução nº 19.423?",
            "truth": "O objetivo do Acordo de Cooperação Técnica mencionado na Resolução é promover ações voltadas ao desenvolvimento do 'Projeto Sede de Aprender Nacional'.",
            "latency": 1.905,
            "answer": "O objetivo do Acordo de Cooperação Técnica mencionado na Resolução nº 19.423 é a melhoria da gestão das parcerias que envolvem colaboração mútua e interesse público recíproco, bem como promover e estimular ações de capacitação, comunicação e transparência.",
            "context": "....",
            "answer_length": 256,
            "has_citation": false,
            "gpt_coherence": 1,
            "gpt_relevance": 1,
            "gpt_groundedness": 1
        },

@pamelafox any idea what might be happening?

pamelafox commented 4 months ago

We're looking into this now, as I've replicated it with English as well.

pamelafox commented 4 months ago

@cpatrickalves While the SDK team is investigating the 1s, I've extended this tool to support custom metrics and made those the default, based on the prompts used by the SDK. That way you can easily customize them and even localize them. See this PR:

https://github.com/Azure-Samples/ai-rag-chat-evaluator/pull/50

It should now be easier to add your own custom metrics as well.
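
For anyone curious how a prompt-based GPT metric works under the hood: the evaluator sends the question, answer, and context to a chat model together with a grading prompt and parses a 1-5 score out of the reply. The sketch below only illustrates that general pattern and is not the repo's actual implementation; the prompt wording, deployment name, and function name are assumptions.

# Illustrative sketch of a prompt-based "groundedness" metric, not the
# evaluator's real code. The prompt text, deployment name, and helper name
# are assumptions for demonstration only.
import os
import re
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

GROUNDEDNESS_PROMPT = (
    "Rate how well the ANSWER is grounded in the CONTEXT on a scale of 1 to 5, "
    "where 1 means not grounded at all and 5 means fully grounded. "
    "Reply with a single integer.\n\nCONTEXT:\n{context}\n\nANSWER:\n{answer}"
)

def gpt_groundedness(answer: str, context: str, deployment: str = "gpt-4") -> int:
    """Ask the model for a 1-5 grade and parse the first integer in the reply."""
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": GROUNDEDNESS_PROMPT.format(context=context, answer=answer)}],
        temperature=0,
    )
    match = re.search(r"[1-5]", response.choices[0].message.content or "")
    return int(match.group()) if match else 1  # fall back to the lowest score if parsing fails

Because the grading prompt is just text, localizing a metric (e.g. translating it to Portuguese) or adding a new one only requires editing the prompt.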

cpatrickalves commented 4 months ago

That's awesome @pamelafox, I will test and compare the results using translated prompts and let you know how it goes.

bhaskarturkar commented 4 months ago

> We're looking into this now, as I've replicated it with English as well.

Hi @pamelafox, I tested it with 200 inputs across 5 metrics (fluency, coherence, groundedness, relevance, and similarity). For every metric, the score is only ever 1 or 5, never 2, 3, or 4.

pamelafox commented 4 months ago

@bhaskarturkar Was that using the custom local versions of those metrics, or the built-in metrics? I have a fix coming for the built-in metrics. They do tend to be fairly bimodal, but you should get at least some 2-4s. I'll do a full evaluation with the fix to check the range.
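
If you want to check the spread yourself, one way is to tally how often each score appears in the evaluation output. This is just a minimal sketch, assuming the results are written as one JSON object per line with the gpt_* fields shown above; the eval_results.jsonl filename is a placeholder.

# Count how often each 1-5 score appears per metric in an evaluation output file.
# The filename and field names are assumptions based on the results shown above.
import json
from collections import Counter

metrics = ["gpt_groundedness", "gpt_relevance", "gpt_coherence", "gpt_fluency", "gpt_similarity"]
distributions = {metric: Counter() for metric in metrics}

with open("eval_results.jsonl", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        for metric in metrics:
            if metric in row:
                distributions[metric][row[metric]] += 1

for metric, counts in distributions.items():
    print(metric, dict(sorted(counts.items())))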

bhaskarturkar commented 4 months ago

Hi, we are using the built-in metrics (gpt_coherence, gpt_similarity, gpt_fluency, gpt_relevance, gpt_groundedness). For all of them we got either 1 or 5, not a single score between 2 and 4 for any metric. This is the code we are testing with:

# Import assumed from the azure-ai-generative preview SDK in use at the time;
# adjust if your package version differs.
from azure.ai.generative.evaluate import evaluate

result = evaluate(
    evaluation_name="my-qa-eval-with-data",
    data=jsonl_data,
    task_type="qa",
    metrics_list=["gpt_groundedness", "gpt_relevance", "gpt_coherence", "gpt_fluency", "gpt_similarity"],
    model_config={
        "api_version": "",
        "api_base": "",
        "api_type": "",
        "api_key": "",
        "deployment_id": "",
    },
    data_mapping={
        "questions": "question",
        "contexts": "context",
        "answer": "answer",
        "ground_truth": "groundtruth",
    },
    output_path="./sampleresults",
)

pamelafox commented 4 months ago

I just merged a fix for the latest version of the SDK: https://github.com/Azure-Samples/ai-rag-chat-evaluator/pull/52/files#diff-72effa77bf8138803cfdb75cd98249445fa04006826cd01b5ce76dd1ebbdfacf

I'm not sure if you're using this repo, but if you're using the latest version of the SDK, "questions" should be "question" and "contexts" should be "context" in the data_mapping dict.
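
Applied to the snippet above, the corrected mapping would look roughly like this (the values on the right stay whatever your data file's column names are):

# Corrected data_mapping per the comment above: singular "question" and "context"
# keys on the SDK side; the values are the column names from the earlier example.
data_mapping = {
    "question": "question",
    "context": "context",
    "answer": "answer",
    "ground_truth": "groundtruth",
}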

bhaskarturkar commented 4 months ago

Hi @pamelafox, I followed your suggestion (replaced "questions" with "question" and "contexts" with "context") and I'm now getting scores across the range of 1 to 5.

Thanks