Calculating overall metrics occasionally fails with error on return int(rating) >= 4

dfberry commented 9 months ago

The console command was:

python3 -m scripts evaluate --config=example_config.json --numquestions=14

The output and error were:

2024-01-13 00:04:04 (INFO) scripts: Running evaluation from config /workspaces/ai-rag-chat-evaluator/example_config.json
2024-01-13 00:04:04 (INFO) scripts: Replaced results_dir in config with timestamp
2024-01-13 00:04:04 (INFO) scripts: Replaced prompt_template in config with contents of example_input/prompt_refined.txt
2024-01-13 00:04:05 (INFO) azure.identity._internal.decorators: AzureDeveloperCliCredential.get_token succeeded
2024-01-13 00:04:05 (INFO) scripts: Running evaluation using data from /workspaces/ai-rag-chat-evaluator/example_input/qa-2.jsonl
2024-01-13 00:04:05 (INFO) scripts: Limiting evaluation to 14 questions
2024-01-13 00:04:05 (INFO) scripts: Sending a test question to the target to ensure it is running...
2024-01-13 00:04:28 (INFO) scripts: Starting evaluation...
Fail writing properties '{'_azureml.evaluation_run': 'azure-ai-generative-parent'}' to run history: 'FileStore' object has no attribute 'get_host_creds'
2024-01-13 00:04:28 (WARNING) azureml-metrics: Computing metrics for {'gpt_relevance', 'gpt_groundedness', 'gpt_coherence'} as y_test is None.
2024-01-13 00:04:28 (INFO) azureml-metrics: Setting max_concurrent_requests to 4 for computing GPT based question answering metrics
2024-01-13 00:04:28 (INFO) azureml-metrics: [azureml-metrics] ActivityStarted: compute_metrics-qa, ActivityType: ComputeMetrics, CustomDimensions: {'app_name': 'azureml-metrics', 'task_type': 'qa', 'azureml_metrics_run_id': 'c4410548-31bf-4409-a474-f327248fd0ca', 'current_timestamp': '2024-01-13 00:04:28'}
2024-01-13 00:04:28 (WARNING) azureml.metrics.text.qa.azureml_qa_metrics: LLM related metrics need llm_params to be computed. Computing metrics for ['gpt_relevance', 'gpt_groundedness', 'gpt_coherence']
2024-01-13 00:04:28 (INFO) azureml.metrics.common._validation: QA metrics debug: {'y_test_length': 0, 'y_pred_length': 14, 'tokenizer_example_output': 'the quick brown fox jumped over the lazy dog', 'regexes_to_ignore': '', 'ignore_case': False, 'ignore_punctuation': False, 'ignore_numbers': False}
2024-01-13 00:04:28 (WARNING) azureml.metrics.text.qa._seq2seq_qa: gpt_relevance metric is not applicable as it needs question and context for every example.
2024-01-13 00:04:28 (ERROR) azureml.metrics.rai.groundedness._groundedness_base: Contexts not provided.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:05<00:00,  2.46it/s]
/home/vscode/.local/lib/python3.11/site-packages/azureml/metrics/common/utilities.py:293: RuntimeWarning: Mean of empty slice
  metrics_result[constants.Metric.Metrics][mean_metric_name] = np.nanmean(metric_value)
/home/vscode/.local/lib/python3.11/site-packages/azureml/metrics/common/utilities.py:294: RuntimeWarning: All-NaN slice encountered
  metrics_result[constants.Metric.Metrics][median_metric_name] = np.nanmedian(metric_value)
2024-01-13 00:04:34 (INFO) azureml-metrics: [azureml-metrics] ActivityCompleted: Activity=compute_metrics-qa, HowEnded=SUCCESS, Duration=5735.52[ms]
Fail writing properties '{'_azureml.evaluate_artifacts': '[{"path": "eval_results.jsonl", "type": "table"}]'}' to run history: 'FileStore' object has no attribute 'get_host_creds'
2024-01-13 00:04:34 (INFO) scripts: Evaluation calls have completed. Calculating overall metrics now...
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/workspaces/ai-rag-chat-evaluator/scripts/__main__.py", line 6, in <module>
    app()
  File "/workspaces/ai-rag-chat-evaluator/scripts/cli.py", line 27, in evaluate
    run_evaluate_from_config(Path.cwd(), config, numquestions)
  File "/workspaces/ai-rag-chat-evaluator/scripts/evaluate.py", line 175, in run_evaluate_from_config
    run_evaluation(
  File "/workspaces/ai-rag-chat-evaluator/scripts/evaluate.py", line 117, in run_evaluation
    if passes_threshold(question_with_rating[metric_name]):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/ai-rag-chat-evaluator/scripts/evaluate.py", line 109, in passes_threshold
    return int(rating) >= 4
           ^^^^^^^^^^^
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

eval_results.jsonl is

{"question":"What is the focus of the Financial Market Analysis Report 2023 prepared by Contoso Financial Analytics?","answer":"The Financial Market Analysis Report 2023 prepared by Contoso Financial Analytics provides a deep dive into the financial markets of 2023, focusing on the trends and fluctuations within stocks, cryptocurrencies, and commodities. It covers historical patterns, current market conditions, and future predictions.\n[Financial Market Analysis Report 2023.pdf#page=1]","gpt_relevance":null,"gpt_groundedness":null,"gpt_coherence":5}
{"question":"What does the global financial market consist of according to the Financial Market Analysis Report 2023?","answer":"According to the Financial Market Analysis Report 2023, the global financial market is a vast and intricate network of exchanges, instruments, and assets, ranging from traditional stocks and bonds to modern cryptocurrencies and commodities.[Financial Market Analysis Report 2023.pdf#page=1]","gpt_relevance":null,"gpt_groundedness":null,"gpt_coherence":5}
{"question":"What is the Northwind Standard plan offered by Contoso in partnership with Northwind Health and what does it provide?","answer":"The Northwind Standard plan is a health insurance plan offered by Contoso in partnership with Northwind Health. This plan provides a balance billing protection, meaning that you are protected from unexpected costs when visiting in-network providers.\n[Northwind_Standard_Benefits_Details.pdf#page=7]","gpt_relevance":null,"gpt_groundedness":null,"gpt_coherence":5}
{"question":"What is balance billing and how does the Northwind Standard plan protect against it? What is an in-network provider and what happens if you visit an out-of-network provider?","answer":"Balance billing is when a provider bills you for the difference between what they charge and what your insurance company is willing to pay. This difference can be hundreds of dollars and is often unexpected. The Northwind Standard plan protects you from balance billing when visiting in-network providers. In-network providers are healthcare providers, such as doctors, hospitals, and labs, that have agreed to accept the Northwind Standard plan's payment terms. Visiting an in-network provider means that you are protected from balance billing. If you visit an out-of-network provider, you may not be protected from balance billing.[Northwind_Standard_Benefits_Details.pdf#page=7]","gpt_relevance":null,"gpt_groundedness":null,"gpt_coherence":5}
{"question":"What is the difference between in-network and out-of-network providers in the context of Northwind Health?","answer":"In-network providers have agreed to accept the amount of payment offered by Northwind Health and you pay the portion of the cost that is not covered by the insurance plan. They may also offer additional services or discounts that are not available to out-of-network providers. Out-of-network providers, on the other hand, have not contracted with Northwind Health and are not required to accept the amount of payment offered by Northwind Health. This means you may be responsible for a greater portion of the cost and they may not offer additional services or discounts that are available to in-network providers.\n[Northwind_Standard_Benefits_Details.pdf#page=8]","gpt_relevance":null,"gpt_groundedness":null,"gpt_coherence":5}
{"question":"What should one consider when choosing a provider in the context of Northwind Health?","answer":"When choosing a provider, it is important to make sure that the provider is in-network. While it is possible to receive care from out-of-network providers, it is important to understand that you will be responsible for a greater portion of the costs. To make sure that you are getting the best value for your health care expenses, it is recommended that you choose an in-network provider whenever possible.[Northwind_Standard_Benefits_Details.pdf#page=8]","gpt_relevance":null,"gpt_groundedness":null,"gpt_coherence":5}
{"question":"What is the financial implication of selecting an in-network provider under Northwind Standard plan?","answer":"When selecting an in-network provider, you will likely only have to pay a portion of the cost for services, often referred to as a \"co-pay\" or \"co-insurance\". Different types of services may require different co-pays or co-insurance, so it's best to contact your provider to understand what the cost will be.\n[Northwind_Standard_Benefits_Details.pdf#page=10]","gpt_relevance":null,"gpt_groundedness":null,"gpt_coherence":5}
{"question":"What should one expect when choosing an out-of-network provider or services not covered under the Northwind Standard plan?","answer":"If you choose an out-of-network provider, you may be responsible for the entire cost of services, or a far greater portion of the cost than you would with an in-network provider. In some cases, Northwind Standard may offer coverage for out-of-network providers, but you may still have to pay more than you would for an in-network provider. Additionally, some providers may offer services that are outside of the Northwind Standard plan coverage. Depending on the services provided, you may have to pay out-of-pocket for services that are not covered.[Northwind_Standard_Benefits_Details.pdf#page=10]","gpt_relevance":null,"gpt_groundedness":null,"gpt_coherence":5}
{"question":"What are the limitations of the Northwind Standard plan in relation to clinical trials?","answer":"The Northwind Standard plan does not cover travel expenses associated with attending clinical trials. Additionally, any experimental treatments or services that are not part of the clinical trial are not covered.\n[Northwind_Standard_Benefits_Details.pdf#page=24]","gpt_relevance":null,"gpt_groundedness":null,"gpt_coherence":5}
{"question":"What should members consider before participating in a clinical trial and how can Northwind Health assist?","answer":"Before participating in a clinical trial, members should consult with their doctor to make sure it's the right choice for them, understand the risks and benefits, be aware that not all clinical trials are free and some may require a fee. They should also note that clinical trials are often available in limited locations, so they may have to travel. Northwind Health can provide information about nearby clinical trials and may be able to help with the cost of travel.[Northwind_Standard_Benefits_Details.pdf#page=24]","gpt_relevance":null,"gpt_groundedness":null,"gpt_coherence":5}
{"question":"What are the gender affirming care services offered by Contoso through Northwind Standard?","answer":"Contoso offers coverage through Northwind Standard for a variety of gender affirming care services related to gender transition. These services may include hormone therapy, gender affirming surgeries, and mental health care related to gender transition.\n[Northwind_Standard_Benefits_Details.pdf#page=29]","gpt_relevance":null,"gpt_groundedness":null,"gpt_coherence":5}
{"question":"Can you explain the different types of gender affirming treatments and services offered by Contoso?","answer":"Contoso offers three main types of gender affirming treatments and services. Hormone therapy is a treatment that involves taking hormones typically associated with a certain gender to help an individual's body better reflect their gender identity. Gender affirming surgery involves surgical procedures to help an individual transition to the gender they identify with. Mental health care related to gender transition can include counseling and therapy services to help an individual through their transition process.[Northwind_Standard_Benefits_Details.pdf#page=29]","gpt_relevance":null,"gpt_groundedness":null,"gpt_coherence":5}
{"question":"What is the company's stance on providing coverage for hearing care services to its employees?","answer":"Contoso is committed to providing comprehensive coverage for hearing care services to its employees through the Northwind Standard plan. This includes coverage for all types of hearing care services, such as hearing tests, hearing aids and accessories, cochlear implants, and bone-anchored hearing aids.\n[Northwind_Standard_Benefits_Details.pdf#page=30]","gpt_relevance":null,"gpt_groundedness":null,"gpt_coherence":5}
{"question":"What specific services and items are covered under the Northwind Standard plan for hearing care, and what are the limits on these coverages?","answer":"The Northwind Standard plan covers all types of hearing care services, including hearing tests, hearing aids and hearing aid accessories, as well as cochlear implants and bone-anchored hearing aids. If a hearing aid or cochlear implant is needed, the plan covers up to two hearing aids or implants per ear every three years. The plan also covers routine hearing aid maintenance and repair, as well as counseling and hearing aid evaluations. Additionally, the plan covers up to $500 per ear for hearing aid accessories, such as ear molds, batteries, and tubing.[Northwind_Standard_Benefits_Details.pdf#page=30]","gpt_relevance":null,"gpt_groundedness":null,"gpt_coherence":5}

pamelafox commented 9 months ago

Please change line 44 in generate.py to:

qa.append({"question": question, "truth": answer + citation})

pamelafox commented 9 months ago

Merged fix for this, thanks for the report!

dfberry commented 9 months ago

I'm still having problems with this. If I print out the value, it says None. Should the fn take that into account?

def passes_threshold(rating):
    if rating is None:
        return False
    return int(rating) >= 4

pamelafox commented 9 months ago

Okay, it didn't calculate a metric for some reason. Can you share the logs?

I can take it into account, but it usually indicates an error somewhere earlier in the script, so I'd want to put a helpful message for debugging.

pamelafox commented 9 months ago

When you get this running again, please share the full output from your session so I can try to pinpoint why you'd have a None value.

dfberry commented 9 months ago

The full output is at the top of the issue. Can you see if there is something there that can help pin down the issue or give me the next step? @pamelafox

pamelafox commented 9 months ago

Ah okay so that looks like the error from before, when generate created example ground truth data which contained "answer" for the column name. Can you check qa-2.jsonl and make sure the column is named "truth"?
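
A quick way to run that check is sketched below. It is only an illustration, not part of the repository: the helper name `find_rows_missing_key` is hypothetical, and it assumes the ground truth file is JSONL with one object per line.

```python
import json


def find_rows_missing_key(path, required_key="truth"):
    """Return the 1-based line numbers of JSONL rows that lack required_key."""
    missing = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            row = json.loads(line)
            if required_key not in row:
                missing.append(line_number)
    return missing
```

Calling `find_rows_missing_key("example_input/qa-2.jsonl")` should return an empty list once every row uses "truth" instead of "answer".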

dfberry commented 9 months ago

OK, it's working now with that change, and all the ratings are returning. I'll close this.

cpatrickalves commented 9 months ago

I am getting the same issue: sometimes the rating is None. I did not understand how you fixed it, @dfberry.

I've added a print to make sure the keys are correct:

Note that this occurs only occasionally, but the bigger the dataset, the more often it happens. @pamelafox, it seems to be an issue in the metrics computation (by GPT-4). I hope that this is not related to the use of a language other than English :(

return int(rating) >= 4
           ^^^^^^^^^^^
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

I'm wondering if replacing None with the minimum value (1) would be a valid solution.
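
To make the trade-off concrete, here is a small sketch with made-up ratings (not data from this issue). It assumes the pass threshold of 4 from `passes_threshold`. Substituting 1 for None gives the same pass count as treating None as a failure, but it pulls the mean rating down as if the model had actually scored those answers poorly.

```python
from statistics import mean

# Hypothetical ratings for one metric; None means no rating was returned.
ratings = [5, None, 4, None]

# Option A: replace None with the minimum rating (1).
imputed = [1 if r is None else r for r in ratings]
# Option B: leave missing ratings out of the average.
rated_only = [r for r in ratings if r is not None]

mean_imputed = mean(imputed)   # 2.75
mean_rated = mean(rated_only)  # 4.5

# The pass count is the same either way, since 1 is below the threshold.
passes_imputed = sum(r >= 4 for r in imputed)
passes_none_as_fail = sum(r is not None and r >= 4 for r in ratings)
```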

cpatrickalves commented 9 months ago

During my tests, this seems to happen only with the GPT Relevance metric, and on different QA pairs each time (sometimes a pair gets a rating, sometimes it gets None).

cpatrickalves commented 9 months ago

For now I've just changed the code to:

    def passes_threshold(rating):
        if rating is None:
            return False
        return int(rating) >= 4

pamelafox commented 9 months ago

Did you notice anything in the logs about the rate limit being exceeded? That can happen with your GPT-4 instance. I should probably make the change you have there, but add a warning about missing data, and count up how much missing data there is.
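
A minimal sketch of that idea follows. It is not the code that was merged into the repository: `summarize_metric` and the shape of the returned dictionary are assumptions for illustration, while the threshold of 4 and the None check come from the snippets above.

```python
import logging

logger = logging.getLogger("scripts")


def passes_threshold(rating):
    """Return True for ratings of 4 or higher; a missing rating never passes."""
    if rating is None:
        return False
    return int(rating) >= 4


def summarize_metric(rows, metric_name):
    """Count passing, rated, and missing ratings for one metric (hypothetical helper)."""
    ratings = [row.get(metric_name) for row in rows]
    missing_count = sum(1 for r in ratings if r is None)
    if missing_count:
        # Surface the gap instead of silently treating it as a failure.
        logger.warning(
            "%d of %d rows have no %s rating", missing_count, len(ratings), metric_name
        )
    return {
        "pass_count": sum(1 for r in ratings if passes_threshold(r)),
        "rated_count": len(ratings) - missing_count,
        "missing_count": missing_count,
    }
```

Reporting `missing_count` next to `pass_count` keeps a run with many unrated answers from looking like a run with many bad answers.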

cpatrickalves commented 9 months ago

> Did you notice anything in the logs about the rate limit being exceeded? That can happen with your GPT-4 instance.

No, I did not.

> I should probably make the change you have there, but add a warning about missing data, and count up how much missing data there is.

That should work.

Another option would be to add a retry for the QA pairs that get None as the metric. For example: finish the current run, check whether any metric is None, and retry only those pairs until no None values remain.
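
The retry idea could look roughly like the sketch below. Everything here is hypothetical: `retry_missing` is not a function in the repository, and `rate_fn` stands in for whatever call produces a rating for a single row. The sketch caps the number of passes with `max_attempts` rather than looping until every None is gone, so a row that can never be rated does not cause an endless loop.

```python
def retry_missing(rows, metric_name, rate_fn, max_attempts=3):
    """Re-rate only the rows whose metric is None, for up to max_attempts passes.

    rate_fn is a hypothetical callable that takes a row and returns a rating
    or None. Returns the number of rows that are still unrated afterwards.
    """
    for _ in range(max_attempts):
        pending = [row for row in rows if row.get(metric_name) is None]
        if not pending:
            break
        for row in pending:
            row[metric_name] = rate_fn(row)
    return sum(1 for row in rows if row.get(metric_name) is None)
```

Rows that are still None after the last pass can then be reported as missing data instead of being retried forever.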

Azure-Samples / ai-rag-chat-evaluator

Calculating overall metrics occasionally fails with error on return int(rating) >= 4 #10