BirgerMoell / swedish-medical-benchmark


Evaluate benchmarks from BioMistral to decide which ones are good to translate and use #1

Open BirgerMoell opened 3 months ago

BirgerMoell commented 3 months ago

https://huggingface.co/datasets/BioMistral/BioInstructQA

BioMistral uses the datasets shown in the screenshot in the original comment. Check which ones are good candidates to translate to Swedish and find the original location of the data.

northern-64bit commented 3 months ago

TL;DR

The comparison between different benchmarks for evaluating Large Language Models (LLMs) in the medical domain suggests prioritizing PubMedQA-L for its high-quality responses. Following this, validation sets from EN MedQA (USMLE) and then MedMCQA are recommended. Additionally, creating a new validation benchmark based on questions from Swedish medical exams is advised, leveraging tests from Umeå University (https://www.umu.se/utbildning/sok/kunskapsprov/kunskapsprov-for-lakare/teoretiskt-delprov/) as they are already in multiple-choice format and do not require translation.

Benchmark comparison

Method

I think a good approach is to check which benchmarks are used by similar medical LLMs, which is why I'll compare against the benchmarks used by the Meditron model. I also look at the BLURB benchmark, since it has been widely used.

Meditron

The Meditron model (https://huggingface.co/epfl-llm/meditron-7b, paper: https://arxiv.org/abs/2311.16079) used three benchmarks as its main ones: MedQA (USMLE), MedMCQA, and PubMedQA.

They used these for fine-tuning, with the instruction templates shown in a screenshot in the original comment.

They further used additional benchmarks for evaluation (see the paper for the full list).

BLURB

BLURB is a comprehensive biomedical NLP benchmark encompassing 13 datasets across 6 tasks, aiming to advance research in biomedical natural language processing, particularly in language model pretraining and universal biomedical NLP applications. It includes a variety of tasks: Named Entity Recognition (NER), PICO (Population, Interventions, Comparators, and Outcomes) extraction, Relation Extraction, Sentence Similarity, Document Classification, and Question Answering. The tasks and datasets focus on specific biomedical NLP challenges, including disease and gene mention recognition, chemical-disease relation extraction, drug-drug interaction detection, biomedical sentence similarity assessment, document classification according to cancer hallmarks, and biomedical question answering. Each dataset is evaluated using appropriate metrics, such as F1 scores for entity-level recognition, Macro/Micro F1 scores, Pearson correlation for sentence similarity, and accuracy for question answering. This benchmark offers a rich resource for developing and evaluating NLP models on a wide range of biomedical texts, facilitating a deeper understanding of language model performance in the biomedical domain. It has, for instance, been used in the following paper: https://www.biorxiv.org/content/10.1101/2023.04.19.537463v1.abstract

For us, the QA part is probably the most relevant. Here they use PubMedQA and BioASQ (https://microsoft.github.io/BLURB/tasks.html#page-top).

The BioASQ QA benchmark (the data can be found at http://participants-area.bioasq.org/datasets/ under Task B) contains four types of questions: yes/no, factoid, list, and summary questions.
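If we wanted to pull out just the yes/no questions for a translation effort, a minimal sketch is below. It assumes the Task B training JSON follows the publicly documented layout (a top-level "questions" list where each entry has a "type" and "body"); the filename is hypothetical and the field names should be verified against the actual download.

```python
import json

# Minimal sketch: extract yes/no questions from a BioASQ Task B JSON file.
# Assumes the documented layout: {"questions": [{"type": ..., "body": ..., ...}]}.
# The filename is hypothetical; verify against the download from participants-area.bioasq.org.
with open("training_taskb.json", encoding="utf-8") as f:
    task_b = json.load(f)

questions = task_b.get("questions", [])
yesno_questions = [q for q in questions if q.get("type") == "yesno"]
print(f"{len(yesno_questions)} yes/no questions out of {len(questions)}")
```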

Overview

The paper A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics (https://arxiv.org/abs/2310.05694) contains a table showing different LLMs and their evaluation methods/benchmarks, a figure illustrating the wide use of MedMCQA and PubMedQA, and several remarks on benchmark evaluation (see the screenshots in the original comment).

Conclusion

I recommend focusing on PubMedQA-L first, specifically the L (expert-labeled) part, due to the higher quality of its responses. Then one might move on to the validation sets of EN MedQA (USMLE) and then MedMCQA.

If we can create a new validation benchmark from Swedish medical exam questions, we definitely should: it requires no translation and would be a highly valuable addition, even if it takes some time to put together.

A good starting point might be the old tests from Umeå University, which are already in multiple-choice format: https://www.umu.se/utbildning/sok/kunskapsprov/kunskapsprov-for-lakare/teoretiskt-delprov/
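If we go down that route, a hypothetical item schema could look like the sketch below. The field names and the output filename are my suggestion, not an existing format in the repo; the idea is just to keep the exam items in the same multiple-choice spirit as MedQA/MedMCQA.

```python
import json

# Hypothetical schema for a Swedish medical-exam MCQ item.
# Field names and filename are suggestions, not an existing format in this repo.
example_item = {
    "id": "umu-ht-001",  # made-up identifier
    "source": "Kunskapsprov för läkare, teoretiskt delprov",
    "question": "Vilket av följande är förstahandsbehandling vid ...?",  # placeholder text
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": "B",
}

with open("swedish_exam_benchmark.json", "w", encoding="utf-8") as f:
    json.dump([example_item], f, ensure_ascii=False, indent=2)
```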

BirgerMoell commented 3 months ago

Great work!

I think starting out with PubMedQA-L sounds great.

I have started translating questions here, with 4 yes, 3 no, and 3 maybe answers.

https://github.com/BirgerMoell/swedish-medical-benchmark/blob/main/benchmarks/pubmedqa/data/ori_pqal_swe.json
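For anyone reproducing that subset, here is a rough sketch of how a small, label-balanced selection could be pulled from ori_pqal.json. It assumes the usual PubMedQA layout where the file maps PMIDs to entries with a "final_decision" field (worth double-checking against the file); the actual translation step is left as a placeholder.

```python
import json
from collections import defaultdict

# Sketch: pick a small yes/no/maybe-balanced subset from ori_pqal.json for translation.
# Assumes entries keyed by PMID with a "final_decision" field ("yes"/"no"/"maybe").
with open("benchmarks/pubmedqa/data/ori_pqal.json", encoding="utf-8") as f:
    pqal = json.load(f)

wanted = {"yes": 4, "no": 3, "maybe": 3}
picked, counts = {}, defaultdict(int)
for pmid, entry in pqal.items():
    label = entry.get("final_decision")
    if label in wanted and counts[label] < wanted[label]:
        picked[pmid] = entry
        counts[label] += 1

# Translation to Swedish (manual or via an MT service) would happen here before saving.
with open("benchmarks/pubmedqa/data/ori_pqal_swe.json", "w", encoding="utf-8") as f:
    json.dump(picked, f, ensure_ascii=False, indent=2)
```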

How is the L part of the benchmark different from the ones we translated?

One issue with the yes/no/maybe benchmark is that there is no way to know whether the model actually understands the question, but I guess that might not be a priority as a first step?

I think a good way to start would be to build a small benchmark and get a medical expert to check the questions to make sure they make sense in a Swedish context.

northern-64bit commented 3 months ago

How is the L part of the benchmark different from the ones we translated?

The L part is made up of ca. 1k expert-annotated instances. We can keep the ones we have already translated, but I think it would be good to prioritise the questions we can be 100% sure about first, and then continue with the rest. The other parts are the U part, which consists of ca. 61k unlabeled instances, and the A part, which consists of ca. 211k artificially generated QA instances.
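For reference, a quick sketch of how the three splits could be inspected locally is below. It assumes the file names used in the upstream pubmedqa repository (ori_pqal.json, ori_pqau.json, ori_pqaa.json); the U and A files are distributed separately and the paths may differ from ours.

```python
import json

# Sketch: count instances in the PubMedQA L/U/A splits.
# Assumes the upstream pubmedqa file names; adjust paths to wherever the data lives.
for name, path in [("L (labeled)", "data/ori_pqal.json"),
                   ("U (unlabeled)", "data/ori_pqau.json"),
                   ("A (artificial)", "data/ori_pqaa.json")]:
    with open(path, encoding="utf-8") as f:
        split = json.load(f)
    print(f"{name}: {len(split)} instances")
```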

One issue with the yes/no/maybe benchmark is that there is no way to know whether the model actually understands the question, but I guess that might not be a priority as a first step?

Agreed that it might not be the first step. We do need some test examples. If a model has trouble understanding a question, it will have a hard time being useful. We can also add a check by asking the LLM how it understood the question, and system prompting can further help each LLM interpret the questions correctly.
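As an illustration of that last point, a possible system prompt that both constrains the answer format and asks the model to restate its understanding might look like the sketch below. This is only a hypothetical prompt, not something agreed upon for the benchmark.

```python
# Hypothetical system prompt sketch (not an agreed-upon prompt for this benchmark):
# it constrains answers to ja/nej/kanske and asks the model to restate the question,
# which gives a rough signal on whether it understood it.
SYSTEM_PROMPT_SWE = (
    "Du är en medicinsk expert. Läs frågan och sammanhanget noggrant.\n"
    "1. Sammanfatta kort hur du förstår frågan.\n"
    "2. Svara sedan med exakt ett av orden: ja, nej eller kanske."
)
```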

I think a good way to start would be to build a small benchmark and get a medical expert to check the questions to make sure they make sense in a Swedish context.

This is a good idea! Otherwise we risk needing to reformulate everything at a later stage.

BirgerMoell commented 3 months ago

Is the one we are using from the L part of the dataset? https://raw.githubusercontent.com/BirgerMoell/swedish-medical-benchmark/main/benchmarks/pubmedqa/data/ori_pqal.json

I used the ones that have ground truth labels in the ground truth file. https://github.com/BirgerMoell/swedish-medical-benchmark/blob/main/benchmarks/pubmedqa/data/test_ground_truth.json
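A small sketch of that selection step is below, assuming the layouts in this repo: ori_pqal.json keyed by PMID and test_ground_truth.json mapping PMIDs to yes/no/maybe labels (worth verifying against the files).

```python
import json

# Sketch: keep only the L-part questions that also appear in the ground-truth file.
# Assumes ori_pqal.json is keyed by PMID and test_ground_truth.json maps PMIDs to labels.
with open("benchmarks/pubmedqa/data/ori_pqal.json", encoding="utf-8") as f:
    pqal = json.load(f)
with open("benchmarks/pubmedqa/data/test_ground_truth.json", encoding="utf-8") as f:
    ground_truth = json.load(f)

with_gt = {pmid: entry for pmid, entry in pqal.items() if pmid in ground_truth}
print(f"{len(with_gt)} of {len(pqal)} L-part questions have ground-truth labels")
```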

northern-64bit commented 3 months ago

Is the one we are using from the L part of the dataset? https://raw.githubusercontent.com/BirgerMoell/swedish-medical-benchmark/main/benchmarks/pubmedqa/data/ori_pqal.json

Yes, I just checked: the JSON file is called ori_pqal (it has an "l" at the end), which implies it is the L part, and it has 1000 questions, which only the L part has. I also checked the questions themselves, and they match the L part.

Only half of the questions in the L part are in the ground truth file, which is interesting, but we can still use the final_decision field, which we have for all instances (although it might not be perfect). That is the field they use to measure human performance (https://github.com/pubmedqa/pubmedqa/blob/master/get_human_performance.py). However, the evaluation code (https://github.com/pubmedqa/pubmedqa/blob/master/evaluation.py) only uses the ground truth, so we might want to focus on those questions at first.
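For the Swedish version, the scoring could mirror that evaluation approach. Here is a minimal sketch, assuming predictions are stored as a PMID-to-label JSON mapping like the ground truth file; the predictions filename is hypothetical, and this only computes accuracy rather than copying the upstream script.

```python
import json

# Minimal sketch of PubMedQA-style scoring: accuracy of predicted yes/no/maybe labels
# against test_ground_truth.json. Assumes both files map PMIDs to label strings.
with open("benchmarks/pubmedqa/data/test_ground_truth.json", encoding="utf-8") as f:
    ground_truth = json.load(f)
with open("predictions_swe.json", encoding="utf-8") as f:  # hypothetical model output file
    predictions = json.load(f)

correct = sum(predictions.get(pmid) == label for pmid, label in ground_truth.items())
accuracy = correct / len(ground_truth)
print(f"Accuracy: {accuracy:.3f} ({correct}/{len(ground_truth)})")
```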