BirgerMoell commented 3 months ago

https://huggingface.co/datasets/BioMistral/BioInstructQA Screenshot 2024-04-03 at 22 32 34 BioMistral uses the following datasets. Check which one are good candidates to translate to Swedish and find the original location of the data.

[ ] ClinicalKG
[ ] Medical Genetics
[ ] Anatomy
[ ] Pro Medicine
[ ] College Biology
[ ] College Medicine
[ ] MedQA
[x] PubMedQA We started translating PubMedQA
[ ] MedMCQA

northern-64bit commented 3 months ago

TL;DR

The comparison between different benchmarks for evaluating Large Language Models (LLMs) in the medical domain suggests prioritizing PubMedQA-L for its high-quality responses. Following this, validation sets from EN MedQA (USMLE) and then MedMCQA are recommended. Additionally, creating a new validation benchmark based on questions from Swedish medical exams is advised, leveraging tests from Umeå University (https://www.umu.se/utbildning/sok/kunskapsprov/kunskapsprov-for-lakare/teoretiskt-delprov/) as they are already in multiple-choice format and do not require translation.

Benchmark comparison

Method

I think that a good approach is to check which models are used by similar LLM medical LLM:s, that's why I'll compare the benchmarks from the Meditron model. Further I check the BLURB benchmark, since it has been a widely used one.

Meditron

The Meditron model (https://huggingface.co/epfl-llm/meditron-7b, paper: https://arxiv.org/abs/2311.16079), used the following three benchmarks as their main ones and mentioned them as the major ones:

MedMCQA (https://huggingface.co/datasets/openlifescienceai/medmcqa) Summary: Comprising over 194,000 four-option multiple-choice questions from Indian medical entrance exams (AIIMS/NEET), MedMCQA covers 2,400 healthcare topics across 21 medical subjects. It includes 187,000 training samples and a validation set of 4,183 questions. The dataset's wide coverage makes it a comprehensive benchmark for evaluating models on a range of medical subjects, although it lacks publicly available answer keys for its test set.
MedQA (USMLE) (https://huggingface.co/datasets/bigbio/med_qa) Summary: This dataset consists of questions styled after the US Medical License Exam (USMLE), including various medical knowledge aspects like patient profiles, disease symptoms, and drug dosages. It's noted for its difficulty due to the contextual knowledge required to answer the questions. The dataset includes 10,178 training samples, 1,273 test and 1,273 validation questions, with each question offering either four or five possible answers. The dataset is designed to test medical Large Language Models (LLMs) without providing long explanatory answers but includes questions with human-written explanations for certain applications.
PubMedQA (https://huggingface.co/datasets/bigbio/pubmed_qa) Summary: A biomedical question answering (QA) dataset derived from PubMed abstracts, PubMedQA is designed to address research biomedical questions that can be answered with "yes," "no," or "maybe," utilizing the corresponding abstracts for context. The dataset comprises 1,000 expert-annotated instances (PQA-L), 61,200 unlabeled instances (PQA-U), and 211,300 artificially generated QA instances (PQA-A), offering a rich resource for training and evaluating models on their ability to interpret and answer complex biomedical questions based on abstracts.

These, they used for fine tuning with the following instructions:

Further they used:

MMLU-Medical (https://huggingface.co/datasets/lukaemon/mmlu)
MedQA-4-Option (https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options)

BLURB

BLURB is a comprehensive biomedical NLP benchmark encompassing 13 datasets across 6 tasks, aiming to advance research in biomedical natural language processing, particularly in language model pretraining and universal biomedical NLP applications. It includes a variety of tasks like Named Entity Recognition (NER), PICO (Population, Interventions, Comparators, and Outcomes) extraction, Relation Extraction, Sentence Similarity, Document Classification, and Question Answering. The tasks and datasets focus on specific biomedical NLP challenges, including disease and gene mention recognition, chemical-disease relation extraction, drug-drug interaction detection, biomedical sentence similarity assessment, document classification according to cancer hallmarks, and biomedical question answering. Each dataset is evaluated using appropriate metrics, such as F1 scores for entity-level recognition, Macro/Micro F1 scores, Pearson correlation for sentence similarity, and accuracy for question answering. This benchmark offers a rich resource for developing and evaluating NLP models on a wide range of biomedical texts, facilitating a deeper understanding of language model performance in the biomedical domain. It has been for instance been used in the following paper: https://www.biorxiv.org/content/10.1101/2023.04.19.537463v1.abstract

For us the QA part is probably the most relevant. Here they use the PubMedQA and BioASQ (https://microsoft.github.io/BLURB/tasks.html#page-top).

The BioASQ QA benchmark (the data can be found here: http://participants-area.bioasq.org/datasets/ under Task B) contain four types of questions:

Yes/no questions: These are questions that, strictly speaking, require "yes" or "no" answers, though of course in practice longer answers will often be desirable. For example, "Do CpG islands colocalise with transcription start sites?" is a yes/no question.
Factoid questions: These are questions that, strictly speaking, require a particular entity name (e.g., of a disease, drug, or gene), a number, or a similar short expression as an answer, though again a longer answer may be desirable in practice. For example, "Which virus is best known as the cause of infectious mononucleosis?" is a factoid question.
List questions: These are questions that, strictly speaking, require a list of entity names (e.g., a list of gene names), numbers, or similar short expressions as an answer; again, in practice additional information may be desirable. For example, "Which are the Raf kinase inhibitors?" is a list question.
Summary questions: These are questions that do not belong in any of the previous categories and can only be answered by producing a short text summarizing the most prominent relevant information. For example, "What is the treatment of infectious mononucleosis?" is a summary question.

Overview

Here is a table from A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics (https://arxiv.org/abs/2310.05694), showing different LLM:s and their evaluation methods/benchmarks: Another graph from the paper above further shows the wide use of MedMCQA and PubMedQA: It also raises the following remarks on benchmark evaluations:

Conclusion

I recommend focusing on PubMedQA-L first. Specifically L, due to the higher quality of the responses. Then one might go over to the validation sets of EN MedQA (USMLE) and then MedMCQA.

If we can create a new validation benchmark of questions from Swedish medical exams, we definitely should, since it would be great to add. We also don't need to do any translations for this. This may take some time but this would be highly valuable.

A good start might be with old tests from Umeå university, which are multiple choices: https://www.umu.se/utbildning/sok/kunskapsprov/kunskapsprov-for-lakare/teoretiskt-delprov/

BirgerMoell commented 3 months ago

Great work!

I think starting out with PubMEDQA-L sounds great.

I started out translating sentences here with 4 yes, 3 no and 3 maybes.

https://github.com/BirgerMoell/swedish-medical-benchmark/blob/main/benchmarks/pubmedqa/data/ori_pqal_swe.json

How are the L part of the benchmark different from the ones we translated?

One issue with the yes / no / maybe benchmark is that there is no way to know if the model understands, but I guess that might not be a priority as a first step?

I think one thing to start out would be to have a small benchmark and see if we can get a medical expert to check these to see that they make sense in a Swedish context.

northern-64bit commented 3 months ago

How are the L part of the benchmark different from the ones we translated?

The L part is made up of ca. 1k expert-annotated instances. We can keep the ones that we already have translated, but I think that it would be good to prioritise the best questions so that we can be 100% sure on and then continue with the rest. The other parts are U part which consists of ca. 61k unlabeled instances and the A part which consists of ca. 211k artificially generated QA instances.

One issue with the yes / no / maybe benchmark is that there is no way to know if the model understands, but I guess that might not be a priority as a first step?

Agreed, that it might not be the first step. We need to some test examples. If a model has an issue with understanding the question, it would have a hard time to be useful. Plus we can make a check by asking what the LLM how it understood the question. System prompting can further help to make each LLM understand the questions correctly.

I think one thing to start out would be to have a small benchmark and see if we can get a medical expert to check these to see that they make sense in a Swedish context.

This is a good idea! Otherwise we risk needing to reformulate everything at a later stage.

BirgerMoell commented 3 months ago

Is the one we are using from the L part of the dataset? https://raw.githubusercontent.com/BirgerMoell/swedish-medical-benchmark/main/benchmarks/pubmedqa/data/ori_pqal.json

I used the ones where there is ground truth from the ground truth file. https://github.com/BirgerMoell/swedish-medical-benchmark/blob/main/benchmarks/pubmedqa/data/test_ground_truth.json

northern-64bit commented 3 months ago

Is the one we are using from the L part of the dataset? https://raw.githubusercontent.com/BirgerMoell/swedish-medical-benchmark/main/benchmarks/pubmedqa/data/ori_pqal.json

Yes, I just checked and the json file is called ori_pqal (it has an L at the end) implies that is the L part and it has 1000 questions, which only the L part has. I further checked the questions and it is the L part.

Only half of the questions in the L-part are in the ground truth file, which is interesting, but we still can use the final_decision part which we have for all instances (although it might not be perfect). It is the part which they use to measure human performance (https://github.com/pubmedqa/pubmedqa/blob/master/get_human_performance.py). However the evaluations code (https://github.com/pubmedqa/pubmedqa/blob/master/evaluation.py) only uses the ground truth, so we might want to focus on those questions at first.

BirgerMoell / swedish-medical-benchmark

Evaluate benchmarks from BioMistral to decide which ones are good to translate use #1

TL;DR

Benchmark comparison

Method

Meditron

BLURB

Overview

Conclusion