We now have 100 files in the Swedish benchmark, which is a great start. However, the data is imbalanced, with 60 yes questions. This is an issue, especially since LLMs have a yes bias and will sometimes answer yes to everything.
We currently just translate with GPT-4, which I think is a good first step. Later on we can have a human check that the questions make sense in Swedish.
https://github.com/BirgerMoell/swedish-medical-benchmark/blob/main/benchmarks/pubmedqa/data/ori_pqal_swe.json
It would be great if we could balance out the benchmark by adding no and maybe questions.
Here you can find the ground-truth answers: https://github.com/BirgerMoell/swedish-medical-benchmark/blob/main/benchmarks/pubmedqa/data/test_ground_truth.json
And here are the files that can be translated:
https://github.com/BirgerMoell/swedish-medical-benchmark/blob/main/benchmarks/pubmedqa/data/ori_pqal.json
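To pick which questions to translate next, we could cross-reference the ground-truth labels against the already-translated Swedish file. A minimal sketch, assuming `test_ground_truth.json` maps question ids to `"yes"`/`"no"`/`"maybe"` and `ori_pqal_swe.json` is keyed by the same ids (in-memory stand-ins used here instead of `json.load()`):

```python
# Stand-ins for the repo files; real code would json.load() the two paths above.
ground_truth = {"q1": "yes", "q2": "no", "q3": "maybe", "q4": "no"}  # like test_ground_truth.json
translated = {"q1": {}, "q2": {}}  # ids already present in ori_pqal_swe.json

# Select untranslated questions whose label is "no" or "maybe",
# since those are the labels the benchmark is currently short on.
candidates = [
    qid for qid, label in ground_truth.items()
    if label in ("no", "maybe") and qid not in translated
]
print(candidates)  # ['q3', 'q4']
```

The same filter run on the real files would give the queue of questions to send through translation.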
We currently have 60 yes, 30 no, and 10 maybe questions.
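The label distribution can be recomputed at any time from the ground-truth file. A small sketch, assuming the same id-to-label shape as above (with an inline stand-in for the JSON):

```python
from collections import Counter

# In-memory stand-in for test_ground_truth.json; the real file is
# assumed to map question ids to "yes"/"no"/"maybe" labels.
ground_truth = {
    "q1": "yes", "q2": "yes", "q3": "no",
    "q4": "maybe", "q5": "yes", "q6": "no",
}

# Tally how many questions carry each answer label.
counts = Counter(ground_truth.values())
print(counts)  # Counter({'yes': 3, 'no': 2, 'maybe': 1})
```

Running this on the real file should reproduce the 60/30/10 split, and makes it easy to verify the balance after new questions are added.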