We now have 100 files in the Swedish benchmark, which is a great start. However, the data is imbalanced, with 60 yes questions. This is an issue, especially since LLMs have a yes bias and will sometimes answer yes to everything.
We currently just translate with GPT-4, which I think is a good first step. Later on we can have a human check that the questions make sense in Swedish.
https://github.com/BirgerMoell/swedish-medical-benchmark/blob/main/benchmarks/pubmedqa/data/ori_pqal_swe.json
It would be great if we could balance out the benchmark by adding no and maybe questions.
Here you can find the ground-truth answers: https://github.com/BirgerMoell/swedish-medical-benchmark/blob/main/benchmarks/pubmedqa/data/test_ground_truth.json
And here are the files that can be translated:
https://github.com/BirgerMoell/swedish-medical-benchmark/blob/main/benchmarks/pubmedqa/data/ori_pqal.json
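To pick which questions to translate next, we could cross-reference the ground-truth labels against the already-translated Swedish file. A minimal sketch, assuming `test_ground_truth.json` maps question ids to `"yes"`/`"no"`/`"maybe"` and `ori_pqal_swe.json` is keyed by the same ids (in-memory stand-ins used here instead of `json.load()`):

```python
# Stand-ins for the repo files; real code would json.load() the two paths above.
ground_truth = {"q1": "yes", "q2": "no", "q3": "maybe", "q4": "no"}  # like test_ground_truth.json
translated = {"q1": {}, "q2": {}}  # ids already present in ori_pqal_swe.json

# Select untranslated questions whose label is "no" or "maybe",
# since those are the labels the benchmark is currently short on.
candidates = [
    qid for qid, label in ground_truth.items()
    if label in ("no", "maybe") and qid not in translated
]
print(candidates)  # ['q3', 'q4']
```

The same filter run on the real files would give the queue of questions to send through translation.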
We currently have 60 yes, 30 no, and 10 maybe questions.
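The label distribution can be recomputed at any time from the ground-truth file. A small sketch, assuming the same id-to-label shape as above (with an inline stand-in for the JSON):

```python
from collections import Counter

# In-memory stand-in for test_ground_truth.json; the real file is
# assumed to map question ids to "yes"/"no"/"maybe" labels.
ground_truth = {
    "q1": "yes", "q2": "yes", "q3": "no",
    "q4": "maybe", "q5": "yes", "q6": "no",
}

# Tally how many questions carry each answer label.
counts = Counter(ground_truth.values())
print(counts)  # Counter({'yes': 3, 'no': 2, 'maybe': 1})
```

Running this on the real file should reproduce the 60/30/10 split, and makes it easy to verify the balance after new questions are added.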