evaluate_model.py contains a script for benchmarking a model. It can be run with any model from Hugging Face and reports how many of the benchmark questions the model answers correctly.
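A minimal sketch of what that evaluation loop might look like; the dataset path, field names, and substring-match scoring rule here are assumptions for illustration, not the exact code in evaluate_model.py:

```python
import json
from transformers import pipeline

def evaluate(model_name: str, data_path: str) -> float:
    # Any text-generation model hosted on Hugging Face can be plugged in here.
    generator = pipeline("text-generation", model=model_name)
    with open(data_path) as f:
        questions = json.load(f)  # assumed format: list of {"question": ..., "answer": ...}

    correct = 0
    for item in questions:
        output = generator(item["question"], max_new_tokens=64)[0]["generated_text"]
        # Assumed scoring rule: the reference answer must appear in the output.
        if item["answer"].lower() in output.lower():
            correct += 1
    return correct / len(questions)

if __name__ == "__main__":
    # Model name and data path are placeholders.
    accuracy = evaluate("gpt2", "benchmark_questions.json")
    print(f"Accuracy: {accuracy:.2%}")
```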
You could update the script so that it can be run against GPT-4, or simply ask GPT-4 the questions with a similar prompt. Since there are 100 questions, it is probably worth writing a script rather than doing it by hand.
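A hedged sketch of how the same loop could call GPT-4 through the OpenAI chat completions API (openai>=1.0) instead of a local model; the prompt and scoring rule mirror the assumptions above:

```python
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def evaluate_gpt4(data_path: str) -> float:
    with open(data_path) as f:
        questions = json.load(f)  # assumed format: list of {"question": ..., "answer": ...}

    correct = 0
    for item in questions:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": item["question"]}],
        )
        output = response.choices[0].message.content
        # Same assumed scoring rule as the local-model sketch above.
        if item["answer"].lower() in output.lower():
            correct += 1
    return correct / len(questions)
```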
There are two ways to evaluate: using the question alone, or using the question together with additional context. We currently evaluate with the question only, as sketched below.
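Illustrative only: one way the two evaluation modes could be expressed as a single prompt builder. The field name "context" is an assumption about the dataset.

```python
def build_prompt(item: dict, use_context: bool = False) -> str:
    if use_context and item.get("context"):
        # Question + additional context: prepend the context passage.
        return f"Context: {item['context']}\n\nQuestion: {item['question']}"
    # Question-only mode, which is what the benchmark currently uses.
    return item["question"]
```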
https://github.com/BirgerMoell/swedish-medical-benchmark/blob/main/evaluate_model.py