evaluate_model.py contains a script for benchmarking a model. It can be run with any model from Hugging Face and reports how many of the benchmark questions the model answers correctly.
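A minimal sketch of what that evaluation loop might look like; the dataset path, field names, and substring-match scoring rule here are assumptions for illustration, not the exact code in evaluate_model.py:

```python
import json
from transformers import pipeline

def evaluate(model_name: str, data_path: str) -> float:
    # Any text-generation model hosted on Hugging Face can be plugged in here.
    generator = pipeline("text-generation", model=model_name)
    with open(data_path) as f:
        questions = json.load(f)  # assumed format: list of {"question": ..., "answer": ...}

    correct = 0
    for item in questions:
        output = generator(item["question"], max_new_tokens=64)[0]["generated_text"]
        # Assumed scoring rule: the reference answer must appear in the output.
        if item["answer"].lower() in output.lower():
            correct += 1
    return correct / len(questions)

if __name__ == "__main__":
    # Model name and data path are placeholders.
    accuracy = evaluate("gpt2", "benchmark_questions.json")
    print(f"Accuracy: {accuracy:.2%}")
```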
You could update the script so that it can be run against GPT-4, or simply ask GPT-4 the questions with a similar prompt. Since there are 100 questions, it is probably worth writing a script rather than doing it by hand.
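A hedged sketch of how the same loop could call GPT-4 through the OpenAI chat completions API (openai>=1.0) instead of a local model; the prompt and scoring rule mirror the assumptions above:

```python
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def evaluate_gpt4(data_path: str) -> float:
    with open(data_path) as f:
        questions = json.load(f)  # assumed format: list of {"question": ..., "answer": ...}

    correct = 0
    for item in questions:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": item["question"]}],
        )
        output = response.choices[0].message.content
        # Same assumed scoring rule as the local-model sketch above.
        if item["answer"].lower() in output.lower():
            correct += 1
    return correct / len(questions)
```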
There are two ways to evaluate: using the question alone, or using the question together with additional context. We currently evaluate with the question only, as sketched below.
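Illustrative only: one way the two evaluation modes could be expressed as a single prompt builder. The field name "context" is an assumption about the dataset.

```python
def build_prompt(item: dict, use_context: bool = False) -> str:
    if use_context and item.get("context"):
        # Question + additional context: prepend the context passage.
        return f"Context: {item['context']}\n\nQuestion: {item['question']}"
    # Question-only mode, which is what the benchmark currently uses.
    return item["question"]
```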
https://github.com/BirgerMoell/swedish-medical-benchmark/blob/main/evaluate_model.py