One idea is to build a larger benchmark that combines an automatic eval with a human eval in the style of Chatbot Arena:
https://chat.lmsys.org/
Since Chatbot Arena is open source, we could probably adapt its code: translate the interface to Swedish and make it clear that users should ask their questions in Swedish and within the medical domain. The code lives in the FastChat repo:
https://github.com/lm-sys/FastChat
This would give us human preference results for Swedish to sit alongside the automatic eval, which I think would be nice.
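To make the human-eval part a bit more concrete: arena-style setups collect pairwise votes (two anonymous models answer the same prompt, the user picks a winner) and turn them into a ranking, commonly with an Elo-style rating. Below is a minimal sketch of that aggregation step. It is not taken from the FastChat code; the function name `compute_elo`, the vote format, the constants, and the model names are all assumptions for illustration.

```python
from collections import defaultdict


def compute_elo(battles, k=4.0, scale=400.0, base=10.0, init_rating=1000.0):
    """Turn pairwise votes into Elo-style ratings.

    battles: iterable of (model_a, model_b, winner) tuples, where winner is
    "model_a", "model_b", or "tie". This vote format is an assumption made
    for this sketch, not a documented FastChat schema.
    """
    ratings = defaultdict(lambda: init_rating)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a given the current rating gap.
        expected_a = 1.0 / (1.0 + base ** ((rb - ra) / scale))
        if winner == "model_a":
            score_a = 1.0
        elif winner == "model_b":
            score_a = 0.0
        else:
            score_a = 0.5  # a tie counts as half a win for each side
        # Symmetric update: what one model gains, the other loses.
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb - k * (score_a - expected_a)
    return dict(ratings)


# Hypothetical votes from Swedish medical prompts; model names are placeholders.
votes = [
    ("model-x", "model-y", "model_a"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "model_b"),
]
ratings = compute_elo(votes)
leaderboard = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
```

Sequential Elo depends on the order of the votes, so for a real leaderboard we would probably want to shuffle and average over several passes, or fit a Bradley-Terry model instead.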
If this sounds like a good idea, I could add the code to a subfolder and we can build out a first version.