Chatbot Arena style human evaluation

One idea is to build a larger benchmark that combines an automatic eval with a human eval in the style of Chatbot Arena: https://chat.lmsys.org/

Since Chatbot Arena is open source (https://github.com/lm-sys/FastChat), we could probably tweak its code a bit, translate the interface to Swedish, and make it clear that users should ask medical-domain questions in Swedish.

This would give us broad human-preference results for models answering in Swedish, which I think would be valuable.
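To make "human preference results" concrete: arena-style setups collect pairwise votes (model A vs. model B, with the user picking a winner or a tie) and turn them into a ranking, commonly with an Elo-style rating. Below is a minimal sketch of that idea, not FastChat's actual implementation. The vote format, the model names, and the `k` and `base` parameters are all assumptions chosen for illustration.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Compute Elo-style ratings from pairwise human votes.

    votes: iterable of (model_a, model_b, winner), where winner is
    "a", "b", or "tie". This tuple format is hypothetical.
    """
    ratings = defaultdict(lambda: base)
    outcome_to_score = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the standard Elo logistic curve.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = outcome_to_score[winner]
        # Zero-sum update: what model_a gains, model_b loses.
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


# Hypothetical votes on Swedish medical prompts (placeholder model names).
votes = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "a"),
]
ratings = elo_ratings(votes)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Sequential Elo depends on vote order, so for a final leaderboard it may be worth averaging over shuffled orderings or fitting a Bradley-Terry model instead.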

If this sounds like a good idea, I could add the code to a subfolder so we can build out a first version.

BirgerMoell / swedish-medical-benchmark