EQ-bench / EQ-Bench

A benchmark for emotional intelligence in large language models
MIT License

added support for additional language (de) #12

Closed CrispStrobe closed 6 months ago

CrispStrobe commented 7 months ago

this was for testing purposes and seems to work so far. what do you think, would you like to include (something like) this? it is still rudimentary (no further handling of results etc.). the question dataset was translated automatically with ChatGPT-4-turbo, and the translation script checked the output for basic consistency, but this was a quick first try, so there may be numerous glitches. i also built a quick script to translate more systematically with facebook/wmt19-en-de, but the results were linguistically worse.
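A minimal sketch of what a "basic consistency" check on a translated question set could look like. The actual check in the translation script is not shown in this thread, so everything here is an assumption: the function name `check_translation`, the dict-of-dicts layout keyed by question ID, and the `prompt` / `reference_answer` field names are hypothetical and only illustrate the idea of verifying that no question was dropped and no text field came back empty.

```python
def check_translation(original, translated, text_fields=("prompt",)):
    """Return a list of human-readable problems; an empty list means no issue found.

    Assumed (hypothetical) layout: {question_id: {field_name: value, ...}, ...}
    """
    problems = []
    # Every original question should have a translated counterpart, and vice versa.
    for qid in sorted(set(original) - set(translated)):
        problems.append(f"{qid}: missing from translation")
    for qid in sorted(set(translated) - set(original)):
        problems.append(f"{qid}: not present in original")
    for qid in sorted(set(original) & set(translated)):
        src, dst = original[qid], translated[qid]
        # The translation should not add or drop fields.
        if set(src) != set(dst):
            problems.append(f"{qid}: field names differ")
        # Translated text fields must be non-empty strings.
        for field in text_fields:
            text = dst.get(field)
            if not isinstance(text, str) or not text.strip():
                problems.append(f"{qid}: empty or missing '{field}'")
    return problems


# Hypothetical sample data, not taken from the EQ-Bench dataset.
original = {
    "1": {"prompt": "How does Anna feel?", "reference_answer": {"anger": 2}},
    "2": {"prompt": "How does Ben feel?", "reference_answer": {"joy": 7}},
}
good = {
    "1": {"prompt": "Wie fühlt sich Anna?", "reference_answer": {"anger": 2}},
    "2": {"prompt": "Wie fühlt sich Ben?", "reference_answer": {"joy": 7}},
}
bad = {
    "1": {"prompt": "", "reference_answer": {"anger": 2}},
}

print(check_translation(original, good))
print(check_translation(original, bad))
```

A structural check like this only catches dropped or empty items; it says nothing about linguistic quality, which is why benchmarking a few models afterwards is still useful.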

sam-paech commented 7 months ago

Great idea, & thanks for doing this. I don't see any reason this shouldn't work in principle, although it would be good to bench a few models to sanity check the scores.

CrispStrobe commented 7 months ago

great, thanks! i have now added the 2 translation scripts to /data as well.

CrispStrobe commented 7 months ago

first test runs seem not too far off from expectations (that discolm performs much lower in benchmarks than expected is a quandary others noted too, cf. https://www.linkedin.com/feed/update/urn:li:activity:7160912610182184960/):

myrun1,2024-02-25 05:29:17,openai_api,brezn-7b,,,59.27,v2,170.0,1,openai,,,
myrun3,2024-02-25 10:53:47,openai_api,wiedervereinigung-7b-dpo-laser,,,51.2,v2,171.0,1,openai,,,
myrun5,2024-02-25 14:58:59,openai_api,cas/nous-hermes-2-mistral-7b-dpo,,,49.96,v2,171.0,1,openai,,,
myrun4,2024-02-25 12:40:16,openai_api,marco/em_german_mistral_v01,,,45.18,v2,171.0,1,openai,,,
myrun2,2024-02-25 08:12:40,openai_api,cas/discolm-german-laser,,,43.05,v2,171.0,1,openai,,,
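The result rows above can be ranked with a short script. This is only a sketch: the rows have no header in this thread, so the column meanings are assumptions inferred from the values (column 0 as run ID, column 3 as model name, column 6 as score, column 8 as the number of parseable answers), and `parse_runs` is a hypothetical helper, not part of the benchmark code.

```python
import csv
import io

# Rows copied verbatim from the comment above.
RAW = """\
myrun1,2024-02-25 05:29:17,openai_api,brezn-7b,,,59.27,v2,170.0,1,openai,,,
myrun3,2024-02-25 10:53:47,openai_api,wiedervereinigung-7b-dpo-laser,,,51.2,v2,171.0,1,openai,,,
myrun5,2024-02-25 14:58:59,openai_api,cas/nous-hermes-2-mistral-7b-dpo,,,49.96,v2,171.0,1,openai,,,
myrun4,2024-02-25 12:40:16,openai_api,marco/em_german_mistral_v01,,,45.18,v2,171.0,1,openai,,,
myrun2,2024-02-25 08:12:40,openai_api,cas/discolm-german-laser,,,43.05,v2,171.0,1,openai,,,
"""


def parse_runs(raw):
    """Parse headerless result rows; the column indices are assumptions."""
    runs = []
    for row in csv.reader(io.StringIO(raw)):
        if not row:  # skip blank lines
            continue
        runs.append({
            "run_id": row[0],
            "model": row[3],
            "score": float(row[6]),      # assumed: benchmark score
            "parseable": float(row[8]),  # assumed: parseable answer count
        })
    return runs


# Sort best score first and print a compact ranking.
runs = sorted(parse_runs(RAW), key=lambda r: r["score"], reverse=True)
for r in runs:
    print(f"{r['model']:<36} {r['score']:6.2f}  parseable={r['parseable']:.0f}")
```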

sam-paech commented 6 months ago

2024-02-26 13:34:40
Time taken: 3.1 mins
Prompt Format: openai_api
Model: gpt-3.5-turbo-0125
Score (v2 de): 60.43
Parseable: 171.0

2024-02-26 14:27:22
Time taken: 52.7 mins
Prompt Format: openai_api
Model: gpt-4-1106-preview
Score (v2 de): 81.91
Parseable: 170.0

These results are very close to the en version, which tells me this has worked. I'll stress test it a bit more then merge your changes.

sam-paech commented 6 months ago

Btw are you in the DiscoResearch discord group? I think they will be interested in this.

https://discord.gg/DWSgKd7z

sam-paech commented 6 months ago

I've merged this into the new v2.1 branch. Will make this the default shortly. Thx for your work on this!