Closed CrispStrobe closed 6 months ago
Great idea, & thanks for doing this. I don't see any reason this shouldn't work in principle, although it would be good to bench a few models to sanity check the scores.
Great, thanks! I've now added the two translation scripts in /data as well.
First test runs seem not too far off from expectations (that discolm performs much lower in benchmarks than expected is a quandary others have noted too, cf. https://www.linkedin.com/feed/update/urn:li:activity:7160912610182184960/):
myrun1,2024-02-25 05:29:17,openai_api,brezn-7b,,,59.27,v2,170.0,1,openai,,,
myrun3,2024-02-25 10:53:47,openai_api,wiedervereinigung-7b-dpo-laser,,,51.2,v2,171.0,1,openai,,,
myrun5,2024-02-25 14:58:59,openai_api,cas/nous-hermes-2-mistral-7b-dpo,,,49.96,v2,171.0,1,openai,,,
myrun4,2024-02-25 12:40:16,openai_api,marco/em_german_mistral_v01,,,45.18,v2,171.0,1,openai,,,
myrun2,2024-02-25 08:12:40,openai_api,cas/discolm-german-laser,,,43.05,v2,171.0,1,openai,,,
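For anyone who wants to compare these runs at a glance, here is a minimal sketch that parses the rows above and ranks the models by score. The field positions are an assumption read off the rows themselves (index 3 = model, 6 = score, 8 = parseable count, matching the "Score" / "Parseable" labels in the run summaries); the names `RAW_ROWS` and `parse_rows` are made up for illustration and are not part of the benchmark code.

```python
import csv
import io

# Rows copied verbatim from the comment above.
RAW_ROWS = """\
myrun1,2024-02-25 05:29:17,openai_api,brezn-7b,,,59.27,v2,170.0,1,openai,,,
myrun3,2024-02-25 10:53:47,openai_api,wiedervereinigung-7b-dpo-laser,,,51.2,v2,171.0,1,openai,,,
myrun5,2024-02-25 14:58:59,openai_api,cas/nous-hermes-2-mistral-7b-dpo,,,49.96,v2,171.0,1,openai,,,
myrun4,2024-02-25 12:40:16,openai_api,marco/em_german_mistral_v01,,,45.18,v2,171.0,1,openai,,,
myrun2,2024-02-25 08:12:40,openai_api,cas/discolm-german-laser,,,43.05,v2,171.0,1,openai,,,
"""


def parse_rows(raw):
    """Parse result rows into dicts (column meanings are assumed, see note above)."""
    results = []
    for fields in csv.reader(io.StringIO(raw)):
        if not fields:
            continue  # skip blank lines
        results.append(
            {
                "run_id": fields[0],
                "timestamp": fields[1],
                "model": fields[3],
                "score": float(fields[6]),
                "parseable": float(fields[8]),
            }
        )
    return results


# Highest score first.
ranked = sorted(parse_rows(RAW_ROWS), key=lambda r: r["score"], reverse=True)

for row in ranked:
    print(f"{row['model']:<40} {row['score']:6.2f}  parseable={row['parseable']:.0f}")
```

Sorting by score makes it easy to line these runs up against the gpt-3.5-turbo and gpt-4 reference runs reported further down.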
2024-02-26 13:34:40 Time taken: 3.1 mins Prompt Format: openai_api Model: gpt-3.5-turbo-0125 Score (v2 de): 60.43 Parseable: 171.0
2024-02-26 14:27:22 Time taken: 52.7 mins Prompt Format: openai_api Model: gpt-4-1106-preview Score (v2 de): 81.91 Parseable: 170.0
These results are very close to the en version, which tells me this has worked. I'll stress test it a bit more then merge your changes.
Btw are you in the DiscoResearch discord group? I think they will be interested in this.
I've merged this into new v2.1 branch. Will make this default shortly. Thx for your work on this!
This was for testing purposes and seems to work so far. What do you think, would you like to include (something like) this? It is still rudimentary (no further handling of results etc.). The question dataset was automatically translated by ChatGPT-4-turbo, and the translation script checked for basic consistency, but this was a quick first try and there may of course be numerous glitches. I also built a quick script to translate more systematically with facebook/wmt19-en-de, but the results were linguistically worse.
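To give an idea of what a basic consistency check on a machine-translated prompt could look like, here is a rough sketch. It is not the check used by the script in /data; the function name `basic_consistency_issues` and the three heuristics (non-empty output, same numbers, same line count) are assumptions chosen for illustration.

```python
import re


def basic_consistency_issues(source, translated):
    """Return a list of cheap structural problems found in a translation.

    Heuristics only (all assumed for this sketch): the translation must be
    non-empty, keep the same numbers, and keep the same number of lines.
    """
    issues = []

    if not translated.strip():
        issues.append("empty translation")

    # Numbers such as rating scales should survive translation unchanged.
    if sorted(re.findall(r"\d+", source)) != sorted(re.findall(r"\d+", translated)):
        issues.append("numbers differ")

    # A changed line count often means a dropped or merged answer line.
    if len(source.splitlines()) != len(translated.splitlines()):
        issues.append("line count differs")

    return issues


# Hypothetical example pair, not taken from the actual dataset.
print(basic_consistency_issues("Rate 0-10:\nAnger", "Bewerte 0-10:\nWut"))
print(basic_consistency_issues("Rate 0-10", "Bewerte 0-5"))
```

Checks like these only catch structural breakage; they say nothing about whether the translation reads well, which still needs manual spot checks.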