embeddings-benchmark / arena

Code for the MTEB Arena
https://hf.co/spaces/mteb/arena

Why is Alibaba-NLP/gte-Qwen2-7B-instruct so bad? #30

Closed Muennighoff closed 3 months ago

Muennighoff commented 3 months ago

While it is still early to judge the results given the high confidence intervals in the leaderboard, it is becoming clear that Alibaba-NLP/gte-Qwen2-7B-instruct performs very poorly. This is despite it performing surprisingly well on multiple benchmarks (MTEB, BRIGHT; see https://huggingface.co/spaces/mteb/leaderboard). Some investigation into why that's the case would be interesting. I already double-checked our implementation multiple times and it produces the same results as the script in their model card, so I don't think that's the issue.
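For reference, the kind of cross-check described here can be sketched as comparing the embeddings from the two code paths numerically. The `embed_arena` / `embed_model_card` functions below are hypothetical stand-ins (stubbed with deterministic vectors so the sketch runs self-contained); in practice they would wrap the Arena's implementation and the model card's script.

```python
import numpy as np

# Hypothetical stand-in for the Arena's embedding path.
def embed_arena(texts):
    rng = np.random.default_rng(0)  # stub: deterministic vectors
    return rng.standard_normal((len(texts), 8))

# Hypothetical stand-in for the model card's reference script.
def embed_model_card(texts):
    rng = np.random.default_rng(0)  # same stub -> identical output
    return rng.standard_normal((len(texts), 8))

def implementations_agree(texts, atol=1e-5):
    """Return True if both implementations produce (near-)identical embeddings."""
    return np.allclose(embed_arena(texts), embed_model_card(texts), atol=atol)
```

A disagreement here would point at a preprocessing or pooling difference; agreement (as in this issue) means the bug, if any, is shared by both paths.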

KennethEnevoldsen commented 3 months ago

That is a good catch - I generally find the new scores really interesting. I was considering that when we make the MTEB lite, we also look at its correlation with MTEB Arena.

Edit: it might be worth contacting the model's authors to confirm the implementation.

Muennighoff commented 3 months ago

Actually, I think I found a mistake: we were not loading its tokenizer with trust_remote_code=True. It still works, but it will not append an eos_token, which the custom tokenizer code otherwise does. Very subtle. Took it down from the LB temporarily and will put it back on soon.
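A minimal sanity check of the kind that would catch this is to verify that an encoded input actually ends with the tokenizer's EOS id. The real loading call (shown commented out) needs network access and trust_remote_code; the stub tokenizer below is a hypothetical stand-in so the check runs self-contained.

```python
# Real loading path (requires network + custom code; sketched, not run here):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained(
#     "Alibaba-NLP/gte-Qwen2-7B-instruct",
#     trust_remote_code=True,  # without this, the custom code that appends
#                              # the eos_token is silently not loaded
# )

def appends_eos(tokenizer, text: str) -> bool:
    """Return True if the tokenizer's encoding of `text` ends in its EOS id."""
    ids = tokenizer.encode(text)
    return len(ids) > 0 and ids[-1] == tokenizer.eos_token_id

# Hypothetical stub standing in for a correctly loaded tokenizer.
class StubTokenizer:
    eos_token_id = 2
    def encode(self, text):
        return [5, 7, 2]  # pretend-encoding that ends in EOS
```

Running `appends_eos` on a correctly loaded tokenizer should return True; the buggy path (no trust_remote_code) would return False, which is exactly the subtle difference described above.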