That is a good catch - I generally find the new scores really interesting. I was considering that, when we make the MTEB lite, we also take the correlation with MTEB Arena into account.
Edit: it might be worth contacting the model's authors to confirm the implementation.
Actually, I think I found a mistake: we were not loading its tokenizer with trust_remote_code=True. The tokenizer still works without it, but it will not append an eos_token, which the repo's custom tokenizer code otherwise does. Very subtle. I took the model down from the leaderboard temporarily and will put it back up soon.
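A minimal sketch of how the difference shows up, assuming the standard transformers AutoTokenizer API; the example text is illustrative and the expected prints reflect the behavior described above, not verified output:

```python
from transformers import AutoTokenizer

MODEL = "Alibaba-NLP/gte-Qwen2-7B-instruct"

# Stock tokenizer: loads fine, but the repo's custom tokenizer code never runs.
tok_default = AutoTokenizer.from_pretrained(MODEL)

# With trust_remote_code the custom tokenizer appends eos_token to each sequence.
tok_remote = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)

text = "what is a transformer?"
print(tok_default(text)["input_ids"][-1] == tok_default.eos_token_id)  # expected False
print(tok_remote(text)["input_ids"][-1] == tok_remote.eos_token_id)    # expected True
```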
While it is still early to judge the results given the high confidence intervals on the leaderboard, it is becoming clear that Alibaba-NLP/gte-Qwen2-7B-instruct performs very poorly. This is despite it performing surprisingly well on multiple benchmarks (MTEB, BRIGHT - see https://huggingface.co/spaces/mteb/leaderboard). Some investigation into why that's the case would be interesting. I have already double-checked our implementation multiple times and it produces the same results as the script in their model card, so I don't think that's the issue.
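For reference, a sketch of the kind of parity check described above, assuming the last-token pooling recipe from the gte-Qwen2 model card; the helper name and the comparison placeholder are illustrative, not the exact evaluation code:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL = "Alibaba-NLP/gte-Qwen2-7B-instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)
model.eval()

def last_token_pool(hidden_states, attention_mask):
    # Pool the hidden state of the last non-padding token (right padding assumed).
    seq_lens = attention_mask.sum(dim=1) - 1
    return hidden_states[torch.arange(hidden_states.size(0)), seq_lens]

batch = tokenizer(["what is a transformer?"], padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state
emb_reference = F.normalize(last_token_pool(hidden, batch["attention_mask"]), dim=-1)

# Compare against the embedding our evaluation pipeline produces
# (emb_ours is a placeholder for whatever the pipeline returns):
# assert torch.allclose(emb_reference, emb_ours, atol=1e-5)
```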