Great question! And thanks for all the info to help debug.
Ok, so I had a lot of trouble with the command-r models when I was benchmarking them. It's possible that the reason was the extra BOS token, which I guess was likely added since I'm not instantiating the tokenizer with `add_special_tokens=False`, per the discussion you linked.
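For reference, this is roughly how the double BOS would show up with transformers. A minimal sketch only: the Cohere special-token strings are from memory, so double-check them against the model card.

```python
from transformers import AutoTokenizer

# Minimal sketch (assumes the stock CohereForAI/c4ai-command-r-plus tokenizer):
# if the prompt string already spells out the template's special tokens,
# the tokenizer's default behaviour can prepend a second BOS on top.
tok = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")

prompt = (
    "<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello"
    "<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"
)

ids_default = tok(prompt).input_ids                             # may start with two BOS ids
ids_explicit = tok(prompt, add_special_tokens=False).input_ids  # only the BOS written into the prompt

print(ids_default[:3])
print(ids_explicit[:3])
```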
The way I ran these models for the leaderboard scores is:
- EQ-Bench: transformers using the above Cohere template
- MAGI: lm-eval + transformers

- EQ-Bench: either transformers using the above Cohere template, or the huggingface pro api (can't remember which). I just re-ran it with the huggingface api and got a near-identical result to the leaderboard score. I'm not sure which template they use.
- MAGI: lm-eval + transformers
- Creative Writing: huggingface pro api

- EQ-Bench: together.ai api
- MAGI: lm-eval + transformers
- Creative Writing: together.ai api
So based on that, and given we don't know what template the huggingface pro api or together.ai use for these models, I can confidently say that I don't know why it's underperforming! It might be that they are adding an extra BOS token. I don't know how much difference that would make; it might be worth running an A:B comparison on EQ-Bench / creative writing with the extra BOS token included vs. not.

Otherwise, it's good to be mindful that the creative writing benchmark is pretty subjective. For instance, the Qwen 110B model has a lot of GPT slop which I hate but which Sonnet doesn't seem to care about.
https://github.com/EQ-bench/EQ-Bench/blob/main_v2_4/instruction-templates/Cohere.yaml
Is there a chance that the `<BOS_TOKEN>` was getting added twice during the tests: https://huggingface.co/CohereForAI/c4ai-command-r-plus/discussions/22#66179da37ed574892089967c
Not sure which backend was used to test against, but some will add it automatically, and the HF config looks to have it in both places.
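Something like the following should show both sources at once; the attribute names are from memory for the stock HF tokenizer, so treat it as a sketch rather than a definitive check:

```python
from transformers import AutoTokenizer

# Sketch: check both places a BOS can come from for command-r-plus.
tok = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")

# 1) The tokenizer may prepend BOS itself when add_special_tokens is left at its default.
print(getattr(tok, "add_bos_token", None))

# 2) The chat template may also put a BOS at the start of the rendered prompt.
rendered = tok.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(rendered.startswith(tok.bos_token))

# Tokenizing the rendered string with the defaults would then give two BOS ids in a row.
print(tok(rendered).input_ids[:3])
```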
The only reason I ask is I've been running some similar tests and I can't understand how `Qwen1.5-110B-Chat` scores so highly yet both `command-r` models score much lower? I've tried all the `qwen-2` and `qwen-1.5` models and they all seem to be really bad at "in the style of" prompts, which a lot of your benchmarks look to ask for? I'd love to find out exactly what was getting sent as the template for `qwen` and `command-r` to see if I can get to the bottom of what is happening!

Just checked the `chat-ml` template and it is a little odd too: https://github.com/EQ-bench/EQ-Bench/blob/main_v2_4/instruction-templates/ChatML.yaml
Not sure why there is an extra pipe, new line and spaces like that?
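For comparison, this is roughly what a rendered ChatML turn is normally expected to look like (a sketch, not taken from the repo's YAML); any extra pipe, newline or indentation in the template would end up as literal characters in the prompt and change the tokenization:

```python
# Sketch of the canonical ChatML layout for a system + user turn plus the
# assistant header; stray whitespace from a template is inserted verbatim.
chatml_prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "Write a short story in the style of Hemingway.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(chatml_prompt)
```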