allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

The metrics of gemma-2-27b-it do not match the metrics on the leaderboard #163

Closed ToSev7en closed 1 month ago

ToSev7en commented 2 months ago

I evaluated gemma-2-27b-it and got:

{'Chat': 0.8938547486033519, 'Chat Hard': 0.6085526315789473, 'Safety': 0.8867946647946647, 'Reasoning': 0.7705588066786708}

while the leaderboard reports:

[screenshot of the leaderboard scores]

chrisliu298 commented 2 months ago

When you run the script, did you set "attn_implementation": "flash_attention_2"? I noticed a performance degradation when flash attention is not used for gemma-2-27b-it. However, this is not the case for other models I've tried.
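
For reference, attn_implementation is passed at load time rather than written into config.json; a minimal sketch with transformers (the checkpoint path is a placeholder, and the rewardbench pipeline routes this through its own model_kwargs rather than calling from_pretrained directly like this):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = "/path/to/my/rm"  # placeholder reward-model checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # vs. the "sdpa" default or "eager"
    device_map="auto",
)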

natolambert commented 2 months ago

Also @ToSev7en, can you share how you ran it? I ran it with the Together AI API because vLLM was broken when I implemented it. We could figure out how to store the scores for both if you work out the details Chris mentioned.

ToSev7en commented 2 months ago

> When you run the script, did you set "attn_implementation": "flash_attention_2"? I noticed a performance degradation when flash attention is not used for gemma-2-27b-it. However, this is not the case for other models I've tried.

I downloaded gemma-2-27b-it from the Hugging Face Hub, and here is its config.json:

{
  "architectures": [
    "Gemma2ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "attn_logit_softcapping": 50.0,
  "bos_token_id": 2,
  "cache_implementation": "hybrid",
  "eos_token_id": 1,
  "final_logit_softcapping": 30.0,
  "head_dim": 128,
  "hidden_act": "gelu_pytorch_tanh",
  "hidden_activation": "gelu_pytorch_tanh",
  "hidden_size": 4608,
  "initializer_range": 0.02,
  "intermediate_size": 36864,
  "max_position_embeddings": 8192,
  "model_type": "gemma2",
  "num_attention_heads": 32,
  "num_hidden_layers": 46,
  "num_key_value_heads": 16,
  "pad_token_id": 0,
  "query_pre_attn_scalar": 144,
  "rms_norm_eps": 1e-06,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "sliding_window_size": 4096,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.42.0.dev0",
  "use_cache": true,
  "vocab_size": 256000
}

There is indeed no "attn_implementation": "flash_attention_2" in it, so should I add it to the config.json?

ToSev7en commented 2 months ago

> Also @ToSev7en, can you share how you ran it? I ran it with the Together AI API because vLLM was broken when I implemented it. We could figure out how to store the scores for both if you work out the details Chris mentioned.

@natolambert I ran it with run_generative.py on the latest version of vLLM; it works, but the scores do not reach the leaderboard numbers. I will try the details Chris mentioned, thanks.

chrisliu298 commented 2 months ago

@ToSev7en, the issue occurred when I used a gemma-2-27b-it trained as a sequence classifier. However, I'm not sure if the same problem would happen with a generative model. I'll try it when I have the time.

natolambert commented 2 months ago

Interesting @ToSev7en that the API was better... these are the normal evaluation rabbit holes :)

ToSev7en commented 2 months ago

> Interesting @ToSev7en that the API was better... these are the normal evaluation rabbit holes :)

@chrisliu298 @natolambert Hmm... it turns out to be an intrinsic implementation limitation in vLLM.

The code below is copied from https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/gemma2.py:

# FIXME(woosuk): While Gemma 2 uses sliding window attention for every
# odd layer, vLLM currently ignores it and uses global attention for
# all layers.
use_sliding_window = (layer_idx % 2 == 1 and config.sliding_window is not None)
del use_sliding_window  # Unused.
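
As a toy sketch (mirroring the check in the quoted vLLM code and the config.json above), this is the layer pattern Gemma 2 intends to use:

# Toy sketch: Gemma 2 alternates global attention and 4096-token sliding-window
# attention; the quoted vLLM check would mark every odd layer as sliding-window.
num_hidden_layers = 46   # from the config.json above
sliding_window = 4096    # from the config.json above

sliding_layers = [
    layer_idx
    for layer_idx in range(num_hidden_layers)
    if layer_idx % 2 == 1 and sliding_window is not None
]
print(f"{len(sliding_layers)}/{num_hidden_layers} layers should use sliding-window attention")
# At the time of this issue, vLLM instead ran global attention on every layer.
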
chrisliu298 commented 2 months ago

It seems like the symptom I described above is different from the one described by @ToSev7en, because I am training a sequence classifier. However, they might be related.

In my case, I manually pass the attn_implementation arg into model_kwargs in run_rm.py. Nothing else is changed, and inference uses the following command:

python scripts/run_rm.py --model /path/to/my/rm --batch_size 1 --torch_dtype bfloat16 --do_not_save

Using eager (relative to flash_attention_2) decreases the average score (across the four categories) by 0.086, which is expected in my opinion.

However, using sdpa (or not specifying the arg at all, since sdpa is the default for torch>=2.1.1) reduces the average score by 12. This is a huge difference, and it's weird that only sdpa had this problem. Note that the only change here is attn_implementation; everything else stays the same.

Update: I also tested it with gemma-2-2b-it, but sdpa did not lead to a severe drop in performance.
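
A quick way to isolate this is to score the same text with each backend and compare; a rough sketch assuming a sequence-classifier reward model (the checkpoint path and example text are placeholders):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = "/path/to/my/rm"  # placeholder checkpoint
text = "User: Is the sky blue?\nAssistant: Yes, mainly due to Rayleigh scattering."

tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer(text, return_tensors="pt")

# Load the same weights with each attention implementation and compare the
# resulting reward scores; large gaps point at the attention backend.
for impl in ("eager", "sdpa", "flash_attention_2"):
    model = AutoModelForSequenceClassification.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        attn_implementation=impl,
        device_map="auto",
    )
    with torch.no_grad():
        batch = {k: v.to(model.device) for k, v in inputs.items()}
        score = model(**batch).logits.squeeze().float().item()
    print(f"{impl}: {score:.4f}")
    del model
    torch.cuda.empty_cache()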