When you run the script, did you set "attn_implementation": "flash_attention_2"? I noticed a performance degradation when flash attention is not used for gemma-2-27b-it. However, this is not the case for other models I've tried.
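For reference, this is the kind of load-time setting I mean. A minimal sketch, assuming the model is loaded through transformers' from_pretrained (the model ID is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder ID; a local path works the same way.
model_id = "google/gemma-2-27b-it"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```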
Also @ToSev7en, can you share how you ran it? I ran it with the TogetherAI API because vLLM was broken when I implemented it. We could figure out how to store the scores for both, if you work out the details Chris mentioned.
I downloaded gemma-2-27b-it from the Hugging Face Hub, and here is its config.json:
{
"architectures": [
"Gemma2ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"attn_logit_softcapping": 50.0,
"bos_token_id": 2,
"cache_implementation": "hybrid",
"eos_token_id": 1,
"final_logit_softcapping": 30.0,
"head_dim": 128,
"hidden_act": "gelu_pytorch_tanh",
"hidden_activation": "gelu_pytorch_tanh",
"hidden_size": 4608,
"initializer_range": 0.02,
"intermediate_size": 36864,
"max_position_embeddings": 8192,
"model_type": "gemma2",
"num_attention_heads": 32,
"num_hidden_layers": 46,
"num_key_value_heads": 16,
"pad_token_id": 0,
"query_pre_attn_scalar": 144,
"rms_norm_eps": 1e-06,
"rope_theta": 10000.0,
"sliding_window": 4096,
"sliding_window_size": 4096,
"torch_dtype": "bfloat16",
"transformers_version": "4.42.0.dev0",
"use_cache": true,
"vocab_size": 256000
}
There is indeed no "attn_implementation": "flash_attention_2" in it, so should I add it to the config.json?
@natolambert I ran it with run_generative.py based on the latest version of vllm. It works, but the scores don't match the leaderboard. I will try the details Chris mentioned, thanks.
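In case it helps anyone reproduce this, here is a minimal sketch of a direct vLLM run, independent of run_generative.py's actual flags (the model ID and prompt are placeholders):

```python
from vllm import LLM, SamplingParams

# Placeholder model ID; in this thread it was gemma-2-27b-it used as a generative judge.
llm = LLM(model="google/gemma-2-27b-it", dtype="bfloat16")

params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["[prompt asking the judge to compare two responses]"], params)
print(outputs[0].outputs[0].text)
```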
@ToSev7en, the issue occurred when I used a gemma-2-27b-it trained as a sequence classifier. However, I'm not sure if the same problem would happen with a generative model. I'll try it when I have the time.
Interesting @ToSev7en that the API was better... this is normal evaluation rabbit holes :)
@chrisliu298 @natolambert Hmm... it looks like an intrinsic implementation flaw in vLLM. The code below is copied from https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/gemma2.py:
# FIXME(woosuk): While Gemma 2 uses sliding window attention for every
# odd layer, vLLM currently ignores it and uses global attention for
# all layers.
use_sliding_window = (layer_idx % 2 == 1 and config.sliding_window is not None)
del use_sliding_window # Unused.
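To make the consequence of that FIXME concrete, here is a small conceptual sketch (not vLLM or transformers code) of the layer pattern Gemma 2 is supposed to use, based on the comment above and the config.json posted earlier:

```python
# Per the FIXME: odd layers should use a 4096-token sliding window,
# even layers global attention. vLLM (at the time of that snippet)
# used global attention on every layer, which only diverges from the
# intended behavior once the input exceeds the window size.
num_hidden_layers = 46   # from config.json
sliding_window = 4096    # from config.json

intended = [sliding_window if i % 2 == 1 else None for i in range(num_hidden_layers)]
vllm_actual = [None] * num_hidden_layers  # None = global attention

differing = sum(a != b for a, b in zip(intended, vllm_actual))
print(f"{differing}/{num_hidden_layers} layers lose their sliding window "
      f"for inputs longer than {sliding_window} tokens")
```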
It seems like the symptom I described above is different from the one described by @ToSev7en, because I am training a sequence classifier. However, they might be related.
In my case, I manually pass the attn_implementation arg to the model_kwargs here. Nothing else is changed, and the inference uses the following command:
python scripts/run_rm.py --model /path/to/my/rm --batch_size 1 --torch_dtype bfloat16 --do_not_save
Using eager (relative to flash_attention_2) decreases the average score (across the four categories) by 0.086, which is expected in my opinion. However, using sdpa (or not specifying the arg at all, since sdpa is the default for torch>=2.1.1) reduces the average score by 12. This is a huge difference, and it's weird that only sdpa has this problem. Note that the only change here is attn_implementation; everything else stays the same.
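For concreteness, the only varied piece amounts to something like the sketch below (the exact plumbing inside run_rm.py may differ; the checkpoint path is a placeholder):

```python
import torch
from transformers import AutoModelForSequenceClassification

model_kwargs = {
    "torch_dtype": torch.bfloat16,
    # The only argument varied between runs:
    # "flash_attention_2", "eager", or "sdpa" (the torch>=2.1.1 default).
    "attn_implementation": "flash_attention_2",
}
model = AutoModelForSequenceClassification.from_pretrained("/path/to/my/rm", **model_kwargs)
```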
Update: I also tested gemma-2-2b-it, but sdpa did not lead to a severe drop in performance there.
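As a sanity check, the backend that transformers actually resolved can be inspected after loading. A small sketch; note that _attn_implementation is an internal attribute and may change across transformers versions:

```python
import torch
from transformers import AutoModelForCausalLM

# No attn_implementation passed, so transformers picks its default
# ("sdpa" on torch >= 2.1.1 when flash-attn is not requested).
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",  # placeholder; the same check applies to the 27B model
    torch_dtype=torch.bfloat16,
)

# Internal attribute (subject to change) that records the selected backend.
print(model.config._attn_implementation)
```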
I evaluated gemma-2-27b-it: {'Chat': 0.8938547486033519, 'Chat Hard': 0.6085526315789473, 'Safety': 0.8867946647946647, 'Reasoning': 0.7705588066786708}
while