Closed: charliedream1 closed this issue 6 months ago.
Please first check the pinned issue and see if your memory profiling matches ours.
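(For reference, a minimal sketch of how one might measure peak GPU memory for inference and compare it with the pinned numbers; the checkpoint name and prompt are only illustrative and are not taken from this thread.)

```python
# Sketch: measure peak GPU memory for a short generation.
# Assumes PyTorch with CUDA and transformers; "Qwen/Qwen1.5-7B-Chat" is an
# example checkpoint, not one mentioned in this issue.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").to("cuda")

torch.cuda.reset_peak_memory_stats()
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
model.generate(**inputs, max_new_tokens=128)
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```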
For inference it is normal, but training takes more memory
Maybe you chose the eager (default) mode for attention. Here is the relevant part of the source code:

```python
QWEN2_ATTENTION_CLASSES = {
    "eager": Qwen2Attention,
    "flash_attention_2": Qwen2FlashAttention2,
    "sdpa": Qwen2SdpaAttention,
}

self.self_attn = QWEN2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)
```
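(A small sketch, not part of the original comment, showing how to check which attention class a loaded model actually ended up with; the checkpoint name is only an example.)

```python
# Sketch: inspect which attention implementation was resolved at load time.
# Assumes transformers >= 4.37 with the Qwen2 architecture available.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-7B-Chat",          # example checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",      # or "flash_attention_2" / "eager"
)

print(model.config._attn_implementation)               # e.g. "sdpa"
print(type(model.model.layers[0].self_attn).__name__)  # e.g. "Qwen2SdpaAttention"
```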
I didn't see this in config.json. What should I do to change the eager mode? What does this mode do, and will changing it impact performance?
Two methods for you; both work for me:

1. Add this parameter to config.json: `"_attn_implementation": "sdpa"`
2. Add this parameter in the SFT code:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    cache_dir=training_args.cache_dir,
    device_map=device_map,
    attn_implementation="sdpa",  # Add attn_implementation
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
    )
    if training_args.use_lora and lora_args.q_lora
    else None,
    **model_load_kwargs,
)
```

Thanks for the reply, I'll try it.
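(A quick way to confirm that the config.json route took effect; a sketch assuming a local checkpoint directory, named `./qwen1.5-checkpoint` here purely for illustration.)

```python
# Sketch: verify that "_attn_implementation": "sdpa" added to config.json is
# picked up when the config is loaded (directory name is illustrative).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("./qwen1.5-checkpoint")
print(config._attn_implementation)  # expected: "sdpa"
```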
On my side, even with flash-attn 2.5.7, I also observe memory usage far higher than Qwen1. Is there any way to optimize this?
Qwen (1.0) will automatically enable flash attention if it is installed, which is no longer the case for Qwen1.5. To enable flash attention in Qwen1.5, please follow the instructions in the official transformers documentation at https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2. In short, ensure that `attn_implementation` is set to "flash_attention_2" and `torch_dtype` is set to "auto", torch.bfloat16, or torch.float16 when calling `from_pretrained` for it to take effect.

We don't recommend bitsandbytes, as you may suffer from substantial accuracy loss. If you must use quantization, try loading the GPTQ or the AWQ version and then use QLoRA.
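(A minimal sketch following the linked documentation; the checkpoint name and prompt are only illustrative, and flash-attn must be installed with a supported GPU.)

```python
# Sketch: load a Qwen1.5 model with FlashAttention-2, per the transformers docs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-7B-Chat"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",                      # or torch.bfloat16 / torch.float16
    attn_implementation="flash_attention_2",
    device_map="auto",
)

inputs = tokenizer("Hello, Qwen!", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```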
I think this may be related to this transformers issue: https://github.com/huggingface/transformers/issues/30860, since many models are affected. In the Qwen code, there is no `logits = logits.float()`.
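(For scale, a rough, purely illustrative calculation of how large the logits tensor is and how much an fp32 upcast adds; the batch size and sequence length below are made-up numbers, not measurements from this issue.)

```python
# Rough illustration only: size of the logits tensor [batch, seq, vocab]
# in half precision vs. float32 (batch/seq values are made up).
batch, seq_len, vocab = 4, 2048, 151936  # Qwen1.5 vocab size is roughly 152k

half_bytes = batch * seq_len * vocab * 2   # bf16/fp16
fp32_bytes = batch * seq_len * vocab * 4   # after logits.float()
print(f"logits in bf16/fp16: {half_bytes / 2**30:.2f} GiB")
print(f"logits in float32:   {fp32_bytes / 2**30:.2f} GiB")
# logits.float() allocates a new fp32 copy, so peak memory briefly
# holds both the half-precision and the float32 tensors.
```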
I've tested this: with dp3 and flash_attention_2, fine-tuning the 72B Qwen1.5 on 16 A10 GPUs can go up to 2048 tokens, while under the same settings Qwen2 only reaches 1024 tokens. Memory consumption has increased considerably. Has the model architecture changed?
Shrinking the context window and the positional embedding size doesn't seem to help at all. What is causing the increase in memory usage? Compared with the first version of Qwen, memory usage has gone up a lot.