Closed: charliedream1 closed this issue 6 months ago.
Please first check the pinned issue and see if your memory profiling matches ours.
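(For reference, a minimal sketch of how one might measure peak GPU memory for inference and compare it with the pinned numbers; the checkpoint name and prompt are only illustrative and are not taken from this thread.)

```python
# Sketch: measure peak GPU memory for a short generation.
# Assumes PyTorch with CUDA and transformers; "Qwen/Qwen1.5-7B-Chat" is an
# example checkpoint, not one mentioned in this issue.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").to("cuda")

torch.cuda.reset_peak_memory_stats()
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
model.generate(**inputs, max_new_tokens=128)
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```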
For inference it is normal, but training takes more memory
Maybe you chose the eager (default) mode for attention. Here is the relevant part of the source code:

```python
QWEN2_ATTENTION_CLASSES = {
    "eager": Qwen2Attention,
    "flash_attention_2": Qwen2FlashAttention2,
    "sdpa": Qwen2SdpaAttention,
}

self.self_attn = QWEN2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)
```
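(A small sketch, not part of the original comment, showing how to check which attention class a loaded model actually ended up with; the checkpoint name is only an example.)

```python
# Sketch: inspect which attention implementation was resolved at load time.
# Assumes transformers >= 4.37 with the Qwen2 architecture available.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-7B-Chat",          # example checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",      # or "flash_attention_2" / "eager"
)

print(model.config._attn_implementation)               # e.g. "sdpa"
print(type(model.model.layers[0].self_attn).__name__)  # e.g. "Qwen2SdpaAttention"
```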
I didn't see this in config.json. What should I do to change the eager mode? What does this mode do, and will changing it impact performance?
Two methods for you; both work for me:

1. Add this parameter to config.json: `"_attn_implementation": "sdpa"`
2. Add this parameter in the SFT code:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    cache_dir=training_args.cache_dir,
    device_map=device_map,
    attn_implementation="sdpa",  # Add attn_implementation
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
    )
    if training_args.use_lora and lora_args.q_lora
    else None,
    **model_load_kwargs,
)
```

Thanks for the reply, I'll try it.
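(A quick way to confirm that the config.json route took effect; a sketch assuming a local checkpoint directory, named `./qwen1.5-checkpoint` here purely for illustration.)

```python
# Sketch: verify that "_attn_implementation": "sdpa" added to config.json is
# picked up when the config is loaded (directory name is illustrative).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("./qwen1.5-checkpoint")
print(config._attn_implementation)  # expected: "sdpa"
```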
On my side, even with flash-attn 2.5.7, I also observe memory usage far higher than Qwen1. Is there any way to optimize this?
Qwen (1.0) will automatically enable flash attention if it is installed, which is no longer the case for Qwen1.5. To enable flash attention in Qwen1.5, please follow the instructions in the official transformers documentation at https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2. In short, ensure that `attn_implementation` is set to "flash_attention_2" and `torch_dtype` is set to "auto", torch.bfloat16, or torch.float16 when calling `from_pretrained` for it to take effect.

We don't recommend bitsandbytes, as you may suffer from substantial accuracy loss. If you must use quantization, try loading the GPTQ or the AWQ version and then use QLoRA.
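(A minimal sketch following the linked documentation; the checkpoint name and prompt are only illustrative, and flash-attn must be installed with a supported GPU.)

```python
# Sketch: load a Qwen1.5 model with FlashAttention-2, per the transformers docs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-7B-Chat"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",                      # or torch.bfloat16 / torch.float16
    attn_implementation="flash_attention_2",
    device_map="auto",
)

inputs = tokenizer("Hello, Qwen!", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```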
I think this may be related to this transformers issue: https://github.com/huggingface/transformers/issues/30860, since many models are affected. In the Qwen code, there is no `logits = logits.float()`.
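(For scale, a rough, purely illustrative calculation of how large the logits tensor is and how much an fp32 upcast adds; the batch size and sequence length below are made-up numbers, not measurements from this issue.)

```python
# Rough illustration only: size of the logits tensor [batch, seq, vocab]
# in half precision vs. float32 (batch/seq values are made up).
batch, seq_len, vocab = 4, 2048, 151936  # Qwen1.5 vocab size is roughly 152k

half_bytes = batch * seq_len * vocab * 2   # bf16/fp16
fp32_bytes = batch * seq_len * vocab * 4   # after logits.float()
print(f"logits in bf16/fp16: {half_bytes / 2**30:.2f} GiB")
print(f"logits in float32:   {fp32_bytes / 2**30:.2f} GiB")
# logits.float() allocates a new fp32 copy, so peak memory briefly
# holds both the half-precision and the float32 tensors.
```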
I've tested this: with dp3 and flash_attention_2, fine-tuning the 72B Qwen1.5 on 16 A10 GPUs can go up to 2048 tokens, while under the same settings Qwen2 only reaches 1024 tokens. Memory consumption has increased considerably. Has the model architecture changed?
Shrinking the context window and the positional embedding size doesn't seem to help at all. What is causing the increase in memory usage? Compared with the first version of Qwen, memory usage has gone up a lot.