alibaba / Pai-Megatron-Patch

The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.
Apache License 2.0

cannot import name 'TEDotProductAttentionMLA' when running `examples/deepseek_v2/run_mcore_deepseek.sh` #359

Open dreasysnail opened 1 month ago

dreasysnail commented 1 month ago

Thank you for the great project! When I run `examples/deepseek_v2/run_mcore_deepseek.sh`, I get the error below:

```
Traceback (most recent call last):
  File "/mnt/task_runtime/examples/deepseek_v2/pretrain_deepseek.py", line 37, in <module>
    from megatron_patch.model.deepseek_v2.layer_specs import (
  File "/mnt/task_runtime/megatron_patch/model/deepseek_v2/layer_specs.py", line 19, in <module>
    from megatron.core.transformer.custom_layers.transformer_engine import (
ImportError: cannot import name 'TEDotProductAttentionMLA' from 'megatron.core.transformer.custom_layers.transformer_engine' (/mnt/task_runtime/PAI-Megatron-LM-240718/megatron/core/transformer/custom_layers/transformer_engine.py)
```

It appears that the code at this link is attempting to import `TEDotProductAttentionMLA`, but when I checked the `megatron.core.transformer.custom_layers.transformer_engine` module, I could not find `TEDotProductAttentionMLA` defined there.

Any help appreciated!

dreasysnail commented 1 month ago

@Jiayi-Pan

NiuMa-1234 commented 4 weeks ago

Hi, have you solved the problem? I'm trying to use `TEDotProductAttentionMLA` too, and I found that the only difference between it and the original `TEDotProductAttention` is the definition of `kv_channels`. So I just manually changed `kv_channels` and kept using `TEDotProductAttention`. I'm not sure if this is right.
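
A minimal sketch of that workaround, for reference. It assumes the constructor signature of `TEDotProductAttention` and the MLA head-dimension field names (`qk_head_dim`, `qk_pos_emb_head_dim`) used below; it illustrates the idea of only redefining `kv_channels`, and is not the actual `TEDotProductAttentionMLA` implementation:

```python
import copy

from megatron.core.transformer.custom_layers.transformer_engine import (
    TEDotProductAttention,
)


class DotProductAttentionMLA(TEDotProductAttention):
    """Stock TE attention with only kv_channels redefined, per the workaround above."""

    def __init__(self, config, layer_number, attn_mask_type, attention_type, **kwargs):
        # MLA query/key heads use a different per-head width than
        # hidden_size // num_attention_heads, so override kv_channels before
        # delegating everything else to the unmodified TEDotProductAttention.
        config = copy.deepcopy(config)
        config.kv_channels = config.qk_head_dim + config.qk_pos_emb_head_dim  # assumed field names
        super().__init__(config, layer_number, attn_mask_type, attention_type, **kwargs)
```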

Jiayi-Pan commented 4 weeks ago

Hi, we've solved the issue. You can just update the git submodule to the latest version.

NiuMa-1234 commented 4 weeks ago

> Hi, we've solved the issue. You can just update the git submodule to the latest version.

Hi, I've tested the latest `TEDotProductAttentionMLA`, but I found that the training speed dropped a bit (from 5.9 tokens/s to 4.4 on an 8*8B model). Is this normal?

I used torch.profiler and found that the main difference in training time between the two versions comes from this function: `void transformer_engine::scaled_aligned_causal_masked_softmax_warp_forward<nv_bfloat16, __nv_bfloat16, float, 13>(nv_bfloat16, __nv_bfloat16 const, float, int, int, int)`
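
For anyone reproducing this comparison, a small self-contained torch.profiler sketch that surfaces per-kernel CUDA times like the softmax kernel above (the toy linear layer stands in for the real training step, which is an assumption here):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy stand-in for one training step; replace with the real forward/backward pass.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = model(x)
    y.sum().backward()

# Sorting by CUDA time surfaces the dominant kernels (e.g. the fused softmax above).
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```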