Open dreasysnail opened 1 month ago
@Jiayi-Pan
Hi, have you solved the problem? I'm trying to use TEDotProductAttentionMLA too, and I found that the only difference between it and the original TEDotProductAttention is the definition of kv_channels. So I just manually changed kv_channels and kept using TEDotProductAttention, roughly as in the sketch below. I'm not sure if this is right.
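For reference, the change looks something like this (a minimal, untested sketch: build_mla_core_attention is a hypothetical helper, and the qk_head_dim / qk_pos_emb_head_dim field names as well as the TEDotProductAttention constructor arguments are assumptions that may differ between Megatron-Core versions):

```python
# Sketch of the workaround: keep TEDotProductAttention but override kv_channels
# so the query/key head dimension matches MLA (qk_head_dim + qk_pos_emb_head_dim).
import dataclasses

from megatron.core.transformer.custom_layers.transformer_engine import TEDotProductAttention
from megatron.core.transformer.enums import AttnMaskType


def build_mla_core_attention(config, layer_number):
    # MLA queries/keys use a larger head dim than hidden_size // num_attention_heads,
    # so patch kv_channels before handing the config to TEDotProductAttention.
    mla_config = dataclasses.replace(
        config,
        kv_channels=config.qk_head_dim + config.qk_pos_emb_head_dim,
    )
    return TEDotProductAttention(
        config=mla_config,
        layer_number=layer_number,
        attn_mask_type=AttnMaskType.causal,
        attention_type="self",
    )
```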
Hi, we've solved the issue. You can just update the git submodule to the latest version.
Hi, I've tested the latest TEDotProductAttentionMLA, but I found that the training speed dropped a bit (from 5.9 tokens/s to 4.4 on an 8*8B model). Is this normal?
I used torch.profiler and found that the main difference in training time between these two versions comes from this function: void transformer_engine::scaled_aligned_causal_masked_softmax_warp_forward<__nv_bfloat16, __nv_bfloat16, float, 13>(__nv_bfloat16*, __nv_bfloat16 const*, float, int, int, int)
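Roughly, the profiling setup looked like this (a minimal sketch; batches and train_step are placeholders for the actual Megatron training loop):

```python
# Minimal torch.profiler setup used to compare kernel-level timings between the
# two attention implementations.
from torch.profiler import ProfilerActivity, profile, schedule

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    record_shapes=True,
) as prof:
    for batch in batches:    # placeholder: iterator over training batches
        train_step(batch)    # placeholder: one forward/backward/optimizer step
        prof.step()

# Sort by total CUDA time to surface hot kernels such as the
# scaled_aligned_causal_masked_softmax_warp_forward kernel above.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```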
Thank you for the great project! When I run
examples/deepseek_v2/run_mcore_deepseek.sh
I got an error as below: it appears that in this link the code is attempting to import 'TEDotProductAttentionMLA', but when I checked the
megatron.core.transformer.custom_layers.transformer_engine
file, I did not find 'TEDotProductAttentionMLA'. Any help appreciated!
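In case it helps anyone hitting the same ImportError before updating the submodule, here is a quick availability check (a hedged sketch; the fallback simply reuses the plain TEDotProductAttention, per the workaround discussed above):

```python
# Check whether the MLA variant exists in the current Megatron-Core checkout;
# if not, fall back to the base class (or update the git submodule as suggested).
try:
    from megatron.core.transformer.custom_layers.transformer_engine import (
        TEDotProductAttentionMLA as CoreAttention,
    )
except ImportError:
    from megatron.core.transformer.custom_layers.transformer_engine import (
        TEDotProductAttention as CoreAttention,
    )

print("Using core attention class:", CoreAttention.__name__)
```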