Status: Open. 1049451037 opened this issue 2 weeks ago.
Can you provide more information or a minimal reproducer?
This error suggests that the tensor-parallel group has not been properly configured. If you are using one of Megatron-LM's TE wrappers, the TP group must either be initialized before creating the layer (with megatron.core.parallel_state.initialize_model_parallel) or registered after creating the layer (with TransformerEngineBaseModule.set_tensor_parallel_group; see this Megatron-LM comment).
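To make the two options concrete, here is a minimal sketch. The commented Option 1 uses the real megatron.core.parallel_state.initialize_model_parallel entry point; the register_tp_group helper in Option 2 is a hypothetical convenience function (not part of either library) that walks a model and calls set_tensor_parallel_group on every module that exposes it:

```python
# Option 1 (sketch, assumes Megatron-LM is installed and
# torch.distributed is already initialized): set up model
# parallelism BEFORE constructing the model, so TE layers
# pick up the TP group at construction time.
#
#   from megatron.core import parallel_state
#   parallel_state.initialize_model_parallel(tensor_model_parallel_size=2)
#   model = build_model()  # build_model is your own model constructor

# Option 2: register the group AFTER construction.
# register_tp_group is a hypothetical helper, not a library API.
def register_tp_group(model, tp_group):
    """Call set_tensor_parallel_group on every submodule that has it
    (TE modules derived from TransformerEngineBaseModule expose it)."""
    for module in model.modules():
        if hasattr(module, "set_tensor_parallel_group"):
            module.set_tensor_parallel_group(tp_group)
```

If TEDotProductAttention never receives the group through either path, any collective it issues will run against an uninitialized TP group, which matches the error described here.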
It seems that TEDotProductAttention doesn't call the set_tensor_parallel_group function.
It's still not working after updating Megatron-LM to the main branch of TE.