Open 1049451037 opened 3 months ago
Thank you for letting us know! We have a fix, but it's not yet merged. A temporary workaround (WAR) is to replace `tensor_model_parallel_size * context_parallel_size` with just `tensor_model_parallel_size`.
This issue should have been resolved by https://github.com/NVIDIA/Megatron-LM/commit/b5aba3a2f3165da8b4f6b483bf3a6da2a24718e4
Marking as stale. No activity in 60 days.
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/parallel_state.py#L503
As shown in the code: the length of `[start_rank, end_rank)` is `tp*ep`, but the `for` loop over `k` iterates `tp*cp` times. If `cp > ep`, this makes `ranks` empty, which causes the error.
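A minimal sketch of the failure mode described above (this is not the actual Megatron-LM code; the function name and its simplified loop structure are assumptions for illustration). Each window `[start_rank, end_rank)` holds `tp*ep` ranks, but `k` sweeps `tp*cp` offsets, so when `cp > ep` some offsets land past `end_rank` and produce an empty `ranks` list:

```python
def build_groups(tp, cp, ep, world_size):
    """Hypothetical simplification of the rank-grouping loop.

    Bug: the window width is tp*ep, but k runs over tp*cp offsets,
    so for cp > ep the offsets k >= tp*ep yield empty rank lists.
    """
    all_groups = []
    window = tp * ep                          # width of each rank window
    for start_rank in range(0, world_size, window):
        end_rank = start_rank + window        # [start_rank, end_rank) has tp*ep ranks
        for k in range(tp * cp):              # buggy bound: iterates tp*cp times
            ranks = list(range(start_rank + k, end_rank, tp * cp))
            all_groups.append(ranks)
    return all_groups

# With cp=2, ep=1, tp=2 the window holds 2 ranks but k sweeps 4 offsets,
# so offsets k=2,3 fall outside the window and yield empty groups —
# passing an empty list to torch.distributed.new_group then errors out.
groups = build_groups(tp=2, cp=2, ep=1, world_size=8)
print(any(len(g) == 0 for g in groups))  # True: empty groups appear when cp > ep
```

With `cp <= ep` (e.g. `cp=1, ep=2`) every offset stays inside the window and no empty group is produced, which is why the bug only surfaces when `cp > ep`.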