[BUGS] Pipeline Parallelism fails/hangs with Megatron Core example

Describe the bug When the provided example script is configured to use pipeline parallelism, two different behaviours are observed.

When tensor parallelism (tp) = 1 and pipeline parallelism (pp) = {2,4}, the script fails and CUDA device side assertions are triggered. -- Tested on 1 node with 2,4 GPUs.
When tp={2} and pp={2,4} on 1-2 nodes, the script hangs and does not return anything. NCCL_DEBUG does not throw any errors.

To Reproduce PP and TP are modified manually with arguments to initialize_distributed(). Script run as - srun python -u run_simple_mcore_train_loop.py

Expected behavior The example runs without throwing any errors.

Stack trace/logs Truncated stack trace for case 1:

...
WARNING:megatron.core.datasets.gpt_dataset:Unable to save the MockGPTDataset indexes because path_to_cache is None
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 1066985
INFO:megatron.core.datasets.gpt_dataset:> total number of epochs: 1
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [22,0,0], thread: [64,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [22,0,0], thread: [65,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [22,0,0], thread: [66,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [22,0,0], thread: [67,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
...

[rank0]: Traceback (most recent call last):
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/examples/run_simple_mcore_train_loop.py", line 143, in <module>
[rank0]:     losses_reduced = forward_backward_func(
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1271, in forward_backward_pipelining_without_interleaving
[rank0]:     output_tensor, num_tokens = forward_step(
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 206, in forward_step
[rank0]:     output_tensor, loss_func = forward_step_func(data_iterator, model)
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/examples/run_simple_mcore_train_loop.py", line 110, in forward_step_func
[rank0]:     output_tensor = model(tokens, position_ids, attention_mask,
[rank0]:   File "/scratch/sd/u/user/torch/env2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/scratch/sd/u/user/torch/env2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/megatron/core/models/gpt/gpt_model.py", line 175, in forward
[rank0]:     decoder_input = self.embedding(input_ids=input_ids, position_ids=position_ids)
[rank0]:   File "/scratch/sd/u/user/torch/env2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/scratch/sd/u/user/torch/env2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/megatron/core/models/common/embeddings/language_model_embedding.py", line 100, in forward
[rank0]:     word_embeddings = self.word_embeddings(input_ids)
[rank0]:   File "/scratch/sd/u/user/torch/env2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/scratch/sd/u/user/torch/env2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/megatron/core/tensor_parallel/layers.py", line 229, in forward
[rank0]:     output_parallel = self.weight[masked_input]
[rank0]: RuntimeError: CUDA error: device-side assert triggered
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Stderr for Case 2:

...
WARNING:megatron.core.datasets.gpt_dataset:Unable to save the MockGPTDataset indexes because path_to_cache is None
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 1066985
INFO:megatron.core.datasets.gpt_dataset:> total number of epochs: 1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

Environment (please complete the following information):

Megatron-LM commit ID: e33c8f7
PyTorch version: 2.5.0a0+git153362f
CUDA version: 12.2
NCCL version: 2.19.4

Proposed fix N/A

Additional context For the case where the hangs were observed, the script seems to run into a problem right around the pipeline warmup phase and before 1F1B schedule in forward_backward_pipelining_without_interleaving() in megatron/core/pipeline_parallel/schedules.py.

Any resolution would be greatly appreciated!

NVIDIA / Megatron-LM

[BUGS] Pipeline Parallelism fails/hangs with Megatron Core example #881