NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUGS] Pipeline Parallelism fails/hangs with Megatron Core example #881

Open schheda1 opened 1 week ago

schheda1 commented 1 week ago

Describe the bug
When the provided example script is configured to use pipeline parallelism, two different behaviours are observed.

  1. When tensor parallelism (TP) = 1 and pipeline parallelism (PP) = {2, 4}, the script fails with CUDA device-side assertions. -- Tested on 1 node with 2 or 4 GPUs.
  2. When TP = 2 and PP = {2, 4} on 1-2 nodes, the script hangs and never returns. NCCL_DEBUG does not report any errors.

To Reproduce
PP and TP are modified manually via the arguments to initialize_distributed(). The script is run as:

    srun python -u run_simple_mcore_train_loop.py
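
To make the configurations concrete, the change amounts to roughly the sketch below. This is my simplification of what the example's initialize_distributed() does, not a verbatim diff; the keyword names and the direct call into megatron.core.parallel_state are illustrative.

    # Rough equivalent of the modified setup (not a verbatim diff of the example).
    import os
    import torch
    from megatron.core import parallel_state

    def init_parallel(tp_size: int, pp_size: int) -> None:
        """Simplified stand-in for the example's initialize_distributed()."""
        rank = int(os.getenv("RANK", "0"))
        world_size = int(os.getenv("WORLD_SIZE", "1"))
        torch.cuda.set_device(rank % torch.cuda.device_count())
        # MASTER_ADDR / MASTER_PORT are assumed to be set by the launcher.
        torch.distributed.init_process_group(backend="nccl", init_method="env://",
                                             world_size=world_size, rank=rank)
        # Creates the TP/PP process groups used by the GPT model and the pipeline schedule.
        parallel_state.initialize_model_parallel(
            tensor_model_parallel_size=tp_size,
            pipeline_model_parallel_size=pp_size,
        )

    # Case 1 (device-side asserts): init_parallel(tp_size=1, pp_size=2)  # or pp_size=4
    # Case 2 (hang):                init_parallel(tp_size=2, pp_size=2)  # or pp_size=4, on 1-2 nodes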

Expected behavior
The example runs without throwing any errors.

Stack trace/logs
Truncated stack trace for case 1:

...
WARNING:megatron.core.datasets.gpt_dataset:Unable to save the MockGPTDataset indexes because path_to_cache is None
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 1066985
INFO:megatron.core.datasets.gpt_dataset:> total number of epochs: 1
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [22,0,0], thread: [64,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [22,0,0], thread: [65,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [22,0,0], thread: [66,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [22,0,0], thread: [67,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
...

[rank0]: Traceback (most recent call last):
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/examples/run_simple_mcore_train_loop.py", line 143, in <module>
[rank0]:     losses_reduced = forward_backward_func(
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1271, in forward_backward_pipelining_without_interleaving
[rank0]:     output_tensor, num_tokens = forward_step(
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 206, in forward_step
[rank0]:     output_tensor, loss_func = forward_step_func(data_iterator, model)
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/examples/run_simple_mcore_train_loop.py", line 110, in forward_step_func
[rank0]:     output_tensor = model(tokens, position_ids, attention_mask,
[rank0]:   File "/scratch/sd/u/user/torch/env2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/scratch/sd/u/user/torch/env2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/megatron/core/models/gpt/gpt_model.py", line 175, in forward
[rank0]:     decoder_input = self.embedding(input_ids=input_ids, position_ids=position_ids)
[rank0]:   File "/scratch/sd/u/user/torch/env2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/scratch/sd/u/user/torch/env2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/megatron/core/models/common/embeddings/language_model_embedding.py", line 100, in forward
[rank0]:     word_embeddings = self.word_embeddings(input_ids)
[rank0]:   File "/scratch/sd/u/user/torch/env2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/scratch/sd/u/user/torch/env2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/megatron/core/tensor_parallel/layers.py", line 229, in forward
[rank0]:     output_parallel = self.weight[masked_input]
[rank0]: RuntimeError: CUDA error: device-side assert triggered
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
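
The assert fires in the vocab-parallel embedding lookup (self.weight[masked_input]), so the token ids coming out of the data iterator seem to exceed what the embedding table can index on at least one rank. Running with CUDA_LAUNCH_BLOCKING=1 makes the failing kernel report synchronously; a minimal pre-forward check along these lines can confirm it. The tokens tensor and the vocab-size value are whatever the example passes into forward_step_func and the GPT config; the names here are illustrative only.

    import torch

    def check_token_range(tokens: torch.Tensor, vocab_size: int) -> None:
        """Hypothetical pre-forward check: flag token ids the embedding cannot index."""
        bad = (tokens < 0) | (tokens >= vocab_size)
        if bad.any():
            rank = torch.distributed.get_rank()
            print(f"[rank {rank}] {int(bad.sum())} token ids outside [0, {vocab_size}); "
                  f"min={int(tokens.min())}, max={int(tokens.max())}", flush=True)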



Stderr for Case 2:

...
WARNING:megatron.core.datasets.gpt_dataset:Unable to save the MockGPTDataset indexes because path_to_cache is None
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 1066985
INFO:megatron.core.datasets.gpt_dataset:> total number of epochs: 1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

Environment (please complete the following information):

Proposed fix
N/A

Additional context
For the cases where hangs were observed, the script appears to stall right around the pipeline warmup phase, before the 1F1B schedule in forward_backward_pipelining_without_interleaving() in megatron/core/pipeline_parallel/schedules.py.
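
One way to make the hang surface as an error instead of a silent stall is to shorten the process-group timeout and enable NCCL blocking/async error handling before initialization. A sketch is below; the env-var names are for recent PyTorch releases and may differ by version.

    # Make a hang in a collective / point-to-point call fail with a timeout instead.
    import os
    from datetime import timedelta
    import torch

    # Exact env-var names depend on the PyTorch version (older releases use
    # NCCL_BLOCKING_WAIT / NCCL_ASYNC_ERROR_HANDLING). Set before init_process_group().
    os.environ.setdefault("TORCH_NCCL_BLOCKING_WAIT", "1")
    os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")
    os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

    torch.distributed.init_process_group(
        backend="nccl",
        init_method="env://",
        timeout=timedelta(minutes=5),  # default is much longer; a short timeout surfaces the stuck rank
    )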

Any resolution would be greatly appreciated!

schheda1 commented 15 hours ago

No, srun was used to launch the example. Resources are allocated by sbatch at submission time and picked up by a modified initialize_distributed() in the example script. An excerpt for reference:

    # Map SLURM-provided process info onto torch.distributed's expectations.
    rank = int(os.getenv("SLURM_PROCID"))         # global rank across all nodes
    local_rank = int(os.getenv("SLURM_LOCALID"))  # rank within this node -> GPU index
    world_size = int(os.getenv("SLURM_NTASKS"))   # total number of processes
    address = os.getenv("SLURM_LAUNCH_NODE_IPADDR")
    port = "29500"
    os.environ["MASTER_ADDR"] = address
    os.environ["MASTER_PORT"] = port

    # Bind each process to its local GPU before creating the NCCL process group.
    torch.cuda.set_device(local_rank)
    torch.distributed.init_process_group(backend="nccl", init_method="env://",
                                         world_size=world_size, rank=rank)
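
For comparison, torchrun would provide the same information via RANK / LOCAL_RANK / WORLD_SIZE. A slightly more defensive version of the excerpt, which also checks that the allocation matches the requested TP x PP, would look roughly like this (a sketch, not the exact code in my script):

    # Sketch: resolve rank/world size from either SLURM (srun) or torchrun env vars,
    # then verify the allocation actually matches the requested parallelism.
    import os
    import torch

    rank = int(os.getenv("SLURM_PROCID", os.getenv("RANK", "0")))
    local_rank = int(os.getenv("SLURM_LOCALID", os.getenv("LOCAL_RANK", "0")))
    world_size = int(os.getenv("SLURM_NTASKS", os.getenv("WORLD_SIZE", "1")))

    tp, pp = 2, 2  # sizes under test
    assert world_size % (tp * pp) == 0, (
        f"world_size={world_size} is not divisible by tp*pp={tp * pp}; "
        "megatron.core.parallel_state.initialize_model_parallel() requires this"
    )

    torch.cuda.set_device(local_rank)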

The indexing errors have been fixed locally; however, the hangs remain even with TP=1, PP=2 on 1 node with 2 GPUs allocated to the job.