Describe the bug
I am using the train_gpt3_175b_distributed.sh script to launch training on a single node with 4 A100 80GB GPUs. Training runs fine with tensor parallelism or pipeline parallelism, but fails as soon as I enable context parallelism. The following is my script:
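Only the parallelism section matters here; a minimal sketch of it follows, assuming the stock examples/gpt3/train_gpt3_175b_distributed.sh layout (the array names come from that script, and the sizes are illustrative; the log below shows every rank at tensor/pipeline rank (0, 0), i.e. TP = PP = 1).

```bash
# Parallelism flags (sketch). The stock script sets only the TP/PP sizes;
# --context-parallel-size is the addition that triggers the failure.
MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 1
    --pipeline-model-parallel-size 1
    --context-parallel-size 2   # illustrative; any value > 1 enables context parallelism
)

torchrun --nproc_per_node=4 --nnodes=1 pretrain_gpt.py \
    "${GPT_MODEL_ARGS[@]}" \
    "${TRAINING_ARGS[@]}" \
    "${MODEL_PARALLEL_ARGS[@]}" \
    "${DATA_ARGS[@]}" \
    "${EVAL_AND_LOGGING_ARGS[@]}"
```

The output log is: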
building GPT model ...
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 1718685696
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 1718685696
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 1718685696
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 1718685696
WARNING: could not find the metadata file /temp/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
/pytorch/torch/distributed/c10d_logger.py:83: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
return func(*args, **kwargs)
/pytorch/torch/distributed/c10d_logger.py:83: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
return func(*args, **kwargs)
/pytorch/torch/distributed/c10d_logger.py:83: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
return func(*args, **kwargs)
/pytorch/torch/distributed/c10d_logger.py:83: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
return func(*args, **kwargs)
(min, max) time across ranks (ms):
load-checkpoint ................................: (0.73, 0.78)
[after model, optimizer, and learning rate scheduler are built] datetime: 2024-09-19 15:56:43
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 500000
validation: 5010
test: 10
> building train, validation, and test datasets for GPT ...
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2024-09-19 15:56:43
done with setup ...
training ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (300.04, 309.25)
train/valid/test-data-iterators-setup ..........: (288.56, 334.41)
[before the start of training step] datetime: 2024-09-19 15:56:43
WARNING:megatron.core.utils:NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[rank2]: Traceback (most recent call last):
[rank2]: File "/Megatron-LM/pretrain_gpt.py", line 264, in <module>
[rank2]: pretrain(
[rank2]: File "/Megatron-LM/megatron/training/training.py", line 355, in pretrain
[rank2]: iteration, num_floating_point_operations_so_far = train(
[rank2]: File "/Megatron-LM/megatron/training/training.py", line 1234, in train
[rank2]: train_step(forward_step_func,
[rank2]: File "/Megatron-LM/megatron/training/training.py", line 718, in train_step
[rank2]: losses_reduced = forward_backward_func(
[rank2]: File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 468, in forward_backward_no_pipelining
[rank2]: output_tensor, num_tokens = forward_step(
[rank2]: File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 273, in forward_step
[rank2]: output_tensor, loss_func = forward_step_func(data_iterator, model)
[rank2]: File "/Megatron-LM/pretrain_gpt.py", line 192, in forward_step
[rank2]: output_tensor = model(tokens, position_ids, attention_mask,
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py", line 305, in forward
[rank2]: return self.module(*inputs, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/Megatron-LM/megatron/legacy/model/module.py", line 189, in forward
[rank2]: outputs = self.module(*inputs, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/Megatron-LM/megatron/core/models/gpt/gpt_model.py", line 217, in forward
[rank2]: hidden_states = self.decoder(
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/Megatron-LM/megatron/core/transformer/transformer_block.py", line 496, in forward
[rank2]: hidden_states, context = layer(
[rank2]: File "/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 377, in __call__
[rank2]: return super(MegatronModule, self).__call__(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 281, in forward
[rank2]: attention_output_with_bias = self.self_attention(
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/Megatron-LM/megatron/core/transformer/attention.py", line 291, in forward
[rank2]: core_attn_out = self._checkpointed_attention_forward(
[rank2]: File "/Megatron-LM/megatron/core/transformer/attention.py", line 143, in _checkpointed_attention_forward
[rank2]: hidden_states = tensor_parallel.checkpoint(
[rank2]: File "/Megatron-LM/megatron/core/tensor_parallel/random.py", line 308, in checkpoint
[rank2]: return CheckpointFunction.apply(function, distribute_saved_activations, *args)
[rank2]: File "/pytorch/torch/autograd/function.py", line 575, in apply
[rank2]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank2]: File "/Megatron-LM/megatron/core/tensor_parallel/random.py", line 247, in forward
[rank2]: outputs = run_function(*args)
[rank2]: File "/Megatron-LM/megatron/core/transformer/attention.py", line 130, in custom_forward
[rank2]: output_ = self.core_attention(
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/Megatron-LM/megatron/core/extensions/transformer_engine.py", line 589, in forward
[rank2]: core_attn_out = super().forward(
[rank2]: File "/miniconda3/envs/megatron_lm/lib/python3.10/site-packages/transformer_engine/pytorch/attention.py", line 6854, in forward
[rank2]: return self.flash_attention(
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/miniconda3/envs/megatron_lm/lib/python3.10/site-packages/transformer_engine/pytorch/attention.py", line 4328, in forward
[rank2]: output = attn_forward_func_with_cp(
[rank2]: File "/miniconda3/envs/megatron_lm/lib/python3.10/site-packages/transformer_engine/pytorch/attention.py", line 3379, in attn_forward_func_with_cp
[rank2]: out = AttnFuncWithCPAndKVP2P.apply(
[rank2]: File "/pytorch/torch/autograd/function.py", line 575, in apply
[rank2]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank2]: File "/miniconda3/envs/megatron_lm/lib/python3.10/site-packages/transformer_engine/pytorch/attention.py", line 1473, in forward
[rank2]: send_recv_reqs[i % 2] = flash_attn_p2p_communicate(
[rank2]: File "/miniconda3/envs/megatron_lm/lib/python3.10/site-packages/transformer_engine/pytorch/attention.py", line 1249, in flash_attn_p2p_communicate
[rank2]: send_op = torch.distributed.isend(send_tensor, send_dst, cp_group)
[rank2]: File "/pytorch/torch/distributed/distributed_c10d.py", line 2071, in isend
[rank2]: return pg.send([tensor], dst, tag)
[rank2]: RuntimeError: NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
To Reproduce
Launch the stock examples/gpt3/train_gpt3_175b_distributed.sh on a single node with 4 A100 80GB GPUs, with tensor and pipeline parallelism left at size 1 and --context-parallel-size set above 1. The run crashes with NCCL Error 5 at the first training step. With context parallelism disabled (using tensor or pipeline parallelism instead), the same script trains normally.
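An invocation sketch, assuming the stock script's positional arguments (checkpoint dir, tensorboard dir, vocab file, merge file, data path); all paths are placeholders:

```bash
# NCCL_DEBUG=WARN surfaces the details behind "invalid usage", as the error message suggests.
export NCCL_DEBUG=WARN
bash examples/gpt3/train_gpt3_175b_distributed.sh \
    /path/to/checkpoints \
    /path/to/tensorboard-logs \
    /path/to/gpt2-vocab.json \
    /path/to/gpt2-merges.txt \
    /path/to/my-corpus_text_document
```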
Expected behavior
Training with context parallelism enabled proceeds through training steps without NCCL errors, just as it does with tensor or pipeline parallelism.
Stack trace/logs
See the output log and the rank 2 traceback above. The error is raised from torch.distributed.isend inside Transformer Engine's context-parallel attention path (flash_attn_p2p_communicate, called from AttnFuncWithCPAndKVP2P).
Environment
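From the log and paths above: single node with 4x NVIDIA A100 80GB; Megatron-LM at /Megatron-LM; PyTorch at /pytorch; Transformer Engine in a Python 3.10 conda environment (megatron_lm). Exact versions can be collected with something like:

```bash
# Version probes, assuming the megatron_lm conda env is active.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version())"
pip show transformer_engine | head -n 2   # distribution name may differ by install method
git -C /Megatron-LM rev-parse HEAD
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
```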