NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Message truncated : received 256 bytes instead of 4 in bootstrap.cc #1176

Open stu1130 opened 9 months ago

stu1130 commented 9 months ago

We have observed an NCCL initialization error with the PyTorch NGC 23.12 Docker image (NCCL 2.19.3 + CUDA 12.3) on AWS P4de (A100) instances. The error surfaces during NCCL initialization and happens intermittently (roughly 1 out of 10 runs). When we downgraded the NCCL version to 2.18.6, the issue was resolved.

The detailed messages are as follows.

1: compute-st-p4de24xlarge-2:4827:6900 [4] bootstrap.cc:77 NCCL WARN Message truncated : received 256 bytes instead of 4
1: compute-st-p4de24xlarge-2:4827:6900 [4] NCCL INFO bootstrap.cc:554 -> 3
1: compute-st-p4de24xlarge-2:4827:6900 [4] NCCL INFO transport.cc:204 -> 3
1: compute-st-p4de24xlarge-2:4827:6900 [4] NCCL INFO group.cc:110 -> 3
1: compute-st-p4de24xlarge-2:4827:6900 [4] NCCL INFO group.cc:64 -> 3 [Async thread]
1: compute-st-p4de24xlarge-2:4827:4827 [4] NCCL INFO group.cc:418 -> 3
1: compute-st-p4de24xlarge-2:4827:4827 [4] NCCL INFO group.cc:95 -> 3
2: [rank18]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
2: compute-st-p4de24xlarge-3:4871:7029 [2] NCCL INFO Channel 00/1 : 1[2] -> 0[2] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
2: compute-st-p4de24xlarge-3:4871:7029 [2] NCCL INFO Channel 01/1 : 1[2] -> 0[2] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
2: [rank16]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
2: compute-st-p4de24xlarge-3:4869:7030 [0] NCCL INFO Channel 00/1 : 1[0] -> 0[0] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
2: compute-st-p4de24xlarge-3:4869:7030 [0] NCCL INFO Channel 01/1 : 1[0] -> 0[0] [send] via NET/AWS Libfabric/0/GDRDMA/Shared

1:     return forward_call(*args, **kwargs)
1:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/overrides/base.py", line 90, in forward
1:     output = self._forward_module.training_step(*inputs, **kwargs)
1:   File "/workspace/src/NeMo/nemo/utils/model_utils.py", line 381, in wrap_training_step
1:     output_dict = wrapped(*args, **kwargs)
1:   File "/workspace/src/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 670, in training_step
1:     loss_mean = self.fwd_bwd_step(dataloader_iter, batch_idx, False)
1:   File "/workspace/src/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 569, in fwd_bwd_step
1:     losses_reduced_per_micro_batch = fwd_bwd_function(
1:   File "/fsx-Training/XXX-training-fsx-prod-us-east-1/leecheng/stu1130_Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1272, in forward_backward_pipelining_without_interleaving
1:     output_tensor_grad = send_forward_recv_backward(
1:   File "/fsx-Training/XXX-training-fsx-prod-us-east-1/leecheng/stu1130_Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1068, in send_forward_recv_backward
1:     output_tensor_grad = p2p_communication.send_forward_recv_backward(
1:   File "/fsx-Training/XXX-training-fsx-prod-us-east-1/leecheng/stu1130_Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 451, in send_forward_recv_backward
1:     _, output_tensor_grad, _ = _communicate(
1:   File "/fsx-Training/XXX-training-fsx-prod-us-east-1/leecheng/stu1130_Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 329, in _communicate
1:     reqs = p2p_func(
1:   File "/fsx-Training/XXX-training-fsx-prod-us-east-1/leecheng/stu1130_Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 162, in _batched_p2p_ops
1:     reqs = torch.distributed.batch_isend_irecv(ops)
1:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1838, in batch_isend_irecv
1:     with _coalescing_manager(group, device, async_ops=True) as cm:
1:   File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
1:     next(self.gen)
1:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1785, in _coalescing_manager
1:     work = group._end_coalescing(device)
1: torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3616, internal error - please report this issue to the NCCL developers, NCCL version 2.19.3
1: ncclInternalError: Internal check failed.
1: Last error:
1: Message truncated : received 256 bytes instead of 4

The key error message is bootstrap.cc:77 NCCL WARN Message truncated : received 256 bytes instead of 4. Here are more details on the environment: the training job used NVIDIA NeMo r1.23.0 along with Megatron-LM core_r0.5.0. Both the dense model and the MoE model triggered the error, though MoE hit it much more often. The error was raised by the first collective call, batch_isend_irecv. When we disabled batch send/recv, the error message became

...
0:   File "/workspace/src/NeMo/nemo/utils/model_utils.py", line 381, in wrap_training_step
0:     output_dict = wrapped(*args, **kwargs)
0:   File "/workspace/src/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 670, in training_step
0:     loss_mean = self.fwd_bwd_step(dataloader_iter, batch_idx, False)
0:   File "/workspace/src/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 569, in fwd_bwd_step
0:     losses_reduced_per_micro_batch = fwd_bwd_function(
0:   File "/fsx-Training/XXX-training-fsx-prod-us-east-1/leecheng/stu1130_Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1272, in forward_backward_pipelining_without_interleaving
0:     output_tensor_grad = send_forward_recv_backward(
0:   File "/fsx-Training/XXX-training-fsx-prod-us-east-1/leecheng/stu1130_Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1068, in send_forward_recv_backward
0:     output_tensor_grad = p2p_communication.send_forward_recv_backward(
0:   File "/fsx-Training/XXX-training-fsx-prod-us-east-1/leecheng/stu1130_Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 451, in send_forward_recv_backward
0:     _, output_tensor_grad, _ = _communicate(
0:   File "/fsx-Training/XXX-training-fsx-prod-us-east-1/leecheng/stu1130_Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 329, in _communicate
0:     reqs = p2p_func(
0:   File "/fsx-Training/XXX-training-fsx-prod-us-east-1/leecheng/stu1130_Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 198, in _p2p_ops
0:     recv_next_req = torch.distributed.irecv(
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1606, in irecv
0:     return pg.recv([tensor], group_src_rank, tag)
0: RuntimeError: NCCL Error 3: internal error - please report this issue to the NCCL developers
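To make concrete what toggling batch send/recv switches between, here is a rough, illustrative sketch (not the actual Megatron-LM p2p_communication code) of the two paths: the batched path that goes through torch.distributed.batch_isend_irecv, and the unbatched path that issues individual isend/irecv calls. Shapes, ranks, and process-group handling are placeholders.

```python
# Illustrative sketch only -- placeholder shapes and ranks, not the real
# Megatron-LM p2p_communication code shown in the tracebacks above.
import torch
import torch.distributed as dist

def batched_exchange(send_tensor, recv_tensor, peer_rank):
    # Batched path (batch_p2p_comm=True): both ops are coalesced and
    # issued through a single batch_isend_irecv call.
    ops = [
        dist.P2POp(dist.isend, send_tensor, peer_rank),
        dist.P2POp(dist.irecv, recv_tensor, peer_rank),
    ]
    reqs = dist.batch_isend_irecv(ops)
    for req in reqs:
        req.wait()

def unbatched_exchange(send_tensor, recv_tensor, peer_rank):
    # Unbatched path (batch_p2p_comm=False): individual non-blocking
    # isend/irecv calls, waited on separately.
    send_req = dist.isend(send_tensor, peer_rank)
    recv_req = dist.irecv(recv_tensor, peer_rank)
    send_req.wait()
    recv_req.wait()
```

Either way the NCCL send/recv connections are set up lazily on first use, which is presumably why the failure surfaces in bootstrap.cc during the first pipeline-parallel exchange rather than at process-group creation.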

Let me know if you need more info regarding the environment and setup.

stu1130 commented 9 months ago

Thanks Paul (@paul-gibbons) for the quick suggestion. I tried the latest NCCL 2.19.4 on NGC PyTorch 23.12 and 24.01 and can still reproduce the issue.

0: compute-st-p4de24xlarge-1:34639:36718 [5] bootstrap.cc:77 NCCL WARN Message truncated : received 256 bytes instead of 4
0: compute-st-p4de24xlarge-1:34639:36718 [5] NCCL INFO bootstrap.cc:554 -> 3
0: compute-st-p4de24xlarge-1:34639:36718 [5] NCCL INFO transport.cc:250 -> 3
0: compute-st-p4de24xlarge-1:34639:36718 [5] NCCL INFO group.cc:110 -> 3
0: compute-st-p4de24xlarge-1:34639:36718 [5] NCCL INFO group.cc:64 -> 3 [Async thread]
0: compute-st-p4de24xlarge-1:34639:34639 [5] NCCL INFO group.cc:418 -> 3
0: compute-st-p4de24xlarge-1:34639:34639 [5] NCCL INFO group.cc:95 -> 3
3: [rank24]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: [rank26]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: [rank25]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: [rank27]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: [rank29]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: compute-st-p4de24xlarge-4:34547:36693 [0] NCCL INFO Channel 00/1 : 1[0] -> 0[0] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34549:36694 [2] NCCL INFO Channel 00/1 : 1[2] -> 0[2] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34547:36693 [0] NCCL INFO Channel 01/1 : 1[0] -> 0[0] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34549:36694 [2] NCCL INFO Channel 01/1 : 1[2] -> 0[2] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34548:36695 [1] NCCL INFO Channel 00/1 : 1[1] -> 0[1] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34548:36695 [1] NCCL INFO Channel 01/1 : 1[1] -> 0[1] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34552:36697 [5] NCCL INFO Channel 00/1 : 1[5] -> 0[5] [send] via NET/AWS Libfabric/2/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34550:36696 [3] NCCL INFO Channel 00/1 : 1[3] -> 0[3] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34552:36697 [5] NCCL INFO Channel 01/1 : 1[5] -> 0[5] [send] via NET/AWS Libfabric/2/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34550:36696 [3] NCCL INFO Channel 01/1 : 1[3] -> 0[3] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
3: [rank30]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: compute-st-p4de24xlarge-4:34553:36698 [6] NCCL INFO Channel 00/1 : 1[6] -> 0[6] [send] via NET/AWS Libfabric/3/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34553:36698 [6] NCCL INFO Channel 01/1 : 1[6] -> 0[6] [send] via NET/AWS Libfabric/3/GDRDMA/Shared
3: [rank31]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: compute-st-p4de24xlarge-4:34554:36699 [7] NCCL INFO Channel 00/1 : 1[7] -> 0[7] [send] via NET/AWS Libfabric/3/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34554:36699 [7] NCCL INFO Channel 01/1 : 1[7] -> 0[7] [send] via NET/AWS Libfabric/3/GDRDMA/Shared
3: [rank28]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: compute-st-p4de24xlarge-4:34551:36700 [4] NCCL INFO Channel 00/1 : 1[4] -> 0[4] [send] via NET/AWS Libfabric/2/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34551:36700 [4] NCCL INFO Channel 01/1 : 1[4] -> 0[4] [send] via NET/AWS Libfabric/2/GDRDMA/Shared
0: Error executing job with overrides: ['trainer.devices=8', 'trainer.num_nodes=4', 'run.user=leecheng-nemo-local', 'run.name=aws-batch-leecheng-moe', 'exp_manager.create_mlflow_logger=True', 'exp_manager.mlflow_logger_kwargs.tracking_uri=https://prod.us-east-1.internal.mlflow.XXX.amazon.dev', 'model.num_layers=10', 'model.tensor_model_parallel_size=8', 'model.pipeline_model_parallel_size=2', 'trainer.max_steps=1000', 'model.global_batch_size=64', 'model.micro_batch_size=4', 'trainer.val_check_interval=1', 'model.moe_grouped_gemm=true', '+model.batch_p2p_comm=True']
0: Traceback (most recent call last):
0:   File "/workspace/src/./src/main_pretrain_cp.py", line 34, in <module>
0:     main()
0:   File "/workspace/src/NeMo/nemo/core/config/hydra_runner.py", line 126, in wrapper
0:     _run_hydra(
0:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
0:     _run_app(
0:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
0:     run_and_report(
0:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
0:     raise ex
0:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
0:     return func()
0:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
0:     lambda: hydra.run(
0:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
0:     _ = ret.return_value
0:   File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
0:     raise self._return_value
0:   File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
0:     ret.return_value = task_function(task_cfg)
0:   File "/workspace/src/./src/main_pretrain_cp.py", line 30, in main
0:     trainer.fit(model)
0:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
0:     call._call_and_handle_interrupt(
0:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
0:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
0:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
0:     return function(*args, **kwargs)
0:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
0:     self._run(model, ckpt_path=ckpt_path)
0:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
0:     results = self._run_stage()
0:     return self.model(*args, **kwargs)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
0:     return self._call_impl(*args, **kwargs)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
0:     return forward_call(*args, **kwargs)
0:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/overrides/base.py", line 90, in forward
0:     output = self._forward_module.training_step(*inputs, **kwargs)
0:   File "/workspace/src/NeMo/nemo/utils/model_utils.py", line 381, in wrap_training_step
0:     output_dict = wrapped(*args, **kwargs)
0:   File "/workspace/src/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 670, in training_step
0:     loss_mean = self.fwd_bwd_step(dataloader_iter, batch_idx, False)
0:   File "/workspace/src/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 569, in fwd_bwd_step
0:     losses_reduced_per_micro_batch = fwd_bwd_function(
0:   File "/workspace/src/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1270, in forward_backward_pipelining_without_interleaving
0:     output_tensor_grad = send_forward_recv_backward(
0:   File "/workspace/src/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1068, in send_forward_recv_backward
0:     output_tensor_grad = p2p_communication.send_forward_recv_backward(
0:   File "/workspace/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 451, in send_forward_recv_backward
0:     _, output_tensor_grad, _ = _communicate(
0:   File "/workspace/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 329, in _communicate
0:     reqs = p2p_func(
0:   File "/workspace/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 162, in _batched_p2p_ops
0:     reqs = torch.distributed.batch_isend_irecv(ops)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1838, in batch_isend_irecv
0:     with _coalescing_manager(group, device, async_ops=True) as cm:
0:   File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
0:     next(self.gen)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1785, in _coalescing_manager
0:     work = group._end_coalescing(device)
0: torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3616, internal error - please report this issue to the NCCL developers, NCCL version 2.19.4
0: ncclInternalError: Internal check failed.
0: Last error:
0: Message truncated : received 256 bytes instead of 4
AddyLaddy commented 9 months ago

The only times I have experienced that error are when different versions of NCCL were picked up in each rank. But I don't know how we could prove that is the case unless we logged the NCCL version from each rank.
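One quick way to check that (a hedged diagnostic sketch, not something that was run in this thread) is to have every rank print the NCCL version its PyTorch actually loaded, e.g. via torch.cuda.nccl.version(); launching with NCCL_DEBUG=VERSION should similarly make each process log the NCCL build it picked up.

```python
# Diagnostic sketch (assumes a torchrun-style launcher that sets RANK):
# every rank prints the NCCL version its PyTorch actually loaded.
import os
import socket

import torch

rank = os.environ.get("RANK", "?")
# torch.cuda.nccl.version() reports the NCCL version in use; the return
# format (int vs. (major, minor, patch) tuple) varies by PyTorch release.
print(
    f"rank {rank} on {socket.gethostname()} (pid {os.getpid()}): "
    f"NCCL {torch.cuda.nccl.version()}",
    flush=True,
)
```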

babu111 commented 4 months ago

The only times I have experienced that error are when different versions of NCCL were picked up in each rank. But I don't know how we could prove that is the case unless we logged the NCCL version from each rank.

I'm using Ray and I experienced the same problem: different Ray remote actors printed different versions of NCCL+CUDA. But I don't understand how that could happen.
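In case it helps, here is a rough sketch (assuming a running Ray cluster; the function name and probe count are illustrative, not from this thread) of how one might confirm whether different actors really load different builds:

```python
# Rough diagnostic sketch, assuming a running Ray cluster; names and the
# number of probes below are illustrative.
import ray

@ray.remote(num_gpus=1)
def report_versions():
    import socket
    import torch
    return (
        socket.gethostname(),
        torch.__version__,
        torch.version.cuda,
        torch.cuda.nccl.version(),
    )

ray.init(address="auto")  # attach to the existing cluster
# Launch one probe per GPU worker you want to inspect (8 here as an example).
results = ray.get([report_versions.remote() for _ in range(8)])
for host, torch_ver, cuda_ver, nccl_ver in sorted(results):
    print(host, "torch", torch_ver, "cuda", cuda_ver, "nccl", nccl_ver)
```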