stu1130 opened this issue 9 months ago
Thanks Paul (@paul-gibbons) for the quick suggestion. I tried the latest NCCL 2.19.4 on NGC PyTorch 23.12 and 24.01 and can still reproduce the issue.
0: compute-st-p4de24xlarge-1:34639:36718 [5] bootstrap.cc:77 NCCL WARN Message truncated : received 256 bytes instead of 4
0: compute-st-p4de24xlarge-1:34639:36718 [5] NCCL INFO bootstrap.cc:554 -> 3
0: compute-st-p4de24xlarge-1:34639:36718 [5] NCCL INFO transport.cc:250 -> 3
0: compute-st-p4de24xlarge-1:34639:36718 [5] NCCL INFO group.cc:110 -> 3
0: compute-st-p4de24xlarge-1:34639:36718 [5] NCCL INFO group.cc:64 -> 3 [Async thread]
0: compute-st-p4de24xlarge-1:34639:34639 [5] NCCL INFO group.cc:418 -> 3
0: compute-st-p4de24xlarge-1:34639:34639 [5] NCCL INFO group.cc:95 -> 3
3: [rank24]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: [rank26]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: [rank25]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: [rank27]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: [rank29]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: compute-st-p4de24xlarge-4:34547:36693 [0] NCCL INFO Channel 00/1 : 1[0] -> 0[0] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34549:36694 [2] NCCL INFO Channel 00/1 : 1[2] -> 0[2] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34547:36693 [0] NCCL INFO Channel 01/1 : 1[0] -> 0[0] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34549:36694 [2] NCCL INFO Channel 01/1 : 1[2] -> 0[2] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34548:36695 [1] NCCL INFO Channel 00/1 : 1[1] -> 0[1] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34548:36695 [1] NCCL INFO Channel 01/1 : 1[1] -> 0[1] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34552:36697 [5] NCCL INFO Channel 00/1 : 1[5] -> 0[5] [send] via NET/AWS Libfabric/2/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34550:36696 [3] NCCL INFO Channel 00/1 : 1[3] -> 0[3] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34552:36697 [5] NCCL INFO Channel 01/1 : 1[5] -> 0[5] [send] via NET/AWS Libfabric/2/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34550:36696 [3] NCCL INFO Channel 01/1 : 1[3] -> 0[3] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
3: [rank30]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: compute-st-p4de24xlarge-4:34553:36698 [6] NCCL INFO Channel 00/1 : 1[6] -> 0[6] [send] via NET/AWS Libfabric/3/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34553:36698 [6] NCCL INFO Channel 01/1 : 1[6] -> 0[6] [send] via NET/AWS Libfabric/3/GDRDMA/Shared
3: [rank31]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: compute-st-p4de24xlarge-4:34554:36699 [7] NCCL INFO Channel 00/1 : 1[7] -> 0[7] [send] via NET/AWS Libfabric/3/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34554:36699 [7] NCCL INFO Channel 01/1 : 1[7] -> 0[7] [send] via NET/AWS Libfabric/3/GDRDMA/Shared
3: [rank28]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: compute-st-p4de24xlarge-4:34551:36700 [4] NCCL INFO Channel 00/1 : 1[4] -> 0[4] [send] via NET/AWS Libfabric/2/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34551:36700 [4] NCCL INFO Channel 01/1 : 1[4] -> 0[4] [send] via NET/AWS Libfabric/2/GDRDMA/Shared
0: Error executing job with overrides: ['trainer.devices=8', 'trainer.num_nodes=4', 'run.user=leecheng-nemo-local', 'run.name=aws-batch-leecheng-moe', 'exp_manager.create_mlflow_logger=True', 'exp_manager.mlflow_logger_kwargs.tracking_uri=https://prod.us-east-1.internal.mlflow.XXX.amazon.dev', 'model.num_layers=10', 'model.tensor_model_parallel_size=8', 'model.pipeline_model_parallel_size=2', 'trainer.max_steps=1000', 'model.global_batch_size=64', 'model.micro_batch_size=4', 'trainer.val_check_interval=1', 'model.moe_grouped_gemm=true', '+model.batch_p2p_comm=True']
0: Traceback (most recent call last):
0: File "/workspace/src/./src/main_pretrain_cp.py", line 34, in <module>
0: main()
0: File "/workspace/src/NeMo/nemo/core/config/hydra_runner.py", line 126, in wrapper
0: _run_hydra(
0: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
0: _run_app(
0: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
0: run_and_report(
0: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
0: raise ex
0: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
0: return func()
0: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
0: lambda: hydra.run(
0: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
0: _ = ret.return_value
0: File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
0: raise self._return_value
0: File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
0: ret.return_value = task_function(task_cfg)
0: File "/workspace/src/./src/main_pretrain_cp.py", line 30, in main
0: trainer.fit(model)
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
0: call._call_and_handle_interrupt(
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
0: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
0: return function(*args, **kwargs)
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
0: self._run(model, ckpt_path=ckpt_path)
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
0: results = self._run_stage()
0: return self.model(*args, **kwargs)
0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
0: return self._call_impl(*args, **kwargs)
0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
0: return forward_call(*args, **kwargs)
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/overrides/base.py", line 90, in forward
0: output = self._forward_module.training_step(*inputs, **kwargs)
0: File "/workspace/src/NeMo/nemo/utils/model_utils.py", line 381, in wrap_training_step
0: output_dict = wrapped(*args, **kwargs)
0: File "/workspace/src/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 670, in training_step
0: loss_mean = self.fwd_bwd_step(dataloader_iter, batch_idx, False)
0: File "/workspace/src/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 569, in fwd_bwd_step
0: losses_reduced_per_micro_batch = fwd_bwd_function(
0: File "/workspace/src/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1270, in forward_backward_pipelining_without_interleaving
0: output_tensor_grad = send_forward_recv_backward(
0: File "/workspace/src/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1068, in send_forward_recv_backward
0: output_tensor_grad = p2p_communication.send_forward_recv_backward(
0: File "/workspace/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 451, in send_forward_recv_backward
0: _, output_tensor_grad, _ = _communicate(
0: File "/workspace/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 329, in _communicate
0: reqs = p2p_func(
0: File "/workspace/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 162, in _batched_p2p_ops
0: reqs = torch.distributed.batch_isend_irecv(ops)
0: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1838, in batch_isend_irecv
0: with _coalescing_manager(group, device, async_ops=True) as cm:
0: File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
0: next(self.gen)
0: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1785, in _coalescing_manager
0: work = group._end_coalescing(device)
0: torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3616, internal error - please report this issue to the NCCL developers, NCCL version 2.19.4
0: ncclInternalError: Internal check failed.
0: Last error:
0: Message truncated : received 256 bytes instead of 4
The only times I have experienced that error were when different ranks picked up different NCCL versions. But I don't know how we could prove that is the case unless we logged the NCCL version from each rank.
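For what it's worth, one way to gather that evidence is to print the NCCL version from every rank at startup. A minimal sketch, assuming you can add a call early in the training script (the helper name and call site are mine, not part of NeMo or PyTorch Lightning):

import os
import socket
import torch
import torch.distributed as dist

def log_nccl_version():
    # torch.cuda.nccl.version() returns a (major, minor, patch) tuple on recent
    # PyTorch; printing it with the rank and hostname makes mismatches easy to spot.
    rank = dist.get_rank() if dist.is_initialized() else int(os.environ.get("RANK", "0"))
    print(
        f"[rank {rank}] host={socket.gethostname()} "
        f"NCCL {torch.cuda.nccl.version()} CUDA {torch.version.cuda} torch {torch.__version__}",
        flush=True,
    )

Setting NCCL_DEBUG=VERSION (or NCCL_DEBUG=INFO, as in the logs above) should also make NCCL itself print the version each process actually loaded.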
I'm using Ray and I ran into the same problem: different Ray remote actors printed out different NCCL and CUDA versions. I don't understand how that could happen.
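In case it helps others reproduce that check, this is roughly how the per-actor comparison can be done; a sketch only (the task name is mine, and it assumes one GPU per remote task):

import socket
import ray
import torch

@ray.remote(num_gpus=1)
def report_versions():
    # Runs in a separate worker process, so this reflects the libraries that
    # particular worker actually imports on its node.
    return {
        "host": socket.gethostname(),
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "nccl": torch.cuda.nccl.version(),
    }

if __name__ == "__main__":
    ray.init()
    print(ray.get([report_versions.remote() for _ in range(8)]))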
We have observed the NCCL initialization error with the PyTorch NGC 23.12 Docker image (NCCL 2.19.3 + CUDA 12.3) on AWS P4de (A100) instances. The error surfaces during NCCL initialization and happens intermittently (roughly 1 out of 10 runs). When we downgraded NCCL to 2.18.6, the issue was resolved.
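To confirm that a downgrade (or any pinned NCCL version) actually took effect in every process, one option is to check which libnccl shared object each rank has mapped. A Linux-only sketch, which matches the NGC containers used here; the helper name is mine:

def loaded_nccl_libraries():
    # Scan this process's memory maps for any mapped libnccl shared objects.
    paths = set()
    with open("/proc/self/maps") as maps:
        for line in maps:
            fields = line.split()
            if fields and "libnccl" in fields[-1]:
                paths.add(fields[-1])
    return sorted(paths)

Calling this from each rank and comparing the returned paths (and the versions embedded in the file names) across nodes would show whether every rank is really on 2.18.6.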
The detailed messages are listed as follows.
The key error message is
bootstrap.cc:77 NCCL WARN Message truncated : received 256 bytes instead of 4
Here are more details on the environment. The training job used NVIDIA NeMo r1.23.0 along with Megatron-LM core_r0.5.0. Both the dense model and the MoE model triggered the error, though MoE hit it much more often. The error was raised by the first collective call, batch_isend_irecv (see the sketch of the failing call pattern below). When we disabled batch send/recv, the error message changed. Let me know if you need more info regarding the environment and setup.
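For context on where it fails: Megatron-LM's _batched_p2p_ops builds a list of torch.distributed.P2POp send/recv operations and submits them with batch_isend_irecv, which PyTorch coalesces into a single NCCL group (the _coalescing_manager frame in the traceback). A simplified sketch of that pattern, with placeholder tensors and ranks rather than the real Megatron code:

import torch
import torch.distributed as dist

def send_forward_recv_backward_sketch(output_tensor, next_rank, group):
    # Send the forward activation to the next pipeline stage and receive the
    # gradient coming back from it, as one batched (coalesced) NCCL call.
    output_tensor_grad = torch.empty_like(output_tensor)
    ops = [
        dist.P2POp(dist.isend, output_tensor, next_rank, group),
        dist.P2POp(dist.irecv, output_tensor_grad, next_rank, group),
    ]
    reqs = dist.batch_isend_irecv(ops)  # the call that raised ncclInternalError above
    for req in reqs:
        req.wait()
    return output_tensor_grad

Disabling batch_p2p_comm makes Megatron issue individual isend/irecv calls instead of this batched path, which is presumably why the error signature changes when batch send/recv is turned off.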