stu1130 opened this issue 5 months ago
Thanks Paul (@paul-gibbons) for the quick suggestion. I tried the latest NCCL 2.19.4 on NGC PyTorch 23.12 and 24.01 and can still reproduce the issue.
0: compute-st-p4de24xlarge-1:34639:36718 [5] bootstrap.cc:77 NCCL WARN Message truncated : received 256 bytes instead of 4
0: compute-st-p4de24xlarge-1:34639:36718 [5] NCCL INFO bootstrap.cc:554 -> 3
0: compute-st-p4de24xlarge-1:34639:36718 [5] NCCL INFO transport.cc:250 -> 3
0: compute-st-p4de24xlarge-1:34639:36718 [5] NCCL INFO group.cc:110 -> 3
0: compute-st-p4de24xlarge-1:34639:36718 [5] NCCL INFO group.cc:64 -> 3 [Async thread]
0: compute-st-p4de24xlarge-1:34639:34639 [5] NCCL INFO group.cc:418 -> 3
0: compute-st-p4de24xlarge-1:34639:34639 [5] NCCL INFO group.cc:95 -> 3
3: [rank24]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: [rank26]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: [rank25]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: [rank27]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: [rank29]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: compute-st-p4de24xlarge-4:34547:36693 [0] NCCL INFO Channel 00/1 : 1[0] -> 0[0] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34549:36694 [2] NCCL INFO Channel 00/1 : 1[2] -> 0[2] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34547:36693 [0] NCCL INFO Channel 01/1 : 1[0] -> 0[0] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34549:36694 [2] NCCL INFO Channel 01/1 : 1[2] -> 0[2] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34548:36695 [1] NCCL INFO Channel 00/1 : 1[1] -> 0[1] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34548:36695 [1] NCCL INFO Channel 01/1 : 1[1] -> 0[1] [send] via NET/AWS Libfabric/0/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34552:36697 [5] NCCL INFO Channel 00/1 : 1[5] -> 0[5] [send] via NET/AWS Libfabric/2/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34550:36696 [3] NCCL INFO Channel 00/1 : 1[3] -> 0[3] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34552:36697 [5] NCCL INFO Channel 01/1 : 1[5] -> 0[5] [send] via NET/AWS Libfabric/2/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34550:36696 [3] NCCL INFO Channel 01/1 : 1[3] -> 0[3] [send] via NET/AWS Libfabric/1/GDRDMA/Shared
3: [rank30]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: compute-st-p4de24xlarge-4:34553:36698 [6] NCCL INFO Channel 00/1 : 1[6] -> 0[6] [send] via NET/AWS Libfabric/3/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34553:36698 [6] NCCL INFO Channel 01/1 : 1[6] -> 0[6] [send] via NET/AWS Libfabric/3/GDRDMA/Shared
3: [rank31]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: compute-st-p4de24xlarge-4:34554:36699 [7] NCCL INFO Channel 00/1 : 1[7] -> 0[7] [send] via NET/AWS Libfabric/3/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34554:36699 [7] NCCL INFO Channel 01/1 : 1[7] -> 0[7] [send] via NET/AWS Libfabric/3/GDRDMA/Shared
3: [rank28]:[W ProcessGroupNCCL.cpp:2302] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
3: compute-st-p4de24xlarge-4:34551:36700 [4] NCCL INFO Channel 00/1 : 1[4] -> 0[4] [send] via NET/AWS Libfabric/2/GDRDMA/Shared
3: compute-st-p4de24xlarge-4:34551:36700 [4] NCCL INFO Channel 01/1 : 1[4] -> 0[4] [send] via NET/AWS Libfabric/2/GDRDMA/Shared
0: Error executing job with overrides: ['trainer.devices=8', 'trainer.num_nodes=4', 'run.user=leecheng-nemo-local', 'run.name=aws-batch-leecheng-moe', 'exp_manager.create_mlflow_logger=True', 'exp_manager.mlflow_logger_kwargs.tracking_uri=https://prod.us-east-1.internal.mlflow.XXX.amazon.dev', 'model.num_layers=10', 'model.tensor_model_parallel_size=8', 'model.pipeline_model_parallel_size=2', 'trainer.max_steps=1000', 'model.global_batch_size=64', 'model.micro_batch_size=4', 'trainer.val_check_interval=1', 'model.moe_grouped_gemm=true', '+model.batch_p2p_comm=True']
0: Traceback (most recent call last):
0: File "/workspace/src/./src/main_pretrain_cp.py", line 34, in <module>
0: main()
0: File "/workspace/src/NeMo/nemo/core/config/hydra_runner.py", line 126, in wrapper
0: _run_hydra(
0: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
0: _run_app(
0: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
0: run_and_report(
0: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
0: raise ex
0: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
0: return func()
0: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
0: lambda: hydra.run(
0: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
0: _ = ret.return_value
0: File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
0: raise self._return_value
0: File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
0: ret.return_value = task_function(task_cfg)
0: File "/workspace/src/./src/main_pretrain_cp.py", line 30, in main
0: trainer.fit(model)
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
0: call._call_and_handle_interrupt(
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
0: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
0: return function(*args, **kwargs)
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
0: self._run(model, ckpt_path=ckpt_path)
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
0: results = self._run_stage()
0: return self.model(*args, **kwargs)
0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
0: return self._call_impl(*args, **kwargs)
0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
0: return forward_call(*args, **kwargs)
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/overrides/base.py", line 90, in forward
0: output = self._forward_module.training_step(*inputs, **kwargs)
0: File "/workspace/src/NeMo/nemo/utils/model_utils.py", line 381, in wrap_training_step
0: output_dict = wrapped(*args, **kwargs)
0: File "/workspace/src/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 670, in training_step
0: loss_mean = self.fwd_bwd_step(dataloader_iter, batch_idx, False)
0: File "/workspace/src/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 569, in fwd_bwd_step
0: losses_reduced_per_micro_batch = fwd_bwd_function(
0: File "/workspace/src/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1270, in forward_backward_pipelining_without_interleaving
0: output_tensor_grad = send_forward_recv_backward(
0: File "/workspace/src/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1068, in send_forward_recv_backward
0: output_tensor_grad = p2p_communication.send_forward_recv_backward(
0: File "/workspace/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 451, in send_forward_recv_backward
0: _, output_tensor_grad, _ = _communicate(
0: File "/workspace/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 329, in _communicate
0: reqs = p2p_func(
0: File "/workspace/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 162, in _batched_p2p_ops
0: reqs = torch.distributed.batch_isend_irecv(ops)
0: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1838, in batch_isend_irecv
0: with _coalescing_manager(group, device, async_ops=True) as cm:
0: File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
0: next(self.gen)
0: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1785, in _coalescing_manager
0: work = group._end_coalescing(device)
0: torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3616, internal error - please report this issue to the NCCL developers, NCCL version 2.19.4
0: ncclInternalError: Internal check failed.
0: Last error:
0: Message truncated : received 256 bytes instead of 4
The only times I have experienced that error are when different versions of NCCL were picked up in each rank. But I don't know how we could prove that is the case unless we logged the NCCL version from each rank.
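For reference, a minimal sketch of that kind of per-rank version logging (the helper name and placement are illustrative, not from the actual training script): each rank prints the NCCL version PyTorch reports plus any libnccl shared objects mapped into the process, so it does not depend on a collective that may itself fail during init.

```python
# Illustrative helper (not part of the repro): print, from every rank, the NCCL
# version PyTorch reports and which libnccl.so files are actually mapped, so a
# version mismatch across nodes/ranks shows up directly in the job logs.
import os
import socket
import torch

def log_nccl_version(rank: int) -> None:
    ver = ".".join(map(str, torch.cuda.nccl.version()))
    # If NCCL is dynamically linked, /proc/self/maps shows the loaded .so path;
    # a statically linked build simply yields an empty list here.
    with open("/proc/self/maps") as maps:
        libs = sorted({line.split()[-1] for line in maps if "libnccl" in line})
    print(
        f"[rank {rank}] host={socket.gethostname()} pid={os.getpid()} "
        f"NCCL={ver} libs={libs}",
        flush=True,
    )
```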
We have observed this NCCL initialization error with the PyTorch NGC 23.12 docker image (NCCL 2.19.3 + CUDA 12.3) on AWS P4DE (A100). The error surfaces during NCCL initialization and happens intermittently (roughly 1 out of 10 runs). When we downgraded NCCL to 2.18.6, the issue was resolved.
The detailed messages are listed as follows.
The key error message is:
bootstrap.cc:77 NCCL WARN Message truncated : received 256 bytes instead of 4
Here are more details on the environment. The training job used NVIDIA NeMo r1.23.0 along with Megatron-LM core_r0.5.0. Both the dense model and the MoE model triggered the error, though we found MoE hit it much more often. The error was raised by the first collective call, batch_isend_irecv. When we disabled batch send/recv, the error message changed. Let me know if you need more info regarding the environment and setup.
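For context on where this lands: the call site in Megatron-LM's _batched_p2p_ops reduces to torch.distributed.batch_isend_irecv over paired send/recv ops between adjacent pipeline stages. A stripped-down sketch of that pattern (peer rank, tensor shape, and function name are illustrative):

```python
# Stripped-down sketch of the coalesced point-to-point exchange that fails in
# the traceback above; peer rank, shape, and device are illustrative placeholders.
import torch
import torch.distributed as dist

def send_forward_recv_backward(output_tensor, next_rank, grad_shape, device):
    output_tensor_grad = torch.empty(grad_shape, device=device)
    ops = [
        dist.P2POp(dist.isend, output_tensor, next_rank),
        dist.P2POp(dist.irecv, output_tensor_grad, next_rank),
    ]
    # With the NCCL backend, this coalesced path (batch_isend_irecv ->
    # _coalescing_manager -> group._end_coalescing) is where the
    # "Message truncated" error is raised in the traceback above.
    reqs = dist.batch_isend_irecv(ops)
    for req in reqs:
        req.wait()
    return output_tensor_grad
```

Disabling batch_p2p_comm makes Megatron-LM issue the sends and receives as individual isend/irecv calls instead of this coalesced batch, which is consistent with the error message changing after the switch.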