Emibobo opened 3 months ago
I encountered the same issue as you. May I ask if your problem has been resolved? I look forward to your reply.
ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives.
One of the processes got stuck while waiting on a collective operation (all_gather etc.). It's likely a problem on your side rather than with the code in this repo. It's also hard to debug without seeing the code and the Python environment.
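For what it's worth, the most common cause of this kind of NCCL watchdog abort is a rank that never reaches a collective the other ranks are already waiting in. A minimal sketch of the pattern to check for in your training loop (the function names here are illustrative, not taken from this repo):

```python
import torch
import torch.distributed as dist


def gather_features(features: torch.Tensor) -> torch.Tensor:
    """all_gather wrapper; every rank has to call this together."""
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(features) for _ in range(world_size)]
    dist.all_gather(gathered, features)  # blocks until all ranks join the collective
    return torch.cat(gathered, dim=0)


def train_step(features: torch.Tensor) -> torch.Tensor:
    # Common cause of the hang above: a rank-dependent or data-dependent
    # branch around a collective, e.g.
    #     if dist.get_rank() == 0:
    #         gathered = gather_features(features)
    # Rank 0 enters the collective, the other ranks never do, and the NCCL
    # watchdog eventually aborts every process with SIGABRT, as in the log below.
    #
    # Correct pattern: every rank takes the same path through every collective.
    return gather_features(features)
```

If the divergence is data-dependent (e.g. one rank runs out of batches earlier than the others), making sure every rank does the same number of iterations per epoch usually resolves it.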
Troubleshooting tips (a short sketch of the environment variables follows this list):
- Check that every rank executes the same collectives, in the same order and the same number of times; a rank that leaves the training loop early (e.g. uneven dataset shards) or skips a branch containing an all_gather leaves the other ranks blocked.
- Rerun with NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL to see which collective each rank is stuck in.
- As the error message itself suggests, you can raise TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC or set TORCH_NCCL_ENABLE_MONITORING=0, but if a rank is genuinely stuck this only postpones or hides the abort.
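A minimal sketch of setting these variables from Python, assuming the script is launched with torchrun / torch.distributed.launch so the rendezvous variables are already in the environment (the same variables can equally be exported in the shell; the timeout value is only an example):

```python
import os

# Must be set before torch.distributed / NCCL is initialized.
os.environ.setdefault("NCCL_DEBUG", "INFO")                 # per-rank NCCL log output
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra collective consistency checks

# Only if you suspect the watchdog is a false positive (the log below aborted after 600 s):
os.environ.setdefault("TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC", "1800")
# os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"          # disables the heartbeat monitor entirely

import torch.distributed as dist

dist.init_process_group(backend="nccl")
```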
Has this been resolved?
[rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank1]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 1] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank1]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 1] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 1] [PG 0 Rank 1] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank3]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 3] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank3]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 3] ProcessGroupNCCL preparing to dump debug info.
[rank3]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 3] [PG 0 Rank 3] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
W0725 08:32:26.101000 140686322804544 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 651734 closing signal SIGTERM
/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 32 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 32 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
E0725 08:32:28.020000 140686322804544 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 651732) of binary: /home/zhuxiaobo/anaconda3/envs/my_pytorch_env/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launch.py", line 198, in
main()
File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launch.py", line 194, in main
launch(args)
File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launch.py", line 179, in launch
run(args)
File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/data/zxb/DINO_FL/train/train_stage.py FAILED
Failures:
[1]:
  time      : 2024-07-25_08:32:26
  host      : user
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 651733)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 651733
[2]:
  time      : 2024-07-25_08:32:26
  host      : user
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 651735)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 651735
Root Cause (first observed failure):
[0]:
  time      : 2024-07-25_08:32:26
  host      : user
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 651732)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 651732