facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

torchrun --nproc_per_node=4 /data/zxb/DINO_FL/train/train.py #447

Open Emibobo opened 3 months ago

Emibobo commented 3 months ago

[rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank1]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 1] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank1]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 1] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 1] [PG 0 Rank 1] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank3]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 3] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank3]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 3] ProcessGroupNCCL preparing to dump debug info.
[rank3]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 3] [PG 0 Rank 3] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
W0725 08:32:26.101000 140686322804544 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 651734 closing signal SIGTERM
/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 32 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 32 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
E0725 08:32:28.020000 140686322804544 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 651732) of binary: /home/zhuxiaobo/anaconda3/envs/my_pytorch_env/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/data/zxb/DINO_FL/train/train_stage.py FAILED

Failures:
[1]:
  time      : 2024-07-25_08:32:26
  host      : user
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 651733)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 651733
[2]:
  time      : 2024-07-25_08:32:26
  host      : user
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 651735)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 651735

Root Cause (first observed failure):
[0]:
  time      : 2024-07-25_08:32:26
  host      : user
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 651732)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 651732
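The fatal message above names two environment variables, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC and TORCH_NCCL_ENABLE_MONITORING. A minimal sketch of one way to apply them is below (the 1800 s value is only an example; the variables must be set before the NCCL process group is created, or equivalently exported in the shell before torchrun). Note that this only relaxes the watchdog; it does not address the underlying hang.

```python
# Sketch: relax the NCCL watchdog as suggested by the error message above.
# Must run before torch.distributed.init_process_group(backend="nccl"),
# or be exported in the shell before launching torchrun.
import os

# Give the watchdog 30 minutes instead of the 600 s seen in the log (example value).
os.environ["TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC"] = "1800"

# Or disable the heartbeat monitor entirely, as the message suggests.
# os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"

import torch.distributed as dist

dist.init_process_group(backend="nccl")
```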

SHAREN111 commented 3 months ago

I encountered the same issue as you. May I ask if your problem has been resolved? I look forward to your reply.

baldassarreFe commented 3 months ago

ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives.

One of the processes got stuck while waiting on a collective operation (all_gather etc.). It's likely a problem on your side rather than with the code in this repo, and it's hard to debug without seeing your code and Python environment.
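A generic first step that does not require the code (and is not specific to this repository) is to turn on NCCL's and torch.distributed's own debug output, which usually shows which rank stalls and on which collective. A minimal sketch, assuming the variables are set before the first collective:

```python
# Generic debugging sketch, not DINOv2-specific: enable verbose logs before
# the first collective so every rank reports what it is doing.
import os

os.environ["NCCL_DEBUG"] = "INFO"                 # per-rank NCCL log lines
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"     # focus on init and collectives
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra torch.distributed consistency checks
```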

Troubleshooting tips:
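As a concrete starting point (a generic sketch, not code from this repository), a bare-bones collective test launched the same way as the failing job separates environment problems from problems in the training script. If this also hangs or aborts on the same 4 GPUs, the issue lies in the NCCL / driver / interconnect setup rather than in the DINOv2 code:

```python
# nccl_smoke_test.py -- hypothetical file name for a minimal NCCL sanity check.
# Launch it with the same command shape as the failing job, e.g.:
#   torchrun --nproc_per_node=4 nccl_smoke_test.py
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # all_gather is one of the collectives the watchdog was waiting on above.
    x = torch.full((8,), float(rank), device="cuda")
    out = [torch.empty_like(x) for _ in range(world_size)]
    dist.all_gather(out, x)
    torch.cuda.synchronize()

    print(f"rank {rank}: all_gather ok -> {[int(t[0].item()) for t in out]}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If the smoke test passes, the usual suspects are a rank that skips a collective (e.g. a data-dependent branch or an empty batch on one GPU) or a data loader that stalls on one worker.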

DarkJokers commented 1 month ago

Has this been resolved?