DeepGraphLearning / GearNet

GearNet and Geometric Pretraining Methods for Protein Structure Representation Learning, ICLR'2023 (https://arxiv.org/abs/2203.06125)
MIT License

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 20) of binary.... #13

Open bozhenhhu opened 1 year ago

bozhenhhu commented 1 year ago

When I run `python -m torch.distributed.launch --nproc_per_node=4 script/downstream.py -c config/downstream/GO-BP/gearnet_edge.yaml --gpus [0,1,2,3] --ckpt` on one worker (Tesla-V100-SXM2-32GB: 4 GPUs, 47 CPUs), I get the following error:

```
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804901 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805985 milliseconds before timing out.
Traceback (most recent call last):
  File "/hubozhen/GearNet/script/downstream.py", line 75, in <module>
    train_and_validate(cfg, solver, scheduler)
  File "/hubozhen/GearNet/script/downstream.py", line 30, in train_and_validate
    solver.train(**kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/core/engine.py", line 155, in train
    loss, metric = model(batch)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/tasks/property_prediction.py", line 279, in forward
    pred = self.predict(batch, all_loss, metric)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/tasks/property_prediction.py", line 300, in predict
    output = self.model(graph, graph.node_feature.float(), all_loss=all_loss, metric=metric)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/models/gearnet.py", line 99, in forward
    edge_hidden = self.edge_layers[i](line_graph, edge_input)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/conv.py", line 92, in forward
    output = self.combine(input, update)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/conv.py", line 438, in combine
    output = self.batch_norm(output)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 758, in forward
    world_size,
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/_functions.py", line 42, in forward
    dist._all_gather_base(combined_flat, combined, process_group, async_op=False)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2070, in _all_gather_base
    work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804901 milliseconds before timing out.
/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/functional/functional.py:474: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  index1 = local_index // local_inner_size + offset1
/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/functional/functional.py:474: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  index1 = local_index // local_inner_size + offset1
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804901 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out.
/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/data/graph.py:1667: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  edge_in_index = local_index // local_inner_size + edge_in_offset
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805985 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 21 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 20) of binary: /opt/anaconda3/envs/manifold/bin/python
Traceback (most recent call last):
  File "/opt/anaconda3/envs/manifold/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
===================================================
/hubozhen/GearNet/script/downstream.py FAILED
---------------------------------------------------
Failures:
[1]:
  time      : 2022-12-12_09:41:02
  host      : pytorch-7c3c96f1-d9hcm
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 22)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 22
[2]:
  time      : 2022-12-12_09:41:02
  host      : pytorch-7c3c96f1-d9hcm
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 23)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 23
---------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-12-12_09:41:02
  host      : pytorch-7c3c96f1-d9hcm
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 20)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 20
===================================================
```

Someone said this happens when loading large datasets, and I see that the utilization of all four GPUs stays at 100%. However, when I ran the same procedure on another V100 machine (one worker, Tesla-V100-SXM-32GB: 4 GPUs, 48 CPUs), it worked fine. This confuses me.
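Since the timeout fires inside the SyncBatchNorm all_gather after the default 30-minute NCCL watchdog window, one thing I am considering is turning on NCCL debug logging and raising the collective timeout to see whether one rank is simply much slower. A rough sketch of what that could look like, assuming the process group were initialized manually before building the solver (not the actual GearNet/torchdrug code, names are only illustrative):

```python
import os
import datetime
import torch
import torch.distributed as dist

# Print per-collective NCCL information so the hanging operation can be identified.
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Surface NCCL errors as exceptions instead of only aborting via the watchdog.
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

# The elastic-based launcher sets LOCAL_RANK in the environment (assumption: torch >= 1.9).
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Raise the collective timeout from the default 30 minutes to 2 hours,
# in case one rank is just slow rather than deadlocked.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```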

Oxer11 commented 1 year ago

Hi! I'm sorry, I am not familiar with torch.distributed.elastic and don't know the reason for your problem. Maybe you could try the typical DDP setup in PyTorch instead of the elastic launcher?
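For context, a plain single-node DDP setup without the elastic launcher could look roughly like the sketch below; this is not the GearNet training code, and `run_worker`, the port, and the model are placeholders for illustration only:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run_worker(rank, world_size):
    # One process per GPU, started with mp.spawn instead of torch.distributed.launch.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"        # placeholder port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 1).cuda(rank)  # placeholder model
    model = DDP(model, device_ids=[rank])

    # ... build the dataset/solver and run the training loop here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4                             # number of GPUs on the node
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)
```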