Haiyang-W / DSVT

[CVPR2023] Official Implementation of "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets"
https://arxiv.org/abs/2301.06051
Apache License 2.0

Timeout when training #39

Closed: Wang0203 closed this issue 1 year ago

Wang0203 commented 1 year ago
2023-06-30 02:10:13,430   INFO  epoch: 14/20, acc_iter=112400, cur_iter=4263/7724, batch_size=4, time_cost(epoch): 1:04:10/52:05, time_cost(all): 16:04:06/10:33:21, loss=1.5466811966896057, d_time=0.02(0.02), f_time=0.81(0.88), b_time=0.83(0.90), norm=28.733774185180664, lr=0.002140055979797459
2023-06-30 02:10:58,540   INFO  epoch: 14/20, acc_iter=112450, cur_iter=4313/7724, batch_size=4, time_cost(epoch): 1:04:55/51:20, time_cost(all): 16:04:51/10:32:35, loss=1.5467678713798523, d_time=0.02(0.02), f_time=0.87(0.88), b_time=0.89(0.90), norm=28.955286026000977, lr=0.002135863906495637
2023-06-30 02:11:03,331   INFO  Save latest model to /root/paddlejob/workspace/env_run/DSVT/output/cfgs/dsvt_models/dsvt_plain_1f_onestage_nusences/default/ckpt/latest_model
2023-06-30 02:11:43,555   INFO  epoch: 14/20, acc_iter=112500, cur_iter=4363/7724, batch_size=4, time_cost(epoch): 1:05:40/50:35, time_cost(all): 16:05:36/10:31:49, loss=1.548808376789093, d_time=0.02(0.02), f_time=0.80(0.88), b_time=0.82(0.90), norm=29.192405700683594, lr=0.002131672879084205
/bin/sh: gpustat: command not found
2023-06-30 02:11:43,990   INFO  
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803000 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803001 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803057 milliseconds before timing out.
2023-06-30 09:52:59,721   INFO  Save latest model to /root/paddlejob/workspace/env_run/DSVT/output/cfgs/dsvt_models/dsvt_plain_1f_onestage_nusences/default/ckpt/latest_model
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803001 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803000 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803057 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 13116 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 13117) of binary: /usr/bin/python3.7
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-06-30_09:53:05
  host      : 10-67-245-145.local
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 13120)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 13120
[2]:
  time      : 2023-06-30_09:53:05
  host      : 10-67-245-145.local
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 13122)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 13122
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-30_09:53:05
  host      : 10-67-245-145.local
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 13117)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 13117
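
For context: `Timeout(ms)=1800000` in the log is the default 30-minute NCCL collective timeout, and ranks 1-3 all timed out on a BROADCAST while rank 0 is absent from the failure list, which is consistent with rank 0 stalling somewhere else (its next "Save latest model" line only appears hours later). If a legitimately slow step is the cause, the timeout can be raised when the process group is created. A minimal sketch, assuming the training script calls `init_process_group` itself rather than leaving it to the launcher:

```python
# Sketch: raise the NCCL collective timeout above the default 30 minutes.
# Where init_process_group is actually called depends on the training code;
# this only shows the relevant keyword argument.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=2),  # default is timedelta(minutes=30), i.e. 1800000 ms
)
```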
Wang0203 commented 1 year ago

Hello! Have you encountered this problem before?

Haiyang-W commented 1 year ago

I think it was a one-off failure. You can resume training from the latest checkpoint.
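
For reference, the run already saves checkpoints (the "Save latest model" lines in the log above), so training state can be restored instead of starting over. A minimal sketch of what that restore looks like, assuming OpenPCDet-style checkpoint keys (`model_state`, `optimizer_state`, `epoch`); the exact keys and the resume flag in DSVT's train.py may differ:

```python
# Sketch: restore training state from the saved "latest_model" checkpoint.
# The key names below are assumptions based on OpenPCDet conventions.
import torch


def resume_from_checkpoint(model, optimizer, ckpt_path):
    """Restore model/optimizer state and return the epoch to continue from."""
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state"])
    optimizer.load_state_dict(checkpoint["optimizer_state"])
    return checkpoint.get("epoch", 0) + 1  # continue after the saved epoch
```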

Wang0203 commented 1 year ago

However, I hit the same problem again last night.

Haiyang-W commented 1 year ago

I have never encountered such a problem; it is most likely an NCCL communication issue. Did you run this on a dedicated 8-GPU machine?

Haiyang-W commented 1 year ago

I don't think this is a problem with DSVT. I suggest you check the machine or the environment first.
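
One way to rule DSVT out is a bare torch.distributed smoke test in the same environment; if this also hangs or times out, the problem is below the model code. (The `/bin/sh: gpustat: command not found` line in the log already hints the environment is missing pieces.) A minimal sketch, assuming a torchrun-style launcher that sets `LOCAL_RANK`:

```python
# nccl_smoke_test.py: check NCCL communication across GPUs without DSVT.
# Run with, e.g.: torchrun --nproc_per_node=4 nccl_smoke_test.py
import os

import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")  # verbose NCCL logs help spot bad links

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Exercise the same collective that timed out in the log (BROADCAST),
# plus an all_reduce for good measure.
t = torch.full((1 << 20,), float(dist.get_rank()), device="cuda")
dist.broadcast(t, src=0)
dist.all_reduce(t)
torch.cuda.synchronize()
print(f"rank {dist.get_rank()}: OK, t[0]={t[0].item()}")

dist.destroy_process_group()
```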