XY-boy / EDiffSR

[IEEE TGRS 2024] EDiffSR: An Efficient Diffusion Probabilistic Model for Remote Sensing Image Super-Resolution

Performing Distributed Training #7

Open dreamwish1998 opened 10 months ago

dreamwish1998 commented 10 months ago

How do we perform distributed training in this project? Or how should we modify the code to support it? Thank you very much!

XY-boy commented 10 months ago

> How do we perform distributed training in this project? Or how should we modify the code to support it? Thank you very much!

I trained it on a single GPU. Maybe you can set gpu_ids in setting.yml to train it on multiple GPUs, e.g., gpu_ids: [0, 1, 2, 3].
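
For reference, a minimal sketch of what a gpu_ids-style, single-process multi-GPU run usually amounts to under the hood (plain PyTorch DataParallel; the tiny network and the random tensors below are placeholders, not EDiffSR's actual model or data pipeline):

```python
import os
import torch
import torch.nn as nn

gpu_ids = [0, 1, 2, 3]                                   # mirrors gpu_ids in setting.yml
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, gpu_ids))

# Stand-in network; the real code would build the EDiffSR model from setting.yml.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 3, 3, padding=1)
).cuda()
model = nn.DataParallel(model)                           # replicate across all visible GPUs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

lr_imgs = torch.randn(16, 3, 32, 32).cuda()              # dummy low-res batch
hr_imgs = torch.randn(16, 3, 32, 32).cuda()              # dummy target batch
out = model(lr_imgs)                                     # batch is split across the GPUs
loss = nn.functional.l1_loss(out, hr_imgs)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

This stays a single process, so no torchrun launcher or NCCL process group is involved.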

CANIBBER commented 3 months ago

This code works well with DataParallel: just setting gpu_ids in setting.yml to multiple GPUs, e.g., gpu_ids: [0, 1, 2, 3], enables that single-process multi-GPU training. The DistributedDataParallel route, however, has problems; for me the NCCL timeout matters the most. It happens at random, and I am still trying to figure out what causes it (see the sketch after the log below).

[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13727, OpType=ALLREDUCE, NumelIn=282883, NumelOut=282883, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 13727, last enqueued NCCL work: 13729, last completed NCCL work: 13726.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13727, OpType=ALLREDUCE, NumelIn=282883, NumelOut=282883, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe00e7b9897 in /root/anaconda3/envs/ediffsr/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fe00fa92c62 in /root/anaconda3/envs/ediffsr/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fe00fa97a80 in /root/anaconda3/envs/ediffsr/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe00fa98dcc in /root/anaconda3/envs/ediffsr/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7fe067176bf4 in /root/anaconda3/envs/ediffsr/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fe06a610609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fe06a535353 in /lib/x86_64-linux-gnu/libc.so.6)

W0810 23:49:29.625689 139791896203648 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 396 closing signal SIGTERM
/root/anaconda3/envs/ediffsr/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 22 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
E0810 23:49:29.797456 139791896203648 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 1 (pid: 397) of binary: /root/anaconda3/envs/ediffsr/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/ediffsr/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/ediffsr/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/ediffsr/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/root/anaconda3/envs/ediffsr/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/anaconda3/envs/ediffsr/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/ediffsr/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
====================================================
train.py FAILED
----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-10_23:49:29
  host      : DESKTOP-LHPDM18.localdomain
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 397)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 397
====================================================
(ediffsr) root@DESKTOP-LHPDM18:~/program/EDiffSR/codes/config/sisr# /root/anaconda3/envs/ediffsr/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
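
One commonly tried mitigation for the watchdog timeout above, offered as a hedged sketch rather than a fix verified on this repo, is to raise the NCCL process-group timeout when initializing distributed training under torchrun (the default is 10 minutes, matching the Timeout(ms)=600000 in the log). The one-layer network here is a placeholder for the EDiffSR model:

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun exports LOCAL_RANK / RANK / WORLD_SIZE for each worker it spawns.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=60),   # default is 10 minutes (= the 600000 ms above)
)

# Stand-in network; the real code would build the EDiffSR model from setting.yml.
model = nn.Conv2d(3, 3, 3, padding=1).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])   # gradients are all-reduced across ranks

x = torch.randn(8, 3, 32, 32, device=local_rank)
loss = model(x).mean()
loss.backward()                               # this is where an ALLREDUCE like the one in the log runs

dist.destroy_process_group()
```

Launched with something like torchrun --nproc_per_node=4 train.py plus whatever option flags this repo's train.py expects. Note that a longer timeout only hides the symptom if one rank stops issuing collectives (for example, validation, logging, or EMA updates that run on rank 0 only inside the training loop); making sure every rank executes the same number of iterations, or setting TORCH_NCCL_ASYNC_ERROR_HANDLING=1 (NCCL_ASYNC_ERROR_HANDLING on older PyTorch) so the failing rank surfaces sooner, is usually closer to the actual fix.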