Gloo timeout when training on multi-GPU configurations

TheTrustedComputer commented 4 months ago

Describe the bug

I have two 8GB AMD Radeon RX 5500 XTs for creating RVC models; training on both is nearly twice as fast as training on a single card. I greatly appreciate the support for distributed multi-GPU training setups. However, there appears to be a communication problem between the processes that results in a deadlock and an interrupted session. Here's the runtime error output I saw after 30 minutes of inactivity:

Process Process-1:
Traceback (most recent call last):
  File "/home/thetrustedcomputer/Software/Python-3.10.13/Lib/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/thetrustedcomputer/Software/Python-3.10.13/Lib/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/thetrustedcomputer/Software/Git/RVC-Fumiama/infer/modules/train/train.py", line 278, in run
    train_and_evaluate(
  File "/home/thetrustedcomputer/Software/Git/RVC-Fumiama/infer/modules/train/train.py", line 508, in train_and_evaluate
    scaler.scale(loss_gen_all).backward()
  File "/home/thetrustedcomputer/Software/venv/RVC-Fumiama/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/thetrustedcomputer/Software/venv/RVC-Fumiama/lib/python3.10/site-packages/torch/autograd/__init__.py", line 199, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 1800000ms for recv operation to complete
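
For reference, the 1800000 ms in the last line works out to 30 minutes, which matches the period of inactivity mentioned above. Below is a minimal sketch for pulling that value out of a log line. The helper name `parse_gloo_timeout` is hypothetical and is not part of RVC or PyTorch.

```python
import re
from datetime import timedelta
from typing import Optional

# Last line of the traceback above, copied verbatim.
ERROR_LINE = (
    "RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] "
    "Timed out waiting 1800000ms for recv operation to complete"
)


def parse_gloo_timeout(line: str) -> Optional[timedelta]:
    """Return the timeout reported in a 'Timed out waiting' message, or None."""
    match = re.search(r"Timed out waiting (\d+)ms", line)
    if match is None:
        return None
    return timedelta(milliseconds=int(match.group(1)))


if __name__ == "__main__":
    print(parse_gloo_timeout(ERROR_LINE))  # 0:30:00
```

`torch.distributed.init_process_group` accepts a `timeout` argument (a `timedelta`), so a shorter value might make the hang surface sooner while debugging. That is an untested assumption for this setup, and it would not address the underlying deadlock.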

To Reproduce

1. Open the web GUI, e.g. python gui.py.
2. Go to the training tab to create a model utilizing two or more GPUs, assuming you have the hardware. For example, 0-1 or 0-1-2. The GPU fans will ramp up.
3. Wait for them to spin down prematurely. Alternatively, wait until there's no further logging output. This is when the issue I described occurs.
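
Instead of watching the fans or the console, the "no further logging output" condition can be checked from a script. This is a rough sketch that assumes the training run appends to a log file on disk; the function names and the 10-minute threshold are hypothetical choices, not anything RVC provides.

```python
import os
import time
from typing import Optional


def seconds_since_last_write(log_path: str, now: Optional[float] = None) -> float:
    """Seconds elapsed since the log file was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(log_path)


def is_stalled(
    log_path: str,
    max_idle_seconds: float = 600.0,  # assumed threshold: 10 minutes
    now: Optional[float] = None,
) -> bool:
    """True when the log has been idle for longer than max_idle_seconds."""
    return seconds_since_last_write(log_path, now) > max_idle_seconds
```

Calling `is_stalled(path_to_train_log)` periodically (for example from a cron job) would flag the hang well before the 30-minute Gloo timeout fires. The threshold needs to be longer than the slowest normal gap between log lines.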

Expected behavior

Distributed training should continue without interruption until the last epoch or the user hits Ctrl+C.

Screenshots

I've attached two screenshots from radeontop showing the expected and actual GPU usage.

Expected (two GPUs sharing the load, happened minutes after initial training): Screenshot_20240729_072231

Actual (one GPU at full load and the other idle, happened several hours later): Screenshot_20240729_071916

Desktop (please complete the following information):

OS and version: Arch Linux
Python version: 3.10.13
Commit/Tag with the issue: Latest

Additional context

This isn't 100% reproducible due to the nondeterministic nature of parallelism, so it's important to run multiple training sessions before concluding that it's fixed. I've tried changing the ROCm version (tested on 5.2.3 and 5.4.3) and got the same symptoms; the root cause may be within the training logic seen in the traceback. Although unlikely, it's also possible that the ROCm or PyTorch build I'm using has broken concurrency libraries.
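
To put a number on "multiple iterations": if the hang hit each run independently with probability p, then n consecutive clean runs would still happen by chance with probability (1 - p)^n. The sketch below solves for the smallest n that pushes that chance below a chosen level. The independence assumption and the example failure rates are illustrative only; the actual rate has not been measured here.

```python
import math


def clean_runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest n with (1 - failure_rate) ** n <= 1 - confidence.

    Assumes every run fails independently with the same probability.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # Hypothetical failure rates, for illustration only.
    for rate in (0.5, 0.25):
        print(rate, clean_runs_needed(rate))
```

With a hypothetical failure rate of one run in four, 11 clean runs in a row would be needed before a fix could be trusted at the 95% level under these assumptions.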

At one point, an epoch took well over an hour to complete, apparently because shared or system RAM was being used instead of the GPU's VRAM, which was nowhere near full.

I'm uncertain if this error also affects NVIDIA GPUs. For those who have 2+ NVIDIA cards, please let us know if it applies to you.

fumiama commented 4 months ago

We plan to rewrite the whole training code later; we can then see whether this problem is resolved.

> For those who have 2+ NVIDIA cards, please let us know if it applies to you.

Agree.

charleswg commented 3 months ago

Two NVIDIA A5000 cards show no issues; memory and compute are in use on both GPUs.

TheTrustedComputer commented 3 months ago

@charleswg Thank you for your insights. It appears NVIDIA cards aren't affected, and the issue may apply only to AMD.

fumiama / Retrieval-based-Voice-Conversion-WebUI

Gloo timeout when training on multi-GPU configurations #80