TheTrustedComputer opened this issue 3 months ago
We will rewrite the whole training code later; then we can see whether this problem can be solved.
For those who have 2+ NVIDIA cards, please let us know if it applies to you.
Agree.
Two NVIDIA A5000 cards have no issues here; both GPUs show memory and compute usage.
@charleswg Thank you for your insights. It appears NVIDIA cards aren't affected, so the issue may only apply to AMD.
Describe the bug
I have two 8GB AMD Radeon RX 5500 XTs for creating RVC models; training on both is nearly twice as fast as training on a single card, and I greatly appreciate the support for distributed multi-GPU training setups. However, there appears to be a communication hiccup between the processes, resulting in a deadlock and an interrupted session. Here's the runtime error output I saw after 30 minutes of inactivity:
To Reproduce
1. Run `python gui.py`.
2. Select GPU indices `0-1` or `0-1-2`.
3. Start training. The GPU fans will ramp up. (A rough sketch of the kind of multi-process launch this drives is included below.)

Expected behavior
Distributed training should continue without interruption until the last epoch or the user hits Ctrl+C.
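For context, the multi-process data-parallel setup involved looks roughly like the sketch below. This is a minimal, generic torch.distributed example, not the actual RVC training code; `run_worker` and the dummy model are hypothetical. On ROCm, `torch.cuda` maps to HIP devices and the `nccl` backend maps to RCCL. The relevant point is that every rank blocks on the implicit gradient all-reduce in `backward()`, so if one process stalls or desynchronizes, the other hangs on the collective, which matches the "one GPU busy, one idle" symptom.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def run_worker(rank: int, world_size: int) -> None:
    # Rendezvous for the process group (single machine, two GPUs).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Dummy model standing in for the real network.
    model = DDP(torch.nn.Linear(256, 256).cuda(rank), device_ids=[rank])
    opt = torch.optim.AdamW(model.parameters())

    for _ in range(100):
        x = torch.randn(32, 256, device=f"cuda:{rank}")
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()  # implicit all-reduce: every rank must reach this point
        opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(run_worker, args=(2,), nprocs=2)
```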
Screenshots
I've attached two screenshots from radeontop showing the expected and actual GPU usage.
Expected (two GPUs sharing the load; taken minutes after training started):
Actual (one GPU at full load and the other idle; taken several hours later):
Desktop (please complete the following information):
Additional context
This isn't 100% reproducible due to the nondeterministic nature of parallelism, so it's important to run multiple iterations to confirm it's truly fixed. I've tried changing the ROCm version (tested on 5.2.3 and 5.4.3) and got the same symptoms; the root cause may be within the training logic seen in the traceback. Although unlikely, it's also possible that the ROCm or PyTorch build I'm using has broken concurrency libraries.
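If it helps anyone reproduce or debug this, the standard torch.distributed knobs below make a hung collective surface as an explicit error instead of a silent stall. This is a generic debugging sketch, not a confirmed fix; the environment variables have to be set before the worker processes are spawned, and RCCL on ROCm honors the same NCCL_* variable names.

```python
import datetime
import os
import torch.distributed as dist

# Verbose collective logging and desync detection.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
# Needed so the timeout below actually aborts a stuck NCCL/RCCL collective.
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")


def init_distributed(rank: int, world_size: int) -> None:
    # A short timeout turns a silent deadlock into a raised error with a stack trace.
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(minutes=10),
    )
```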
At one point, an epoch completed in well over an hour, apparently using shared or system RAM instead of the GPUs' VRAM, which was nowhere near full.
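For what it's worth, a quick way to check whether both devices are actually holding training state in VRAM (a generic torch.cuda check, not something from the RVC code):

```python
import torch

# Print per-device allocation; near-zero numbers on one card while the other
# is full would suggest the work isn't being distributed across both devices.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 2**20
    reserved = torch.cuda.memory_reserved(i) / 2**20
    print(f"device {i}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")
```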
I'm uncertain if this error also affects NVIDIA GPUs. For those who have 2+ NVIDIA cards, please let us know if it applies to you.