planetrocke closed this issue 2 years ago.
I think this may be more relevant on the trainer repo; if so, please let me know. It seems the multi-GPU (distributed) side of VITS training needs some work.
Update: I was able to get training to work with 2 GPUs by commenting out lines 1185-1189, the ones pertaining to test_log. However, the GPUs acted a bit odd: each epoch took about 50% longer, and the first card seemed to do all the processing while the second card just pegged at nearly 100% utilization and didn't move.
Any thoughts on this? I definitely would like multi-GPU to work with VITS.
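For context, my best guess at the root cause, as a minimal sketch of the general PyTorch pattern rather than the actual Coqui Trainer code (`unwrap` is a hypothetical helper name): DistributedDataParallel only exposes standard nn.Module attributes, so custom methods like test_log have to be reached through the wrapper's .module attribute.

```python
# Minimal sketch, assuming the model defines a custom `test_log` method.
# DDP wraps the model and does not forward custom attributes, hence
# "'DistributedDataParallel' object has no attribute 'test_log'".
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def unwrap(model: nn.Module) -> nn.Module:
    # DDP keeps the original model on `.module`; unwrap it before calling
    # model-specific hooks such as `test_log`.
    return model.module if isinstance(model, DDP) else model

# i.e. instead of `model.test_log(...)`, call `unwrap(model).test_log(...)`.
```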
It's fixed in the latest version of Coqui Trainer.
Try installing https://github.com/coqui-ai/Trainer with pip3 install -U git+https://github.com/coqui-ai/Trainer.
Oh sweet. Should I be using the Trainer by itself instead of as part of the TTS release?
Yep, you need to uninstall the trainer with pip and install the new one from GitHub.
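For completeness, the swap would look something like this (assuming the old package is installed under the name trainer, which I believe is what Coqui publishes on PyPI):
pip3 uninstall trainer
pip3 install -U git+https://github.com/coqui-ai/Trainer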
Thank you both. Multi-GPU works now; however, I cannot figure out why the box keeps crashing. I have dual 3090s and thought they might be overdrawing power, so I set a forced power limit of 280 W, and now I am trying 250 W. I've also read that any PSU below a Platinum rating can't handle the transient power spikes. I know I can't run at max power since the PSU is only 1200 W, but the draw isn't coming close to that. Any input on this from 30-series folks would be awesome. Thanks again.
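For reference, a cap like this is set per GPU with nvidia-smi (value in watts; requires root):
sudo nvidia-smi -i 0 -pl 250
sudo nvidia-smi -i 1 -pl 250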
OK, so even at 250 W it fails (as in the whole system crashes, with no logs or anything). I'm hoping someone can provide input.
I guess the core problem is solved by reinstalling, so I'm closing this issue. Feel free to continue over on the Discussions.
Describe the bug
I successfully trained this dataset with VITS on a single GPU. When I attempted to train it using multiple GPUs, it froze on STEP 0 with the error:
'DistributedDataParallel' object has no attribute 'test_log'
To Reproduce
Run the following command:
python3 -m trainer.distribute --gpus=0,1 --script recipes/ljspeech/vits_tts/train_vits.py
Expected behavior
The model should train for x epochs with multi-GPU support.
Logs
Environment
Additional context
No response