Closed: Eyalm321 closed this issue 3 months ago
Hi @Eyalm321
Last I knew, the Coqui scripts, which run the backend of the training, don't support multi-GPU training for the XTTS model: https://github.com/coqui-ai/TTS/issues/3132#issuecomment-1798437571
The backend Coqui scripts are currently being maintained here: https://github.com/idiap/coqui-ai-TTS. FYI, this person is NOT a member of Coqui and never worked for Coqui; as I understand it, they maintain the fork in their free time, so I am unable to comment on how quickly they may be able to respond.
They will be your best bet, though, for investigating and resolving multi-GPU capability in the training scripts.
Thanks
🔴 If you have installed AllTalk in a custom Python environment, I will only be able to provide limited assistance/support. AllTalk draws on a variety of scripts and libraries that are not written or managed by myself, and they may fail, error, or give strange results in custom-built Python environments.
🔴 Please generate a diagnostics report and upload the "diagnostics.log" as this helps me understand your configuration.
https://github.com/erew123/alltalk_tts/tree/main?#-how-to-make-a-diagnostics-report-file
Installed on Docker from the Dockerfile + docker-compose up -d. Hardware: 4x NVIDIA V100 SXM2 16GB, 192GB RAM, 44-core Intel Xeon 2699 v4.
Describe the bug
After setting use_ddp=True in TrainerArgs, training hangs at init_process_group in distributed.py.
To Reproduce
Steps to reproduce the behaviour: fresh Docker install as-is, set use_ddp=True, and run the training. A sketch of the failing setup is shown below.
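For reference, a minimal sketch of where the flag sits, assuming the standard coqui-ai/TTS XTTS fine-tuning recipe; the GPTTrainer model/config construction is elided, and config, model, train_samples, and eval_samples stand in for the recipe's own objects:

```python
# Sketch of the failing configuration, assuming the standard XTTS
# fine-tuning recipe from coqui-ai/TTS; model/config setup is elided.
from trainer import Trainer, TrainerArgs

trainer_args = TrainerArgs(
    use_ddp=True,  # enabling this is what leads to the hang at init_process_group
)

trainer = Trainer(
    trainer_args,
    config,                       # GPTTrainerConfig from the recipe (elided)
    output_path="/tmp/xtts_run",  # placeholder path
    model=model,                  # GPTTrainer instance (elided)
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```

One possible explanation (an assumption, not confirmed for this setup): use_ddp=True makes the trainer call init_process_group, which blocks until every rank in the process group has joined. If the script is launched as a single plain python process, rather than through a launcher that sets RANK/WORLD_SIZE/MASTER_ADDR and spawns one process per GPU (the Trainer package documents python -m trainer.distribute --script <your_script.py> for this), init_process_group can wait indefinitely for peers that never arrive, which would match a silent hang with no CPU/GPU activity.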
Text/logs
It just hangs at "using DDP"; there is no error, and no CPU, GPU, or memory usage.
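Not part of the original report, but a debugging sketch that may help turn the silent hang into something loggable. The environment variables are standard PyTorch/NCCL switches, not AllTalk settings, and the explicit init_process_group call only applies if the process group is formed manually under a launcher that sets RANK, WORLD_SIZE, and MASTER_ADDR:

```python
# Debugging sketch (assumptions noted above): surface why
# init_process_group blocks instead of letting it hang silently.
import os
from datetime import timedelta

os.environ["NCCL_DEBUG"] = "INFO"                 # verbose NCCL rendezvous/transport logs
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra torch.distributed diagnostics
os.environ["NCCL_P2P_DISABLE"] = "1"              # worth trying if GPU peer-to-peer transport is the culprit

import torch.distributed as dist

# A timeout converts an indefinite block into an explicit error after
# two minutes; requires RANK/WORLD_SIZE/MASTER_ADDR to be set by a launcher.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=2))
```

With NCCL_DEBUG=INFO, each rank should print which interfaces and transports it tries during rendezvous, which usually narrows a hang down to networking, peer discovery, or a rank that never started.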
Desktop (please complete the following information):
AllTalk was updated: 01 July 2024
Custom Python environment: no
Text-generation-webUI was updated: [approx. date]