erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and WAV file maintenance. It can also be used with 3rd-party software via JSON calls.
GNU Affero General Public License v3.0

Training with DDP doesn't work, hangs at init_process_group #305

Closed Eyalm321 closed 3 months ago

Eyalm321 commented 3 months ago

🔴 If you have installed AllTalk in a custom Python environment, I will only be able to provide limited assistance/support. AllTalk draws on a variety of scripts and libraries that are not written or managed by myself, and they may fail, error, or give strange results in custom-built Python environments.

🔴 Please generate a diagnostics report and upload the "diagnostics.log" as this helps me understand your configuration.

https://github.com/erew123/alltalk_tts/tree/main?#-how-to-make-a-diagnostics-report-file

Installed on Docker from the Dockerfile + docker-compose up -d. Hardware: 4x NVIDIA V100 SXM2 16 GB, 192 GB RAM, 44-core Intel Xeon 2699-v4.
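
For reference, a quick PyTorch check like the sketch below (illustrative only, not part of AllTalk) confirms whether all four V100s are actually visible inside the container:

```python
# Verify GPU visibility inside the Docker container before testing DDP.
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```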

Describe the bug

Upon setting use_ddp=True in TrainerArgs, training hangs right at init_process_group from distributed.py

To Reproduce: fresh Docker install as-is, add use_ddp=True, and run the training.
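
For context, this is roughly what that change looks like, a minimal sketch assuming the TrainerArgs dataclass from the Coqui trainer package that the training script builds on (everything besides use_ddp is omitted):

```python
# Sketch of enabling DDP in a Coqui-style training script; the surrounding
# AllTalk finetuning code is omitted.
from trainer import TrainerArgs

trainer_args = TrainerArgs(
    use_ddp=True,  # request multi-GPU DistributedDataParallel training
)
# Note: torch.distributed.init_process_group() waits for every rank in the
# process group to join, so DDP expects one training process per GPU to be
# launched.
```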


Text/logs: It just hangs when using DDP; there is no error and no CPU, GPU, or memory usage.
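
As an illustrative aside (not AllTalk code), a standalone script such as the hypothetical ddp_check.py below can show whether torch.distributed is able to rendezvous at all inside the container when launched with torchrun --nproc_per_node=4:

```python
# ddp_check.py - hypothetical standalone test of the DDP rendezvous; launch with:
#   torchrun --nproc_per_node=4 ddp_check.py
import datetime
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT for us.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    # A short timeout turns a silent hang into a visible error.
    dist.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=60))
    torch.cuda.set_device(rank % torch.cuda.device_count())
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # every rank should end up with x == world_size
    print(f"rank {rank}/{world_size}: all_reduce ok, x = {x.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```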

Desktop (please complete the following information):
- AllTalk was updated: 01 July 2024
- Custom Python environment: no
- Text-generation-webUI was updated: [approx. date]


erew123 commented 3 months ago

Hi @Eyalm321

Last I knew, the Coqui scripts, which run the backend of the training, don't support multi-GPU training for the XTTS model: https://github.com/coqui-ai/TTS/issues/3132#issuecomment-1798437571

Currently the backend Coqui scripts are being maintained here: https://github.com/idiap/coqui-ai-TTS. FYI, this person is NOT a member of Coqui and has never worked for Coqui, and as I understand it they do this in their free time, so I am unable to comment on how quickly they may be able to respond.

They will be your best bet, though, for looking into and resolving multi-GPU capability with the training scripts.

Thanks