coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0
33.21k stars 4.02k forks

[Bug] Training XTTSv2 with DDP leads to weird training lags #3807

Open NikitaKononov opened 2 months ago

NikitaKononov commented 2 months ago

Describe the bug

Hello, training XTTSv2 with DDP leads to weird training lags: training gets stuck with no errors. Hardware: 6x RTX A6000 and 512 GB RAM.

Here is the GPU load monitoring graph. Purple is gpu0, green is gpu1 (all the other GPUs behave like gpu1).

[image: GPU load monitoring graph]

With 2 or 4 GPUs the situation remains the same.

I think there's some kind of error in Trainer or in the XTTS scripts. Or maybe my dataset is just rather large: 2000 hours of a single language.

To Reproduce

python -m trainer.distribute --script recipes/ljspeech/xtts_v2/train_gpt_xtts.py --gpus 0,1,2,3,4,5

Expected behavior

Training should not get stuck.

Logs

No response

Environment

tts version: latest

Additional context

No response

NikitaKononov commented 2 months ago

Tried num_workers=0, num_workers>0, MP_THREADS_NUM and so on; nothing helps. The machine has lots of RAM and shared memory.
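
One possible way to see where each process is stuck when there are no errors is to dump Python stack traces once a training step takes too long. The following is only a sketch, not part of the original report and not part of the Trainer API: it uses Python's standard `faulthandler` module, and the `StepWatchdog` name, the timeout value and the per-process log file name are all hypothetical choices for illustration.

```python
import faulthandler
import os


class StepWatchdog:
    """Hypothetical helper: if the wrapped block runs longer than
    `timeout_s` seconds, write the stack traces of all threads of this
    process to `log_path`. The process itself is not killed."""

    def __init__(self, timeout_s, log_path=None):
        self.timeout_s = timeout_s
        # One file per process, so the ranks do not overwrite each other.
        self.log_path = log_path or f"hang_trace_{os.getpid()}.log"
        self._file = None

    def __enter__(self):
        self._file = open(self.log_path, "a")
        # Arm a timer; faulthandler writes directly to the file descriptor
        # from a watchdog thread, even if the main thread is blocked.
        faulthandler.dump_traceback_later(
            self.timeout_s, file=self._file, exit=False
        )
        return self

    def __exit__(self, exc_type, exc, tb):
        # The block finished (or raised): disarm the timer and close the file.
        faulthandler.cancel_dump_traceback_later()
        self._file.close()
        return False  # never swallow exceptions


# Assumed usage inside a training loop (train_step is a placeholder):
#
#     with StepWatchdog(timeout_s=300):
#         train_step(batch)
```

If a step hangs, the log file of each process shows the line it was blocked on (for example a data loader call or a collective operation), which may help tell whether gpu0 is doing extra work while the other ranks wait.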