[Bug] Training XTTSv2 with DDP leads to weird training lags

NikitaKononov commented 4 months ago

Describe the bug

Hello. Training XTTSv2 with DDP leads to strange lags: training gets stuck with no errors. Hardware: 6x RTX A6000 and 512 GB RAM.

Here is the GPU load monitoring graph. Purple is gpu0, green is gpu1 (all the other GPUs behave like gpu1).

With 2 or 4 GPUs the situation remains the same.

I think there is some kind of error in Trainer or in the XTTS scripts. Or maybe my dataset is rather large: 2000 hours of one language.

To Reproduce

python -m trainer.distribute --script recipes/ljspeech/xtts_v2/train_gpt_xtts.py --gpus 0,1,2,3,4,5

Expected behavior

Training should not get stuck.

Logs

No response

Environment

tts version: latest

Additional context

No response

NikitaKononov commented 4 months ago

I tried num_workers=0, num_workers>0, MP_THREADS_NUM and so on; nothing helps. There is plenty of RAM and shared memory.
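
Since the hang produces no errors and no logs, one possible way to collect some is sketched below. It is only a sketch under assumptions, not something from Trainer or TTS: the helper names `enable_hang_diagnostics` and `disable_hang_diagnostics` are hypothetical, and the assumption is that the first one is called near the top of the training script, before the distributed process group is created. It turns on verbose NCCL and `torch.distributed` logging through the `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` environment variables, and uses the standard library `faulthandler` module to dump the stack of every thread at a fixed interval, so a stuck rank shows where it is waiting.

```python
import faulthandler
import os
import sys


def enable_hang_diagnostics(timeout_s=600, file=None):
    """Hypothetical helper (not part of Trainer or TTS).

    Call once per process, before the process group is initialised.
    """
    # Verbose logs from NCCL and torch.distributed. setdefault keeps any
    # value the user has already exported in the shell.
    os.environ.setdefault("NCCL_DEBUG", "INFO")
    os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

    # Every timeout_s seconds, write the traceback of all threads to `file`.
    # The dump happens whether or not training is progressing; when a rank
    # is stuck, consecutive dumps show the same frames.
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def disable_hang_diagnostics():
    """Stop the periodic traceback dumps."""
    faulthandler.cancel_dump_traceback_later()
```

With this in place, the same `python -m trainer.distribute ...` command from the report can be run unchanged. Comparing the dumps from gpu0 with those from the other ranks may show whether the hang is inside a collective call or in data loading.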

stale[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look our discussion channels.

coqui-ai / TTS