Open · NikitaKononov opened this issue 4 months ago
Tried num_workers=0, num_workers>0, MP_THREADS_NUM and so on; nothing helps. There is plenty of RAM and shared memory.
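In case it helps narrow this down, one way to see where a stuck rank is actually blocked is to dump its Python stack with py-spy (an external tool; the PID below is a placeholder for the hanging training process):

pip install py-spy
py-spy dump --pid <PID_of_hanging_rank>

A stack that ends inside DataLoader worker handling would point at the data pipeline, while a stack inside a torch.distributed collective would point at DDP/NCCL synchronization.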
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.
Describe the bug
Hello, training XTTSv2 with DDP leads to weird training lags: training gets stuck with no errors on 6x RTX A6000 GPUs with 512 GB of RAM.
Here is a GPU load monitoring graph: purple is gpu0, green is gpu1 (all the remaining GPUs behave like gpu1).
With 2 or 4 GPUs the situation remains the same.
I think there is some kind of error in Trainer or in the XTTS scripts. Maybe it is because my dataset is rather large: 2000 hours of a single language.
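Since the run hangs silently, it may also be worth setting PyTorch's distributed debugging variables before launching, so a stuck collective eventually raises an error (after the collective timeout) instead of hanging forever. These are standard torch.distributed/NCCL environment variables; I am assuming they propagate to the ranks spawned by trainer.distribute:

export TORCH_DISTRIBUTED_DEBUG=DETAIL
export NCCL_ASYNC_ERROR_HANDLING=1   # named TORCH_NCCL_ASYNC_ERROR_HANDLING on newer PyTorch
# then run the launch command from the To Reproduce section below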
To Reproduce
python -m trainer.distribute --script recipes/ljspeech/xtts_v2/train_gpt_xtts.py --gpus 0,1,2,3,4,5
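For reference, the same command with NCCL debug logging enabled (NCCL_DEBUG=INFO is a standard NCCL environment variable; it prints communicator setup and topology details that may help show where the ranks diverge):

NCCL_DEBUG=INFO python -m trainer.distribute --script recipes/ljspeech/xtts_v2/train_gpt_xtts.py --gpus 0,1,2,3,4,5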
Expected behavior
Training should not get stuck.
Logs
No response
Environment
Additional context
No response