Describe the bug
I'm trying to use your trainer as part of XTTS2 model fine-tuning. It works well when passing use_ddp=False and setting CUDA_VISIBLE_DEVICES to a single GPU, but when I change the variable to 0,1,2,3 and set use_ddp to True, it won't work; it just hangs.
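For reference, the difference between the working and the hanging run boils down to two settings. A minimal sketch, assuming use_ddp is forwarded to the trainer's TrainerArgs (which is what the webui's training code appears to pass; not verified against its exact code path):

```python
# Sketch of the two configurations; assumes `use_ddp` maps onto
# TrainerArgs from the `trainer` package (an assumption, not taken
# from the webui source).
import os

from trainer import TrainerArgs

# Working: one visible GPU, DDP off.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
args_single = TrainerArgs(use_ddp=False)

# Hanging: four visible GPUs, DDP on.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
args_multi = TrainerArgs(use_ddp=True)
```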
To Reproduce
https://github.com/daswer123/xtts-finetune-webui
1. Clone the repo and install it.
2. Set CUDA_VISIBLE_DEVICES to multiple GPUs and set use_ddp to True.
3. Run the webui.
Dataset processing was done in the same app, and training on 1 GPU works as intended.
Expected behavior
init_process_group should not hang; it should start multiple processes.
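To isolate whether the hang is in process-group initialization itself rather than in the trainer, here is a minimal standalone check. Nothing in it comes from the webui; it uses only stock torch.distributed:

```python
# minimal_ddp_check.py - standalone sketch to test whether NCCL
# process-group init works on this machine, independent of the trainer.
# Launch with: torchrun --nproc_per_node=4 minimal_ddp_check.py
import os

import torch
import torch.distributed as dist

if __name__ == "__main__":
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    # The call that appears to hang inside the trainer.
    dist.init_process_group(backend="nccl")
    # A trivial all-reduce to confirm the GPUs can actually talk.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)  # expect 4.0 on every rank with 4 processes
    print(f"rank {dist.get_rank()}: all_reduce -> {t.item()}")
    dist.destroy_process_group()
```

If this also hangs, the problem is in the container's NCCL/CUDA setup rather than in the trainer itself.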
Logs
2024-08-12 03:57:34,317 - TTS.tts.datasets - INFO - Found 2373 files in /app/xtts-finetune-webui/finetune_models/dataset
 > Training Environment:
 | > Backend: Torch
 | > Mixed precision: False
 | > Precision: float32
 | > Current device: 0
 | > Num. of GPUs: 4
 | > Num. of CPUs: 44
 | > Num. of Torch Threads: 1
 | > Torch seed: 1
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
 > Start Tensorboard: tensorboard --logdir=/app/xtts-finetune-webui/finetune_models/run/training/GPT_XTTS_FT-August-12-2024_03+57AM-abf3ed9
 > Using PyTorch DDP
And that's it; it just hangs. It doesn't even open a process.
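Nothing beyond "Using PyTorch DDP" is printed. One way to get more detail before the hang is to launch with PyTorch's and NCCL's standard debug variables set. A sketch, where xtts_demo.py is an assumed entry point (substitute whatever script actually starts the webui):

```python
# debug_launch.py - hedged sketch: run the webui with distributed
# debug logging enabled. Only standard PyTorch/NCCL env vars are used;
# "xtts_demo.py" is an assumed entry point, not confirmed from the repo.
import os
import subprocess

env = dict(os.environ)
env.update({
    "CUDA_VISIBLE_DEVICES": "0,1,2,3",
    "NCCL_DEBUG": "INFO",                # NCCL transport/setup logs
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL", # extra torch.distributed checks
    "TORCH_CPP_LOG_LEVEL": "INFO",       # surface c10d log messages
})
subprocess.run(["python", "xtts_demo.py"], env=env, check=True)
```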
Environment
- Trainer 0.0.36
- Torch 2.1.0 / 2.4.0 (2.1.0 built from source, 2.4.0 from pip)
- Debian Docker container on a RHEL host machine
- CUDA 11.8
- 4x V100 SXM2 16GB
Additional context
The same hang happens with the alltalk library, which uses your trainer as well.