coqui-ai / Trainer

🐸 - A general purpose model trainer, as flexible as it gets

[Bug] init_process_group hangs, not able to use DDP or Accelerate #148

Closed Eyalm321 closed 1 week ago

Eyalm321 commented 1 month ago

Describe the bug

I'm trying to use your trainer as part of XTTS2 model fine-tuning. It works well when passing use_ddp=False and setting CUDA_VISIBLE_DEVICES to a single GPU, but when I change the variable to 0,1,2,3 and set use_ddp to True, it won't work; it just hangs.
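Roughly the configuration that triggers it (a sketch, not the exact webui code; I'm assuming use_ddp is the TrainerArgs field and that Trainer/TrainerArgs come from the trainer package):

```python
# Sketch of the failing setup (assumed names; the real calls live in the webui).
import os

from trainer import Trainer, TrainerArgs

# Works: os.environ["CUDA_VISIBLE_DEVICES"] = "0" with use_ddp=False
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # hangs with this plus use_ddp=True

trainer_args = TrainerArgs(use_ddp=True)
# trainer = Trainer(trainer_args, config, output_path, model=model, ...)
# trainer.fit()  # never gets past init_process_group
```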

To Reproduce

https://github.com/daswer123/xtts-finetune-webui

Clone it, install it, set the environment to multiple GPUs, set use_ddp to True, and run the web UI. Dataset processing was done in the same app; training on 1 GPU works as intended.

Expected behavior

init_process_group should not hang and should start multiple processes.
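For reference, a minimal standalone check (my own sketch, assuming torchrun is available in the container) to tell whether the hang is in torch.distributed itself or in the Trainer:

```python
# ddp_sanity_check.py (hypothetical helper, not part of the Trainer)
# Run with: torchrun --nproc_per_node=4 ddp_sanity_check.py
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # This is the call that hangs inside the Trainer.
    dist.init_process_group(backend="nccl")

    # A tiny all_reduce proves the ranks can actually talk to each other.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```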

Logs

2024-08-12 03:57:34,317 - TTS.tts.datasets - INFO - Found 2373 files in /app/xtts-finetune-webui/finetune_models/dataset
 > Training Environment:
 | > Backend: Torch
 | > Mixed precision: False
 | > Precision: float32
 | > Current device: 0
 | > Num. of GPUs: 4
 | > Num. of CPUs: 44
 | > Num. of Torch Threads: 1
 | > Torch seed: 1
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
 > Start Tensorboard: tensorboard --logdir=/app/xtts-finetune-webui/finetune_models/run/training/GPT_XTTS_FT-August-12-2024_03+57AM-abf3ed9
 > Using PyTorch DDP

And that's it, it just hangs; it doesn't even spawn a process.
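The only extra signal I could think of getting is from the standard PyTorch/NCCL debug variables, set before launching (NCCL_P2P_DISABLE is just a guess in case GPU peer-to-peer is broken inside the container):

```python
# Hypothetical diagnostics to set before launching training.
import os

os.environ["NCCL_DEBUG"] = "INFO"                 # print NCCL init/transport details
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra c10d logging
os.environ["NCCL_P2P_DISABLE"] = "1"              # assumption: rule out broken P2P in the container
```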

Environment

- Trainer 0.0.36
- Torch 2.1.0 and 2.4.0 (2.1.0 built from source, 2.4.0 from pip)
- Debian Docker container on an RHEL host machine
- CUDA 11.8
- 4x V100 SXM2 16 GB

Additional context

No response

Eyalm321 commented 1 month ago

This also happens with the alltalk library, which uses your trainer as well.