Open divineSix opened 1 year ago
Not sure if you solved this yet (and I'm no expert in this!), but maybe you can try something like this:

```shell
python -m torch.distributed.launch --nproc_per_node 4 -m vall_e.train yaml=config/your_data/ar_or_nar.yml
```
Your batch size is 24. Did you try using a machine with more GPU memory (e.g. 48 GB or more)?
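For context, under data-parallel training the global batch is typically split evenly across the workers, so launching on 4 GPUs also reduces the per-GPU memory load. A minimal sketch of that arithmetic (the helper name is hypothetical, and this assumes the yml `batch_size` is the global batch; check the repo's config semantics):

```python
# Hypothetical sanity check: how a global batch of 24 splits across
# 4 GPUs under data-parallel training. Assumes batch_size in the yml
# is the global batch, not the per-GPU batch.
def per_gpu_batch(global_batch: int, n_gpus: int) -> int:
    if global_batch % n_gpus != 0:
        raise ValueError("global batch must divide evenly across GPUs")
    return global_batch // n_gpus

print(per_gpu_batch(24, 4))  # prints 6
```

If 6 samples per GPU still overflows memory, lowering the global batch size further (or moving to cards with more VRAM) is the usual next step.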
I'm trying to train the model on a subset of LibriTTS data. Once I've completed the quantization steps (step 3 in the README), training crashes because it runs out of GPU memory. I've attached the logs below.
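One common workaround for this kind of OOM is to lower the batch size in the training yml before retrying. A hypothetical excerpt, purely as a sketch (the field name must match this repo's actual config schema):

```yaml
# config/your_data/ar_or_nar.yml — hypothetical excerpt; verify the
# field name against the repo's config files before editing.
batch_size: 8   # reduced from the default to fit available GPU memory
```

If a smaller batch hurts training stability, gradient accumulation (if the config supports it) can restore the effective batch size without the memory cost.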
Right now I'm using the same command as in the readme.md. If there's a different command for running the training script so that it leverages multiple GPUs and distributed processing, please let me know so I can try it out.