CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Can't train on two GPU's #664

Open zxmanxz opened 3 years ago

zxmanxz commented 3 years ago

Hi, when I trained the synthesizer model on my laptop with one Nvidia 1650 GPU, everything worked. But when I ran the training on my server with two Nvidia GeForce 1080 Ti GPUs, I got an error:

`
python synthesizer_train.py pretrained_new datasets/SV2TTS/synthesizer -s 50 -b 50

Arguments:
    run_id:          pretrained_new
    syn_dir:         datasets/SV2TTS/synthesizer
    models_dir:      synthesizer/saved_models/
    save_every:      50
    backup_every:    50
    force_restart:   False
    hparams:

Checkpoint path: synthesizer/saved_models/pretrained_new/pretrained_new.pt
Loading training data from: datasets/SV2TTS/synthesizer/train.txt
Using model: Tacotron
Using device: cuda

Initialising Tacotron Model...

Trainable Parameters: 30.870M

Loading weights at synthesizer/saved_models/pretrained_new/pretrained_new.pt
Tacotron weights loaded from step 0
Using inputs from:
    datasets/SV2TTS/synthesizer/train.txt
    datasets/SV2TTS/synthesizer/mels
    datasets/SV2TTS/synthesizer/embeds
Found 259 samples
+----------------+------------+---------------+------------------+
| Steps with r=2 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   20k Steps    |     12     |     0.001     |        2         |
+----------------+------------+---------------+------------------+

Traceback (most recent call last):
  File "synthesizer_train.py", line 35, in <module>
    train(**vars(args))
  File "/home/roma/new_Real-Time-Voice-Cloning/Real-Time-Voice-Cloning/synthesizer/train.py", line 175, in train
    mels, embeds)
  File "/home/roma/new_Real-Time-Voice-Cloning/Real-Time-Voice-Cloning/synthesizer/utils/__init__.py", line 17, in data_parallel_workaround
    outputs = torch.nn.parallel.parallel_apply(replicas, inputs)
  File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/roma/new_Real-Time-Voice-Cloning/Real-Time-Voice-Cloning/synthesizer/models/tacotron.py", line 362, in forward
    device = next(self.parameters()).device  # use same device as parameters
StopIteration
`
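
A note on the last frame of the traceback: `next(self.parameters())` raises `StopIteration` whenever the parameter iterator yields nothing. One plausible explanation, not confirmed anywhere in this thread, is that the replicas created for multi-GPU execution expose an empty `parameters()` iterator in newer torch releases. The sketch below uses plain Python lists as stand-ins for parameters (the helper names are hypothetical and do not exist in the repo) to show the failure mode and a guarded alternative:

```python
def first_or_none(iterator):
    """Return the first item of an iterator, or None when it is empty."""
    return next(iterator, None)


def bare_next_raises(iterable):
    """Report whether an unguarded next() raises StopIteration."""
    try:
        next(iter(iterable))
    except StopIteration:
        return True
    return False


# Stand-ins for a module's parameters. The empty list plays the role of a
# replica whose parameter iterator yields nothing (an assumption about the cause).
empty_params = []
some_params = ["weight", "bias"]

print(bare_next_raises(empty_params))     # unguarded call on an empty iterator
print(first_or_none(iter(empty_params)))  # guarded call falls back to None
print(first_or_none(iter(some_params)))   # guarded call returns the first item
```

In a real model, a guard like this would still need a fallback device, for example the device of an input tensor; that part is left out here because it depends on the model code.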

ghost commented 3 years ago

My ability to help with this is limited, since I don't have a server with multiple GPUs to test.

Let's check whether data_parallel_workaround is actually required. In synthesizer/train.py, try copying the code from line 177 over to line 174. https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/10ca8f7c785707f21c78cfe858a97841c1d875ba/synthesizer/train.py#L172-L177

zxmanxz commented 3 years ago

And paste it where? I also tried setting CUDA_VISIBLE_DEVICES=0 (to expose only one GPU), but the problem was the same...

ghost commented 3 years ago

If you get an identical message on a single GPU, then something is wrong because it shouldn't be executing the multi-GPU code.

Why don't you try setting CUDA_VISIBLE_DEVICES inside synthesizer_train.py? (This file is in the root of the repo, unlike train.py) See https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/489#issuecomment-673358005 . Then run the original, unmodified code. Paste the error message if you get one.
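
A minimal sketch of that suggestion, assuming the variable is set at the very top of synthesizer_train.py, before torch initialises CUDA (in practice, before `import torch`). The helper name `restrict_visible_gpus` is hypothetical and not part of the repo:

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Set CUDA_VISIBLE_DEVICES from a list of integer GPU indices.

    Hypothetical helper. It only has an effect if it runs before CUDA is
    initialised, so call it before importing torch.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Expose only the first GPU for the single-GPU test.
restrict_visible_gpus([0])
# `import torch` and the rest of the script would follow here.
```

Setting the variable inside the script removes any doubt about whether a value exported in the shell actually reached the Python process.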

zxmanxz commented 3 years ago

It works fine with a single GPU. Maybe you can advise me on how to get full GPU usage? (Right now it is using only 4 GB, and the other 7 GB are free.)

ghost commented 3 years ago

To increase VRAM usage, adjust the batch size parameter (far right number) in hparams. https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/10ca8f7c785707f21c78cfe858a97841c1d875ba/synthesizer/hparams.py#L52-L57
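
As a rough illustration, assuming each schedule entry is a tuple laid out as (r, learning rate, max step, batch size), which matches the single row printed in the training log earlier in this thread. The helper `with_batch_size` and the value 24 are illustrative only:

```python
# One schedule entry mirroring the row shown in the training log:
# r=2, learning rate 0.001, 20k steps, batch size 12.
# The tuple layout is an assumption; check hparams.py for the real definition.
tts_schedule = [(2, 1e-3, 20_000, 12)]


def with_batch_size(schedule, batch_size):
    """Return a copy of the schedule with every entry's batch size replaced."""
    return [(r, lr, max_step, batch_size) for r, lr, max_step, _ in schedule]


# 24 is an arbitrary example; raise it until VRAM is close to full.
bigger = with_batch_size(tts_schedule, 24)
print(bigger)
```

In practice the same change is made by editing the far-right number of each entry in hparams.py directly.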

zxmanxz commented 3 years ago

Thank you. It would be great if there were a way to parallelize the computation across multiple GPUs.

ghost commented 3 years ago

@zxmanxz Try this branch for multi-GPU training. If it works I will submit a pull request. https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/664_multi_gpu_training

ghost commented 3 years ago

@zxmanxz Can you let me know if the multi-GPU branch above works for you?

zxmanxz commented 3 years ago

Yes, I'll try multi-GPU training later.

ghost commented 3 years ago

@zxmanxz When will you be able to test the multi-GPU training code? https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/664_multi_gpu_training

chayan-agrawal commented 3 years ago

@blue-fish With the code mentioned above, we get another error at line 110 of synthesizer/train.py: torch.nn.modules.module.ModuleAttributeError: 'DataParallel' object has no attribute 'load'

ghost commented 3 years ago

@chayan-agrawal Which version of torch are you using? I'm using torch==1.7.1 and don't get that error.

chayan-agrawal commented 3 years ago

@blue-fish I am also using torch==1.7.1. If I use model.module.load instead of model.load, it works on a single GPU. The other GPUs are not used.
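
The `model.module.load` change works because torch's DataParallel keeps the wrapped model in its `.module` attribute and does not forward custom methods such as `load`. A torch-free sketch of the pattern follows; `FakeTacotron`, `FakeDataParallel`, and `unwrap` are hypothetical stand-ins, not names from the repo or from torch:

```python
class FakeTacotron:
    """Hypothetical stand-in for the model; it only mimics a custom load()."""

    def load(self, path):
        return f"loaded {path}"


class FakeDataParallel:
    """Hypothetical stand-in for a wrapper that stores the model in .module
    and does not forward the model's custom methods."""

    def __init__(self, module):
        self.module = module


def unwrap(model):
    """Return the wrapped model if there is one, else the model itself."""
    return getattr(model, "module", model)


wrapped = FakeDataParallel(FakeTacotron())
plain = FakeTacotron()

# The same call works whether or not the model has been wrapped.
print(unwrap(wrapped).load("pretrained_new.pt"))
print(unwrap(plain).load("pretrained_new.pt"))
```

A helper like this would let the loading and saving code stay the same for wrapped and unwrapped models.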

ghost commented 3 years ago

@chayan-agrawal Thanks for suggesting that change. I have updated the code with your suggestion: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/a90d2340c0d0c416bcec4da089a2c9ce3e4ed7d4

DataParallel also works in CPU and single GPU environment, so it is not necessary to check for multiple GPUs. It would be nice to get feedback on whether it works for multiple GPUs.

ghost commented 3 years ago

Before we even think of merging this code, we'll need to consider these issues:

chayan-agrawal commented 3 years ago

@blue-fish I have multiple GPUs in my system, but training only uses a single GPU. Can you help me figure out how to use multiple GPUs?

ghost commented 3 years ago

@chayan-agrawal I don't have a multiple GPU environment to troubleshoot. All I can suggest is to ensure that Python sees both of your GPUs. For example, add this to the beginning of synthesizer_train.py to have it use the first and second GPUs.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

ghost commented 3 years ago

You might have to downgrade to torch==1.4.0 to get DataParallel to work.

Synergyst commented 3 years ago

> You might have to downgrade to torch==1.4.0 to get DataParallel to work.

Hey, I'm having issues with this as well. I suspect it's something simple that I'm overlooking, and that it will be an easy fix if you are able to get it working. :)

The original repo worked fine with one GPU using Torch 1.6.0 before I installed a second GPU to speed up training. I am now using torch 1.7.1, which you said was working for you. Torch 1.4.0 does not run at all.

I have tried both the CorentinJ repo and your blue-fish fork (the multi-GPU branch you suggested). The main repo does not run with Torch 1.4.0, 1.6.0, or 1.7.1 unless I remove the second GPU from the system. Your branch does work with the environment override added and Torch 1.7.1; however, it does not actually use the second GPU.

Is there a requirements.txt that you can provide for testing? Perhaps I have some other library installed that breaks this functionality? I'm grasping at straws at this point. I have been working on it for days, to no avail. I didn't want to post here until I felt that I needed assistance.

Kind regards.

fede-astolfi commented 3 years ago

I am having the exact same problem. Has anyone solved it?

linan06kuaishou commented 2 years ago

> You might have to downgrade to torch==1.4.0 to get DataParallel to work.

As Synergyst mentioned, torch 1.4 doesn't work. The error I got is: "AttributeError: 'PosixPath' object has no attribute 'tell'". I googled it and found that solving it requires a torch version above 1.6. Awkward face...