NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

Cannot train with multiple GPUs #13

Closed Yablon closed 4 years ago

Yablon commented 4 years ago

I cloned the repository to my local server, then started training on my own dataset.

I can run with one GPU, and the logs are as follows:

FP16 Run: False
Dynamic Loss Scaling: False
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
Epoch: 0
/home/yablon/mellotron/yin.py:44: RuntimeWarning: invalid value encountered in true_divide
  cmndf = df[1:] * range(1, N) / np.cumsum(df[1:]).astype(float) #scipy method
Train loss 0 18.868097 Grad Norm 6.209010 19.63s/it
Validation loss 0: 63.929592
Saving model and optimizer state at iteration 0 to /home/yablon/training/mellotron/output/checkpoint_0
Train loss 1 39.906715 Grad Norm 18.103324 3.63s/it

But when I run with multiple GPUs, life becomes difficult for me.

The first problem is an "apply_gradient_allreduce is not defined" error. OK, that's easy to fix; I just import it from distributed.
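Concretely, the one-line fix is just adding the missing import near the top of train.py (assuming apply_gradient_allreduce is defined in distributed.py, which is where I imported it from):

# Missing import that causes the "apply_gradient_allreduce is not defined" error
from distributed import apply_gradient_allreduce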

The next problem is that training seems to stop at "Done initializing distributed"; no further logs are printed.

Can you fix this? Thank you!

rafaelvalle commented 4 years ago

Pull from master and try again with FP16 enabled and disabled.

Yablon commented 4 years ago

Hi, rafaelvalle. I tried, and it seems to be stuck here for a long time. I changed nothing in the hparams except setting "fp16_run" and "distributed_run" to True.

FP16 Run: True
Dynamic Loss Scaling: False
Distributed Run: True
cuDNN Enabled: True
cuDNN Benchmark: False
Initializing Distributed
Done initializing distributed
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic

rafaelvalle commented 4 years ago

Try with fp16_run=False

n5-suzuki commented 4 years ago

Hi, rafaelvalle. I also got the same error. I copied the newest code, set distributed_run=True in hparams.py, and then executed the command below:

python train.py -o out_dir -l logdir -g

After a few minutes, the log below appeared and the process seemed to stop.

FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: True
cuDNN Enabled: True
cuDNN Benchmark: False
Initializing Distributed
Done initializing distributed

I checked my network status with netstat -atno and found "localhost:54321 LISTEN" and "localhost => localhost:54321", but the process seems to be stuck...
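For context, here is a minimal sketch of the kind of TCP-based initialization these repos use: every rank must call init_process_group with the same address and world_size, and the call blocks until all ranks have joined, which matches the hang described above. The port and world size below are illustrative assumptions, not values taken from the mellotron code.

# Minimal sketch with assumed values: init_process_group blocks until
# world_size processes have joined the same TCP rendezvous, so a single
# lone process appears to stop right after "Initializing Distributed".
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",                       # NCCL backend for multi-GPU training
    init_method="tcp://localhost:54321",  # rendezvous address (assumed port)
    world_size=2,                         # total number of GPU processes expected
    rank=0,                               # unique index for this process
)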

pneumoman commented 4 years ago

@n5-suzuki: for multi-GPU you should be running multiproc:

python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
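For anyone wondering what the launcher does, below is a rough sketch of a multiproc-style wrapper: it starts one train.py process per visible GPU and gives each one a distinct rank. Treat it as an illustration under my own assumptions (the --rank flag and straight argument forwarding), not the exact multiproc.py shipped with the repo.

# Rough sketch of a multiproc-style launcher (assumptions noted above):
# spawn one training process per visible GPU, each with a unique rank.
import subprocess
import sys

import torch

num_gpus = torch.cuda.device_count()
argv = sys.argv[1:]                        # forward the user's train.py arguments
workers = []
for rank in range(num_gpus):
    cmd = [sys.executable, "train.py"] + argv + ["--rank", str(rank)]
    workers.append(subprocess.Popen(cmd))

for worker in workers:
    worker.wait()                          # block until every per-GPU worker exits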

aijianiula0601 commented 4 years ago

I got the same error. It's the same problem with tacotron-pytorch. So sad!

Yablon commented 4 years ago

I think we can learn from this project and study how it synthesizes music, rather than running it directly. So I am manually closing this for lack of activity. If anybody has a solution, feel free to reopen and share it below.