TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German; easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0
3.8k stars, 810 forks

Tacotron2 MultiGPU training is slower than Single GPU #731

Closed abdullah-saal closed 2 years ago

abdullah-saal commented 2 years ago

The GPUs are NVIDIA GeForce GTX 1080 Ti. Tried with 4 GPUs, got around 5-6s/it.

Tried 1 GPU, got around 4s/it, which is still slow. Tried the recommended options like static_shapes and max_mel_length from the discussions, but didn't get any speed-up.

Any idea what might be wrong?

abaddon-moriarty commented 2 years ago

We had the same issue.

1-GPU training was at 2.5s/it for Tacotron2 and 3it/s for HiFi-GAN; multi-GPU was at 5s/it and 3s/it respectively.

We finished training on a single GPU.

You might want to read this issue: https://github.com/TensorSpeech/TensorFlowTTS/issues/723#issuecomment-1000360082. dathudeptrai gives at least an explanation of why multi-GPU training can be slower :)

Hope this helps

abaddon-moriarty commented 2 years ago

Did you manage to find a solution to your problem?

abdullah-saal commented 2 years ago

Nope, training with a single GPU :(

ZDisket commented 2 years ago

In terms of seconds per iteration, multi-GPU training is slower because of the communication overhead between GPUs. The trick to maximize throughput is to increase the batch size per GPU, which doesn't affect per-step speed much but greatly decreases the time per epoch. In one training run I had with 4x V100s at a batch size of 64 per GPU (256 total), each iteration went through 8x as much of the dataset as a single V100 at batch size 32, but the s/it was only 2.32x higher (7.1s/it vs 3.06s/it), resulting in a modest 3.4x net speed increase. You'll also want to adjust the LR scheduler to avoid overfitting, since each iteration is worth 8x more: for example, a model trained at batch size 256 for 2,000 steps is as if you trained the same model for 16k steps at batch size 32. I've tested it and it works like that.
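The arithmetic above can be checked with a quick back-of-the-envelope script. The timing numbers come from the comment; the throughput formula itself is just the obvious ratio, not anything specific to TensorFlowTTS:

```python
# Sanity-check the multi-GPU throughput numbers from the comment above.
single_bs, multi_bs = 32, 256        # batch size: 1x V100 vs 4x V100 (64 per GPU)
single_s_it, multi_s_it = 3.06, 7.1  # seconds per iteration, measured

data_per_it = multi_bs / single_bs   # each multi-GPU step sees 8x the samples
slowdown = multi_s_it / single_s_it  # each step takes ~2.32x longer
speedup = data_per_it / slowdown     # net samples/sec gain: ~3.4x

# Step-count equivalence for the LR schedule: 2,000 steps at bs=256
# covers the same amount of data as 16,000 steps at bs=32.
equivalent_steps = 2000 * data_per_it

print(f"{slowdown:.2f}x slower per step, {speedup:.2f}x net speedup")
print(f"2000 steps at bs=256 ≈ {equivalent_steps:.0f} steps at bs=32")
```

This is why scaling the scheduler matters: the scheduler counts optimizer steps, not samples seen, so decay milestones tuned for bs=32 fire 8x too late (in data terms) at bs=256 unless they are divided accordingly.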

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.