TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Bad GPU performance training on dual 3090s [Windows] #714

Closed. basbrs closed this issue 2 years ago.

basbrs commented 2 years ago

I'm trying to train a slightly modified version of Tacotron2. When training on PC 1 (one GPU: RTX 3080, 10 GB VRAM), I can compute about one iteration every 4 seconds according to the tqdm training bar. This uses all of my VRAM and only works when I reduce the batch size parameter in the config yaml to 1. When training on PC 2 (dual RTX 3090, 24 GB VRAM each), however, I only manage one iteration every 5 seconds at best.

I cannot use the default TF implementation of MirroredStrategy() (in strategy.py), because I'm using Win10 on both machines and the default cross_device_ops uses NCCL, which doesn't work on Windows [see here]. I tried different approaches (collected in the sketch after this list):

- tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()) and tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.ReductionToOneDevice()) both work, but are painfully slow (8 s per iteration and worse).
- tf.distribute.MultiWorkerMirroredStrategy with the RING communication implementation also works and is a little quicker (~5 s/iteration), but still does not come close to the speed of the weaker 3080.
- Running on just one card (tf.distribute.OneDeviceStrategy(device="/gpu:1")) improves speed to about 4.7 s/iteration.
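For reference, here is roughly how I'm constructing each strategy — a minimal sketch, assuming TF 2.4 or later (where these APIs exist):

```python
import tensorflow as tf

# MirroredStrategy with a Windows-compatible all-reduce instead of NCCL.
# Works, but painfully slow (~8 s/iteration and worse for me):
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)

# Same idea, but reducing on a single device. Also works, similarly slow:
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice()
)

# MultiWorkerMirroredStrategy with ring-based communication.
# A little quicker (~5 s/iteration):
strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=tf.distribute.experimental.CommunicationOptions(
        implementation=tf.distribute.experimental.CommunicationImplementation.RING
    )
)

# Single-GPU baseline on the second 3090 (~4.7 s/iteration):
strategy = tf.distribute.OneDeviceStrategy(device="/gpu:1")
```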

Another weird thing is that training slows down when I increase the batch size on the 3090s: with a batch size of 8, one iteration takes about 6 seconds, whereas it only takes 5 and 4.7 seconds with batch sizes of 4 and 1 respectively.
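For context, my understanding (an assumption on my part, not something I've verified in the trainer code) is that with a distributed dataset the configured batch size is the global batch, so it gets split across the two replicas:

```python
import tensorflow as tf

# Hypothetical illustration: under MirroredStrategy the dataset batch is
# the *global* batch size, divided across replicas. So batch_size=8 on a
# dual-GPU machine would mean each 3090 only sees 4 samples per step.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)
global_batch_size = 8
per_replica_batch = global_batch_size // strategy.num_replicas_in_sync
print(per_replica_batch)  # -> 4 with two GPUs in sync
```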

A third problem is with mixed precision: as I understand it, using it should reduce VRAM usage and speed up training (very optimistically by a factor of 2). On the 3080, however, it runs out of memory (whereas training runs fine without MP), and on the 3090s it slows training by about 50% without reducing VRAM usage at all...
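For completeness, this is how I understand mixed precision is enabled — a minimal sketch of the standard Keras API, which may differ from what the repo's training scripts actually do internally:

```python
import tensorflow as tf

# Set the global policy *before* the model is built, so layers compute in
# float16 while keeping float32 variables.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Keras applies loss scaling automatically under this policy when using
# model.fit; a custom training loop has to wrap the optimizer explicitly
# to avoid float16 gradient underflow:
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
    tf.keras.optimizers.Adam(learning_rate=1e-3)
)
```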

Utilization of either 3090 never exceeds ~15% (measured in Task Manager) or 20-30% (measured in nvidia-smi).

Hope someone can help, thanks :)

dathudeptrai commented 2 years ago

@basbrs We have never tested the code on Windows, so we don't know how to fix those problems. About mixed precision: the current Tacotron2 implementation can't speed up training with it (we don't know why; this problem has existed for a long time).

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.