CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Slow Training GPU RTX 2080 #700

Closed MGSousa closed 2 years ago

MGSousa commented 3 years ago

Hi, I am training on a Portuguese dataset, but the process is very slow using CUDA with the default hparams.

In this test I'm using batch_size = 8.

(screenshot of the training output)

Expected: it should run 2~3 times faster.

Has anyone else had this problem on Windows with Anaconda?

ghost commented 3 years ago

Try without Anaconda and see if the training speed improves.

Although this repo works with Anaconda, we don't have enough developer interest to support it.

ghost commented 3 years ago

Will reopen issue if slow training speed is confirmed with a normal Python installation.

ghost commented 3 years ago

@MGSousa @StElysse Thanks for your observations in #711 which confirm the issue. Reopening this, no idea where to start since I don't have Windows to test.

ghost commented 3 years ago

Do you have cuDNN installed?

StElysse commented 3 years ago

I have version 10.0 of cuDNN installed, yes.

MGSousa commented 3 years ago

cuDNN 8.1 for CUDA v10.2.

justinjohn0306 commented 3 years ago

> cuDNN 8.1 for CUDA v10.2.

use cuDNN 7.5

justinjohn0306 commented 3 years ago

https://developer.nvidia.com/rdp/cudnn-archive

MGSousa commented 3 years ago

> cuDNN 8.1 for CUDA v10.2.
>
> use cuDNN 7.5

Same result with r=2 ~42 steps

Rainer2465 commented 3 years ago

@MGSousa did it work out for you?

MGSousa commented 3 years ago

@Rainer2465 I changed outputs/step to four (r=4, ~91) and got somewhat decent results for now. But with r=2 the results remain the same (42~43 steps/s).

Rainer2465 commented 3 years ago

@MGSousa you're Brazilian, right? Can we talk a little more about this? If you're Brazilian, maybe we can find some way to discuss it. I'd like to know more about the voice dataset used for training.

Ca-ressemble-a-du-fake commented 2 years ago

NVIDIA provides a benchmark for Tacotron 2. Does it make sense to run it and compare the results with what we get from Real-Time-Voice-Cloning during training, or would that be comparing apples to oranges? The goal is to check whether the poor performance comes from the drivers/CUDA, from the code, or from Anaconda.

Ca-ressemble-a-du-fake commented 2 years ago

Experiment: so far (without running the benchmark) I have noticed that the GPU power consumption is very low during training. On my RTX 3070 the power draw (nvidia-smi -q -d POWER | grep Draw) shows around 60W, compared to 130W while mining or 100W while playing a game. So I doubt the GPU is being used at its full capability.

Expected behaviour: if the power limit is set to 130W, I would expect the power draw to reach that limit while training, since many users report 100% utilization of their GPU during training. The behaviour should be similar to mining.

Ca-ressemble-a-du-fake commented 2 years ago

The command nvidia-smi -q -g 0 -d UTILIZATION -l 1 shows GPU usage varying between 20 and 50% during training, so the GPU is far from fully utilized when it should be. It looks like there is a large performance margin to recover somewhere.

I would like to use nvtop to get a graph, but it is not compatible with the latest NVIDIA drivers. If I force the install, there is a driver/library mismatch and training runs even slower (less than 0.1 steps/s).
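If nvtop is not an option, a rough alternative is to poll nvidia-smi's query interface and log utilization and power over time. A minimal sketch (the fields and the one-second interval are just an example, not something the repo provides):

```python
# Sketch: log GPU utilization, power draw and memory once per second via nvidia-smi.
# Stop with Ctrl-C; adjust FIELDS / the sleep interval as needed.
import subprocess, time

FIELDS = "utilization.gpu,power.draw,memory.used"

while True:
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(time.strftime("%H:%M:%S"), out)
    time.sleep(1)
```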

ghost commented 2 years ago

The repo includes a profiler which is used for encoder training. You can try something similar to find the bottleneck for synthesizer training.
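As a rough illustration of what that could look like, here is a self-contained sketch that times the phases of a training step with the Profiler class from utils/profiler.py, assuming it behaves as it does in encoder/train.py. The Linear model and random batches are stand-ins; in practice the tick() calls would wrap the corresponding lines of synthesizer/train.py:

```python
# Sketch only: time each phase of a training step with the encoder-style Profiler.
# The model and data below are stand-ins for the synthesizer and its dataloader.
import torch
from utils.profiler import Profiler  # utility used by encoder/train.py

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(80, 80).to(device)        # stand-in for the Tacotron synthesizer
optimizer = torch.optim.Adam(model.parameters())
loader = [(torch.randn(12, 80), torch.randn(12, 80)) for _ in range(20)]  # stand-in batches

profiler = Profiler(summarize_every=10, disabled=False)
for x, y in loader:
    profiler.tick("Blocking, waiting for batch (threaded)")
    x, y = x.to(device), y.to(device)
    profiler.tick("Data to cuda")
    y_hat = model(x)
    profiler.tick("Forward pass")
    loss = torch.nn.functional.mse_loss(y_hat, y)
    profiler.tick("Loss")
    optimizer.zero_grad()
    loss.backward()
    profiler.tick("Backward pass")
    optimizer.step()
    profiler.tick("Parameter update")
```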

ghost commented 2 years ago

> Same problem here too, on Ubuntu 20.04 without Anaconda and on an RTX 3070 under Python 3.8.8. Around 0.52-0.54 steps/s.

> Originally posted by @Ca-ressemble-a-du-fake in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/711#issuecomment-960483401

With the whole repo in an unmodified state (including synthesizer hparams), I get 0.72-0.74 steps/s.

Switching to Tacotron2 on Tensorflow 1.x (5425557, installed with these instructions), I get 1.10-1.14 steps/s. Tensorflow training speed is variable, with the training getting faster as the model gets better. This experiment was done with the #538 model as a starting point, finetuning on a large single-speaker dataset.

Because I wanted to understand whether this difference was caused by using Taco1 vs Taco2, or PT vs TF, I also ran a training experiment with a PyTorch Taco2 implementation. Results below.

| Model | PT/TF | Model Parameters | Training Speed |
| --- | --- | --- | --- |
| Tacotron 1 | PyTorch 1.3.1 | 30.87M | 0.72-0.74 steps/s |
| Tacotron 2 | PyTorch 1.3.1 | 28.44M | 0.65-0.66 steps/s |
| Tacotron 2 | Tensorflow 1.15 | 28.44M | 1.10-1.14 steps/s |

We can conclude that PyTorch trains more slowly than Tensorflow 1.x, and that some other unknown issue is making your training slower still.

* OS: Ubuntu 20.04
    * NVIDIA Driver Version: 460.91.03
    * CUDA Version: 11.2 

* GPU: GTX 1660S (desktop)
    * Performance state: P2
    * Power draw: 90W (range: 70-110W)
    * Utilization: 80% (range: 65-95%)
    * VRAM: 5.5/6.0 GB

* Python 3.7.9 (with Anaconda)
* Same training speed obtained for Python 3.8.10 (without Anaconda) with torch==1.7.1

* Dataset storage: HDD

Ca-ressemble-a-du-fake commented 2 years ago

This is a nice benchmark. Which command did you use for GPU utilization? And which dataset did you use? Should an RTX 3070 be faster than a GTX 1660S on the same dataset, or is the rate independent of the dataset (my rate was measured on a French dataset)?

So can we say that around 1 step/s should be OK for r = 2 and batch_size = 12? Should the results be the same using vanilla Tacotron 2 training on the same dataset?

I only know basic Python (e.g. basic string manipulation, basic external program calls, ...). How do you use the profiler you mentioned earlier? I did not find any reference to it in encoder_train.py; should I add a call in synthesizer_train.py?

ghost commented 2 years ago

> Which command did you use for GPU utilization?

watch -n 0.5 nvidia-smi

> How do you use the profiler you talked about earlier?

I made a branch for this: https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/700_slow_training

If you don't want to get the branch, make these modifications: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/989a3e43a834100ae76ac6ff7edd3e5d532ced80

Example training output with profiler:

{| Epoch: 1/8 (20/2564) | Loss: 6.107 | 0.75 steps/s | Step: 0k | }
Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:    0ms   std:    0ms
  Data to cuda (10/10):                            mean:    1ms   std:    0ms
  Forward pass (10/10):                            mean:  466ms   std:   98ms
  Loss (10/10):                                    mean:   12ms   std:    2ms
  Backward pass (10/10):                           mean:  780ms   std:  153ms
  Parameter update (10/10):                        mean:   16ms   std:    0ms
  Extras (visualizations, saving) (10/10):         mean:    4ms   std:    0ms
Ca-ressemble-a-du-fake commented 2 years ago

Thanks a lot for these precise instructions! My forward and backward passes are higher than what you showed in your comment two days ago. On the other hand, if your comment just above refers to a GTX 1660S, then my profiler results look good, since an RTX 3070 should be faster.

{| Epoch: 1/61 (30/748) | Loss: 0.3490 | 1.2 steps/s | Step: 115k | }

Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:    0ms   std:    0ms
  Data to cuda (10/10):                            mean:    1ms   std:    0ms
  Forward pass (10/10):                            mean:  292ms   std:   64ms
  Loss (10/10):                                    mean:    5ms   std:    1ms
  Backward pass (10/10):                           mean:  454ms   std:  101ms
  Parameter update (10/10):                        mean:   28ms   std:    1ms
  Extras (visualizations, saving) (10/10):         mean:    0ms   std:    0ms

(Sorry, I did not succeed in formatting the table as you did!)

ghost commented 2 years ago

I thought your training rate was 0.52-0.54 steps/s; did something change? Your profiler results look reasonable to me.

Ca-ressemble-a-du-fake commented 2 years ago

It was indeed (for r = 2), but then I applied gradual training: r = 16 until 20k steps, then 8 until 40k, then 7 until 80k, then 5 until 160k, and finally 2 until the end. Batch size is kept constant at 12. Does it still look reasonable to you even though r has increased? In 10k steps r will switch back to 2, so I'll post the profiler results then for comparison.
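For reference, the schedule described above would look roughly like this in the (r, lr, max_step, batch_size) tuple format that synthesizer/hparams.py uses for tts_schedule; the learning rates and the final step count below are placeholders, not the values actually used:

```python
# Illustration only: the gradual schedule described above, written as a tts_schedule.
# Learning rates and the last max_step are placeholders.
tts_schedule = [(16, 1e-3,  20_000, 12),
                (8,  1e-3,  40_000, 12),
                (7,  1e-3,  80_000, 12),
                (5,  1e-3, 160_000, 12),
                (2,  1e-4, 500_000, 12)]  # r=2 "until the end"
```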

GPU utilization shows roughly 40% (range 20-60%) and memory usage is 5.7/7.8 GB (mainly used by python3, 5.3 GB), so there may be room for improvement. Maybe by increasing the batch size?

Ca-ressemble-a-du-fake commented 2 years ago

Now that r has decreased to 2 with a batch size of 12, the training rate is back to 0.52-0.54 steps/s.

The backward pass takes around 1 s.

{| Epoch: 2/208 (272/748) | Loss: 0.3196 | 0.56 steps/s | Step: 166k | }

Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:    0ms   std:    0ms
  Data to cuda (10/10):                            mean:    1ms   std:    0ms
  Forward pass (10/10):                            mean:  682ms   std:  169ms
  Loss (10/10):                                    mean:   13ms   std:    4ms
  Backward pass (10/10):                           mean: 1060ms   std:  259ms
  Parameter update (10/10):                        mean:   28ms   std:    1ms
  Extras (visualizations, saving) (10/10):         mean:    0ms   std:    0ms

GPU utilization is still around 40% (range 20-60%) and memory usage has increased to 6.7 GB (for the training process alone).

ghost commented 2 years ago

What is the GPU performance state reported by nvidia-smi?

Ca-ressemble-a-du-fake commented 2 years ago

The nvidia-smi command reports performance state P2. GPU utilization dropped a little, sometimes falling as low as 0%, but it stays between 20-50% most of the time.

ghost commented 2 years ago

If the GPU is running at P2, it doesn't seem to be the bottleneck. I am running out of ideas.

Ca-ressemble-a-du-fake commented 2 years ago

The NVIDIA driver is 470 (I tried several; this is the proprietary one) and CUDA is now 11.5 (I also tried 11.3). The CPU does not seem to be the bottleneck either, since its utilization stays around 50%; it's an old i7 2600.

Ca-ressemble-a-du-fake commented 2 years ago

I tried to run PyTorch's bottleneck utility as advised on the PyTorch forum, but I could not modify train.py so that the profiler terminates after a given amount of time without messing up the code. I tried manually setting max_step to 254010, because training was at step 254k and I wanted it to run for 10 more steps before exiting, but that made the loss increase, so I reverted my changes.

Ca-ressemble-a-du-fake commented 2 years ago

I also tried increasing batch_size to 20 and quickly got a CUDA out-of-memory error that suggests a remedy:

  RuntimeError: CUDA out of memory. Tried to allocate 210.00 MiB (GPU 0; 7.79 GiB total capacity; 5.01 GiB already allocated; 166.25 MiB free; 5.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It looks like reserved memory > allocated memory (but not >>), so should I follow that advice?
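For what it's worth, the option mentioned in that message is set through an environment variable before PyTorch initializes CUDA. A minimal sketch, with 128 MiB chosen purely as an example value:

```python
# Sketch: enable the max_split_size_mb allocator option referenced in the OOM message.
# Set the variable before importing torch (or export it in the shell before launching
# synthesizer_train.py); the 128 MiB value is only an illustration.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import after setting the variable
print(torch.cuda.is_available())  # training would proceed as usual from here
```

Since your reserved and allocated values are close rather than far apart, it may not help much; lowering batch_size remains the more reliable workaround.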

Ca-ressemble-a-du-fake commented 2 years ago

Increasing batch_size to 16 did not cause a CUDA out-of-memory error, but it also did not noticeably increase GPU utilization. VRAM usage increased to 6.9 GB.

ghost commented 2 years ago

> I tried to run PyTorch's bottleneck utility as advised on the PyTorch forum, but I could not modify train.py so that the profiler terminates after a given amount of time without messing up the code.

Training schedule update

First, update the training schedule in synthesizer/hparams.py so it only runs 20 steps from scratch.

        ### Tacotron Training
        tts_schedule = [(2,  1e-3,  20,  12)],  # Train only 20 steps for benchmarking

Dataloader update

Next, you will need to change this line so that num_workers=0, because the profiler doesn't work with multi-worker dataloaders: https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/7432046efc23cabf176f9fdc8d2fd67020059478/synthesizer/train.py#L150
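In other words, whatever arguments the DataLoader in synthesizer/train.py already receives stay the same and only num_workers changes. A toy, self-contained sketch of the idea (the dataset below is a stand-in, not the repo's):

```python
# Toy sketch: num_workers=0 keeps batch loading in the main process, so that
# torch.utils.bottleneck / cProfile can attribute its cost to the training run.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 80))  # stand-in for the synthesizer dataset
data_loader = DataLoader(dataset,
                         batch_size=12,
                         shuffle=True,
                         num_workers=0)       # 0 = load in the main process

for (batch,) in data_loader:
    pass  # the training step would go here
```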

Command

Train a new model from scratch. You can call it anything; I called mine "test". The --force_restart option prevents an existing saved checkpoint from ending training prematurely.

python -m torch.utils.bottleneck synthesizer_train.py --force_restart test datasets_root/SV2TTS/synthesizer/

Output

Profiler output:

```
--------------------------------------------------------------------------------
  Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.7.1 DEBUG compiled w/ CUDA 10.2
Running with Python 3.8 and `pip3 list` truncated output:
numpy==1.19.4
torch==1.7.1
torchfile==0.1.0

--------------------------------------------------------------------------------
  cProfile output
--------------------------------------------------------------------------------
6085394 function calls (5892177 primitive calls) in 43.148 seconds

Ordered by: internal time
List reduced from 7179 to 15 due to restriction <15>

ncalls     tottime  percall  cumtime  percall  filename:lineno(function)
20         15.886   0.794    15.886   0.794    {method 'run_backward' of 'torch._C._EngineBase' objects}
2992        8.177   0.003     8.177   0.003    {method 'read' of '_io.BufferedReader' objects}
552         3.517   0.006     3.518   0.006    {built-in method io.open}
5250        1.717   0.000     1.717   0.000    {method 'to' of 'torch._C._TensorBase' objects}
31000       1.018   0.000     1.018   0.000    {built-in method addmm}
12400       0.934   0.000     0.934   0.000    {built-in method lstm_cell}
6200        0.803   0.000     9.254   0.001    synthesizer/models/tacotron.py:270(forward)
480         0.709   0.001     0.710   0.001    {built-in method numpy.fromfile}
40          0.626   0.016     0.626   0.016    {built-in method gru}
19020       0.600   0.000     0.600   0.000    {method 'matmul' of 'torch._C._TensorBase' objects}
12400       0.494   0.000     1.292   0.000    synthesizer/models/tacotron.py:265(zoneout)
6200        0.491   0.000     2.612   0.000    synthesizer/models/tacotron.py:221(forward)
94620/20    0.463   0.000    10.396   0.520    venv/lib/python3.8/site-packages/torch/nn/modules/module.py:715(_call_impl)
6480        0.448   0.000     0.448   0.000    {built-in method conv1d}
18720       0.419   0.000     0.419   0.000    {built-in method cat}

--------------------------------------------------------------------------------
  autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total

Name                  Self CPU %  Self CPU   CPU total %  CPU total  CPU time avg  Self CUDA  Self CUDA %  CUDA total  CUDA time avg  # of Calls
TBackward             18.46%      556.886ms  18.46%       556.886ms  556.886ms     0.000us    NaN          0.000us     0.000us        1
TBackward              9.39%      283.453ms   9.39%       283.453ms  283.453ms     0.000us    NaN          0.000us     0.000us        1
aten::t                9.39%      283.451ms   9.39%       283.451ms  283.451ms     0.000us    NaN          0.000us     0.000us        1
aten::conv1d           9.36%      282.482ms   9.36%       282.482ms  282.482ms     0.000us    NaN          0.000us     0.000us        1
aten::convolution      9.36%      282.480ms   9.36%       282.480ms  282.480ms     0.000us    NaN          0.000us     0.000us        1
aten::_convolution     9.36%      282.478ms   9.36%       282.478ms  282.478ms     0.000us    NaN          0.000us     0.000us        1
aten::add_             9.36%      282.389ms   9.36%       282.389ms  282.389ms     0.000us    NaN          0.000us     0.000us        1
TBackward              4.73%      142.740ms   4.73%       142.740ms  142.740ms     0.000us    NaN          0.000us     0.000us        1
aten::t                4.73%      142.737ms   4.73%       142.737ms  142.737ms     0.000us    NaN          0.000us     0.000us        1
aten::transpose        4.65%      140.276ms   4.65%       140.276ms  140.276ms     0.000us    NaN          0.000us     0.000us        1
BmmBackward0           2.35%       70.930ms   2.35%        70.930ms   70.930ms     0.000us    NaN          0.000us     0.000us        1
aten::bmm              2.35%       70.904ms   2.35%        70.904ms   70.904ms     0.000us    NaN          0.000us     0.000us        1
aten::to               2.33%       70.235ms   2.33%        70.235ms   70.235ms     0.000us    NaN          0.000us     0.000us        1
aten::empty_strided    2.33%       70.190ms   2.33%        70.190ms   70.190ms     0.000us    NaN          0.000us     0.000us        1
aten::gru              1.84%       55.546ms   1.84%        55.546ms   55.546ms     0.000us    NaN          0.000us     0.000us        1

Self CPU time total: 3.017s
CUDA time total: 0.000us

--------------------------------------------------------------------------------
  autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.

Name                      Self CPU %  Self CPU   CPU total %  CPU total  CPU time avg  Self CUDA  Self CUDA %  CUDA total  CUDA time avg  # of Calls
AddmmBackward             21.06%      553.945ms  21.06%       553.945ms  553.945ms     553.600ms  27.44%       553.600ms   553.600ms      1
aten::mm                  21.06%      553.915ms  21.06%       553.915ms  553.915ms     553.572ms  27.44%       553.572ms   553.572ms      1
UnsqueezeBackward0        10.92%      287.145ms  10.92%       287.145ms  287.145ms     286.460ms  14.20%       286.460ms   286.460ms      1
aten::squeeze             10.92%      287.135ms  10.92%       287.135ms  287.135ms     0.000us     0.00%       0.000us     0.000us        1
aten::addmm                6.06%      159.393ms   6.06%       159.393ms  159.393ms     159.392ms   7.90%       159.392ms   159.392ms      1
MulBackward0               5.76%      151.387ms   5.76%       151.387ms  151.387ms     150.768ms   7.47%       150.768ms   150.768ms      1
aten::dropout              3.20%       84.220ms   3.20%        84.220ms   84.220ms      84.218ms   4.17%        84.218ms    84.218ms      1
aten::_fused_dropout       3.20%       84.213ms   3.20%        84.213ms   84.213ms      84.214ms   4.17%        84.214ms    84.214ms      1
aten::stride               3.20%       84.148ms   3.20%        84.148ms   84.148ms      0.000us    0.00%        0.000us     0.000us       1
CudnnConvolutionBackward   2.90%       76.257ms   2.90%        76.257ms   76.257ms      70.970ms   3.52%        70.970ms    70.970ms      1
AddBackward0               2.59%       68.060ms   2.59%        68.060ms   68.060ms      68.057ms   3.37%        68.057ms    68.057ms      1
CatBackward                2.34%       61.523ms   2.34%        61.523ms   61.523ms       1.808ms   0.09%         1.808ms     1.808ms      1
CatBackward                2.31%       60.804ms   2.31%        60.804ms   60.804ms       1.012ms   0.05%         1.012ms     1.012ms      1
CatBackward                2.31%       60.761ms   2.31%        60.761ms   60.761ms       1.804ms   0.09%         1.804ms     1.804ms      1
CatBackward                2.19%       57.595ms   2.19%        57.595ms   57.595ms       1.712ms   0.08%         1.712ms     1.712ms      1

Self CPU time total: 2.631s
CUDA time total: 2.018s
```

Ca-ressemble-a-du-fake commented 2 years ago

Thank you @blue-fish for your guide! Unfortunately the computer runs out of RAM after completing the second stage of the profiling (the one that involves autograd). Neither the GPU nor the CPU was more loaded than usual, but after the second stage completed, RAM (and then swap) usage skyrocketed and the computer became unusable.

I tried with 20 and 10 steps, and even 5 steps failed because the 12 GB of RAM were depleted. When the computer becomes usable again I will try to profile only 2 steps.

Ca-ressemble-a-du-fake commented 2 years ago

So I could only profile the training for a single step. The main differences I see with your results are:

But the profiling covers only one step.

I don't know how you manage to format comments so well; I was not successful in doing so. The detailed results are barely legible, so I cannot post them.


Environment Summary

  PyTorch 1.10.0+cu113 DEBUG compiled w/ CUDA 11.3
  Running with Python 3.8 and pip3 list truncated output:
  numpy==1.19.4
  torch==1.10.0+cu113
  torchaudio==0.10.0+cu113
  torchfile==0.1.0
  torchvision==0.11.1+cu113

cProfile output

  1846962 function calls (1806034 primitive calls) in 9.077 seconds

What should I look for / watch out for in the results?

ghost commented 2 years ago

Does your motherboard support PCI Express 3.0? If it only supports PCIe 2.0 then that is likely the bottleneck.

Some motherboards are manufactured with slots that physically fit an x16 GPU but don't wire up all 16 PCIe lanes for communication.
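One way to check what the card actually negotiates, assuming a reasonably recent driver, is nvidia-smi's PCIe query fields; a small sketch (run it while training, since the link can drop to a lower speed at idle):

```python
# Sketch: report current vs. maximum PCIe generation and link width via nvidia-smi.
# A current generation/width well below the maximum while the GPU is under load
# would point to a PCIe bottleneck.
import subprocess

fields = "pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max"
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv"],
    capture_output=True, text=True, check=True,
).stdout
print(out)
```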

ghost commented 2 years ago

@MGSousa @StElysse @Ca-ressemble-a-du-fake

For now, I am going to assume that "slow training" is caused by hardware being old or having limitations that prevent the GPU from operating at its full potential.

If this is not the case, please provide hardware details including:

Ca-ressemble-a-du-fake commented 2 years ago

@blue-fish thanks for your support. You are right: according to NVIDIA X Server Settings, the PCIe generation is only Gen2 (the i7 2600 CPU does not support PCIe 3.0). Yet the maximum PCIe link width is correctly reported as x16 (5.0 GT/s).

Consequently you may be right that old hardware is the bottleneck for the GPU, although CPU usage stays around 12% while training, RAM usage is around 65%, and PCIe bandwidth utilization is reported as low as 1% by NVIDIA X Server Settings. So no single component seems to be working at its full potential.

Ca-ressemble-a-du-fake commented 2 years ago

Just a follow-up on this topic. I tried to train the model with the SIWIS dataset (French) on Google Colab. The environment is as follows:

While training with batch size of 32 and r = 7, I get on

I am quite surprised it is not much faster on such a high-end GPU (only around 3 times faster than on my old setup, which is old except for the GPU).

Is there a way to know whether the code performs 16-bit or 32-bit floating-point operations? I read somewhere that one of the modes brings a performance boost. For now this is way beyond my current understanding of the topic!
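For reference, a hedged sketch of how one could check this in plain PyTorch: parameter dtypes reveal the precision the model trains in (float32 by default), and mixed precision, if someone wanted to experiment with it, is exposed through torch.cuda.amp. Nothing in this thread indicates the repo uses it; the tiny model below is a stand-in:

```python
# Sketch: check parameter precision and show the general shape of a mixed-precision
# (fp16/fp32 autocast) training step with torch.cuda.amp. Requires a CUDA device.
# The Linear model is a stand-in, not the repo's synthesizer.
import torch

model = torch.nn.Linear(80, 80).cuda()
print({p.dtype for p in model.parameters()})  # {torch.float32} means plain fp32 training

scaler = torch.cuda.amp.GradScaler()
optimizer = torch.optim.Adam(model.parameters())
x = torch.randn(12, 80, device="cuda")
y = torch.randn(12, 80, device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():  # eligible ops run in fp16, the rest stay fp32
    loss = torch.nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```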

jxnxts commented 2 years ago

> @Rainer2465 I changed outputs/step to four (r=4, ~91) and got somewhat decent results for now. But with r=2 the results remain the same (42~43 steps/s).

@MGSousa @Rainer2465 do you have any trained model today? Can we talk?