Closed MGSousa closed 2 years ago
Try without Anaconda and see if the training speed improves.
Although this repo works with Anaconda, we don't have enough developer interest to support it.
Will reopen issue if slow training speed is confirmed with a normal Python installation.
@MGSousa @StElysse Thanks for your observations in #711 which confirm the issue. Reopening this, no idea where to start since I don't have Windows to test.
I have version 10.0 of cuDNN installed, yes.
cuDNN 8.1 for CUDA v10.2.
use cuDNN 7.5
Same result with r=2 ~42 steps
@MGSousa did it work out for you?
@Rainer2465 I changed outputs/step to four (r=4 ~91) and got somewhat decent results for now. But with r=2 the results remain the same (42 ~ 43 steps /s).
@MGSousa you're Brazilian, right? Can we talk a little more about it? If you are Brazilian, could we find some way to talk about this? I would like to know more about the voice dataset used for training.
NVIDIA provides a benchmark for Tacotron 2. Does it make sense to run it and compare the results with what we get from Real Time Voice Cloning during training, or would that be comparing apples with oranges? The goal is to check whether the poor performance comes from the drivers / CUDA, from the code, or from Anaconda.
Experiment
So far (without running the benchmark) I have noticed that the GPU power consumption is very low during training. On my RTX 3070, the power draw (`nvidia-smi -q -d POWER | grep Draw`) is around 60W (compared to 130W while mining or 100W while gaming). So I doubt the GPU is being used at its full capabilities.
Expected behaviour: if the power limit is set to 130W, I would expect the power draw to reach this limit while training, since many users report 100% utilization of their GPU during training. The behaviour should be similar to mining.
The command `nvidia-smi -q -g 0 -d UTILIZATION -l 1` shows GPU usage varying between 20 and 50% during training. So the GPU is far from fully utilized, although it should be. It looks like there is a large performance margin to recover somewhere.
I would like to use `nvtop` to get a graph, but it is not compatible with the latest NVIDIA drivers. If I force the install, there is a driver / library mismatch and the training runs even slower (less than 0.1 steps / s).
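The utilization and power readings above can also be collected from a script, which makes it easier to log them alongside the training rate. This is a minimal sketch, assuming `nvidia-smi` is on the PATH and using its `--query-gpu` CSV output; the function names `parse_gpu_sample` and `sample_gpu` are illustrative and not part of the repo.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY = "utilization.gpu,power.draw,pstate"


def parse_gpu_sample(line):
    """Parse one CSV line such as '43, 61.20, P2' into a dict.

    Assumes numeric values; some GPUs report '[N/A]' for power draw,
    which this sketch does not handle.
    """
    util, power, pstate = [field.strip() for field in line.split(",")]
    return {"util_pct": float(util), "power_w": float(power), "pstate": pstate}


def sample_gpu():
    """Query nvidia-smi once; return one dict per GPU, or [] if it is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_sample(row) for row in out.splitlines() if row.strip()]
```

Calling `sample_gpu()` in a loop with a short sleep while training runs in another terminal gives a rough time series of utilization, power draw and performance state.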
The repo includes a profiler which is used for encoder training. You can try something similar to find the bottleneck for synthesizer training.
Same problem here too, on Ubuntu 20.04 without Anaconda and on RTX 3070 under Python 3.8.8. Around 0.52-0.54 step / s.
Originally posted by @Ca-ressemble-a-du-fake in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/711#issuecomment-960483401
With the whole repo in an unmodified state (including synthesizer hparams), I get 0.72-0.74 steps/s.
Switching to Tacotron2 on Tensorflow 1.x (5425557, installed with these instructions), I get 1.10-1.14 steps/s. Tensorflow training speed is variable, with the training getting faster as the model gets better. This experiment was done with the #538 model as a starting point, finetuning on a large single-speaker dataset.
Because I wanted to understand whether this difference was caused by using Taco1 vs Taco2, or PT vs TF, I also ran a training experiment with a PyTorch Taco2 implementation. Results below.
Model | PT/TF | Model Parameters | Training Speed |
---|---|---|---|
Tacotron 1 | PyTorch 1.3.1 | 30.87M | 0.72-0.74 steps/s |
Tacotron 2 | PyTorch 1.3.1 | 28.44M | 0.65-0.66 steps/s |
Tacotron 2 | Tensorflow 1.15 | 28.44M | 1.10-1.14 steps/s |
We can conclude that, in this setup, PyTorch trains more slowly than Tensorflow 1.x. There are also some other, unknown issues causing your training to be even slower.
* OS: Ubuntu 20.04
* NVIDIA Driver Version: 460.91.03
* CUDA Version: 11.2
* GPU: GTX 1660S (desktop)
* Performance state: P2
* Power draw: 90W (range: 70-110W)
* Utilization: 80% (range: 65-95%)
* VRAM: 5.5/6.0 GB
* Python 3.7.9 (with Anaconda)
* Same training speed obtained for Python 3.8.10 (without Anaconda) with torch==1.7.1
* Dataset storage: HDD
This is a nice benchmark. Which command did you use for GPU utilization? And which dataset did you use? Shouldn't an RTX 3070 be faster than a GTX 1660S on the same dataset, or is the rate independent of the dataset (my rate was measured on a French dataset)?
So can we say that around 1 step / s should be OK for r = 2 and batch_size = 12? Should the results be the same using vanilla Tacotron 2 training on the same dataset?
I just know basic Python (e.g. basic string manipulation, basic external program calls, ...). How do you use the profiler you talked about earlier? I did not find any reference to it in `encoder_train.py`; should I add a call in `synthesizer_train.py`?
Which command did you use for GPU utilization ?
watch -n 0.5 nvidia-smi
How do you use the profiler you talked about earlier ?
I made a branch for this: https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/700_slow_training If you don't want to get the branch, make these modifications: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/989a3e43a834100ae76ac6ff7edd3e5d532ced80
Example training output with profiler:
{| Epoch: 1/8 (20/2564) | Loss: 6.107 | 0.75 steps/s | Step: 0k | }
Average execution time over 10 steps:
Blocking, waiting for batch (threaded) (10/10): mean: 0ms std: 0ms
Data to cuda (10/10): mean: 1ms std: 0ms
Forward pass (10/10): mean: 466ms std: 98ms
Loss (10/10): mean: 12ms std: 2ms
Backward pass (10/10): mean: 780ms std: 153ms
Parameter update (10/10): mean: 16ms std: 0ms
Extras (visualizations, saving) (10/10): mean: 4ms std: 0ms
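To see where the step time goes, the per-stage means can be summed and turned into shares of the total. This is a small sketch that parses the output shown above; the names `stage_means`, `LINE_RE` and `PROFILE` are illustrative, and summing the means only approximates the real step time.

```python
import re

# Profiler output copied from the example above.
PROFILE = """\
Blocking, waiting for batch (threaded) (10/10): mean: 0ms std: 0ms
Data to cuda (10/10): mean: 1ms std: 0ms
Forward pass (10/10): mean: 466ms std: 98ms
Loss (10/10): mean: 12ms std: 2ms
Backward pass (10/10): mean: 780ms std: 153ms
Parameter update (10/10): mean: 16ms std: 0ms
Extras (visualizations, saving) (10/10): mean: 4ms std: 0ms
"""

# Stage name, then "(n/n): mean: <int>ms".
LINE_RE = re.compile(r"^(?P<stage>.+?) \(\d+/\d+\): mean: (?P<mean>\d+)ms")


def stage_means(text):
    """Return {stage name: mean time in ms} for every profiler line in text."""
    means = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            means[match.group("stage")] = int(match.group("mean"))
    return means


means = stage_means(PROFILE)
total_ms = sum(means.values())
steps_per_s = 1000 / total_ms

# Print stages from slowest to fastest with their share of the step.
for stage, ms in sorted(means.items(), key=lambda item: -item[1]):
    print(f"{stage:45s} {ms:5d} ms {100 * ms / total_ms:5.1f}%")
print(f"total {total_ms} ms -> {steps_per_s:.2f} steps/s")
```

For this run the means add up to 1279 ms, or roughly 0.78 steps/s, which is close to the 0.75 steps/s printed by the training loop. The forward and backward passes together account for about 97% of the step, so data loading and the host-to-GPU copy do not look like the limiting factor here.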
Thanks a lot for these precise instructions! Forward and backward passes are higher than what you showed in your comment 2 days ago. Otherwise if your comment just above deals with a GTX 1660s then my profiler results look good since RTX 3070 should be faster.
{| Epoch: 1/61 (30/748) | Loss: 0.3490 | 1.2 steps/s | Step: 115k | }
Average execution time over 10 steps: Blocking, waiting for batch (threaded) (10/10): mean: 0ms std: 0ms Data to cuda (10/10): mean: 1ms std: 0ms Forward pass (10/10): mean: 292ms std: 64ms Loss (10/10): mean: 5ms std: 1ms Backward pass (10/10): mean: 454ms std: 101ms Parameter update (10/10): mean: 28ms std: 1ms Extras (visualizations, saving) (10/10): mean: 0ms std: 0ms
(Sorry I did not succeed in formatting the table as you did!)
I thought your training rate was 0.52-0.54 steps/s, did something change? Your profiler results look reasonable to me.
It was indeed (for r = 2) but then I applied gradual training like r = 16 till 20k, then 8 till 40k, then 7 till 80k then 5 till 160k, and finally 2 till the end. Batch size is set constant to 12. Does it still look reasonable to you although r has increased ? In 10k, r will switch back to 2 so I'll post the profiler results to compare to.
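The gradual schedule described above can be written down as a list of stages with a lookup for the active `r`. This is only a sketch: it assumes the `(r, lr, step, batch_size)` field order suggested by the `tts_schedule` entry quoted later in this thread, and the learning rates and the final end step are placeholders that were not given in the comment.

```python
# (r, learning_rate, end_step, batch_size)
# Only r, the end steps up to 160k and the batch size come from the comment;
# the learning rates and the last end step are placeholder values.
SCHEDULE = [
    (16, 1e-3, 20_000, 12),
    (8, 1e-3, 40_000, 12),
    (7, 1e-3, 80_000, 12),
    (5, 1e-3, 160_000, 12),
    (2, 1e-3, 320_000, 12),
]


def reduction_factor(step, schedule=SCHEDULE):
    """Return the r value active at a given training step."""
    for r, _lr, end_step, _batch_size in schedule:
        if step < end_step:
            return r
    # Past the last stage: keep the final r.
    return schedule[-1][0]


print(reduction_factor(115_000))  # stage active around the 115k profiler run
print(reduction_factor(166_000))  # stage active around the 166k profiler run
```

With this schedule the 115k run falls in the r = 5 stage and the 166k run in the r = 2 stage. A larger r means the decoder emits more frames per decoder step, so a higher rate at r = 5 than at r = 2 is plausible rather than a sign that something changed in the setup.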
GPU utilization shows roughly 40% (range 20-60%) and memory usage 5.7/7.8 GB (mainly used by python3 with 5.3GB) so there may be room for improvements. Maybe by increasing batch size ?
Now that r has decreased to 2 with a batch size of 12, training rate is back to 0.52-0.54 step / s
Backward pass takes around 1 s.
`{| Epoch: 2/208 (272/748) | Loss: 0.3196 | 0.56 steps/s | Step: 166k | }
Average execution time over 10 steps:
Blocking, waiting for batch (threaded) (10/10): mean: 0ms std: 0ms Data to cuda (10/10): mean: 1ms std: 0ms Forward pass (10/10): mean: 682ms std: 169ms Loss (10/10): mean: 13ms std: 4ms Backward pass (10/10): mean: 1060ms std: 259ms Parameter update (10/10): mean: 28ms std: 1ms Extras (visualizations, saving) (10/10): mean: 0ms std: 0ms`
GPU utilization is still around 40% (range 20-60%) and memory has increased to 6.7GB (only for training)
What is the GPU performance state reported by `nvidia-smi`?
The `nvidia-smi` command reports that performance state P2 is being used. GPU utilization dropped a little, sometimes falling as low as 0%, but most of the time it stays between 20 and 50%.
If GPU is running at P2, doesn't seem like it is the bottleneck. I am running out of ideas.
The NVIDIA driver is 470 (I tried several; this one is the proprietary one). CUDA is now 11.5 (I also tried 11.3). The CPU does not seem to be the bottleneck since its utilization stays around 50%; it's an old i7 2600.
I tried to run PyTorch bottleneck as advised on the PT forum, but could not correctly modify `train.py` so that the PT profiler terminates in a given amount of time without messing up the code. I tried to manually set `max_step` to 254010, because training was at step 254k and I wanted it to train for 10 steps before exiting, but it made the loss increase, so I reverted my changes.
I also tried to increase batch_size to 20 and quickly got a `CUDA out of memory` error that includes some advice:
RuntimeError: CUDA out of memory. Tried to allocate 210.00 MiB (GPU 0; 7.79 GiB total capacity; 5.01 GiB already allocated; 166.25 MiB free; 5.29 GiB reserved in total by PyTorch) **If reserved memory is >> allocated memory** try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
It looks like `reserved memory` > `allocated memory` (but not >>), so should I apply their advice?
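One way to judge the "reserved >> allocated" condition is to compute the ratio from the numbers in the error message. This is a sketch only: the helper `reserved_over_allocated` is illustrative, and the value `128` for `max_split_size_mb` is an arbitrary example rather than a recommendation.

```python
import os
import re

# Error text copied from the message above.
MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 210.00 MiB "
    "(GPU 0; 7.79 GiB total capacity; 5.01 GiB already allocated; "
    "166.25 MiB free; 5.29 GiB reserved in total by PyTorch)"
)


def reserved_over_allocated(message):
    """Return reserved / allocated memory, both read from the error text in GiB."""
    allocated = float(re.search(r"([\d.]+) GiB already allocated", message).group(1))
    reserved = float(re.search(r"([\d.]+) GiB reserved", message).group(1))
    return reserved / allocated


ratio = reserved_over_allocated(MESSAGE)
print(f"reserved / allocated = {ratio:.3f}")

# If fragmentation were suspected, the allocator option named in the message
# would have to be set before the first CUDA allocation, for example here.
# The value 128 is an assumption chosen only for illustration.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```

Here reserved memory is only about 6% above allocated memory, so fragmentation is probably not the main cause of the error; a batch size of 20 simply seems to need more memory than the card has free.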
Increasing batch_size to 16 did not cause the CUDA out of memory error, but did not noticeably increase GPU utilization. VRAM usage has increased to 6.9GB.
I tried to run PyTorch bottleneck as advised on PT forum but could not correctly modify `train.py` so that PT profiler terminates in a given amount of time without messing up with the code.
First, update the training schedule in synthesizer/hparams.py so it only runs 20 steps from scratch.
### Tacotron Training
tts_schedule = [(2, 1e-3, 20, 12)], # Train only 20 steps for benchmarking
Next, you will need to change this line so `num_workers=0`. The profiler doesn't work with multi-worker dataloaders.
https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/7432046efc23cabf176f9fdc8d2fd67020059478/synthesizer/train.py#L150
Train a new model from scratch. You can call it anything, I called mine "test". The `--force_restart` option prevents the saved checkpoints from stopping training prematurely.
python -m torch.utils.bottleneck synthesizer_train.py --force_restart test datasets_root/SV2TTS/synthesizer/
Thank you @blue-fish for your guide! Unfortunately the computer runs out of RAM after completing the second stage of the profiling (the one that involves Autograd). Neither the GPU nor the CPU was more loaded than usual, but after the second stage completed, RAM usage (and then swap usage) skyrocketed and the computer became unusable.
I tried with 20, 10, and even 5 steps, and each attempt failed because the 12 GB of RAM were depleted. When the computer becomes usable again I will try to profile for 2 steps only.
So I could profile the training for a single step only. The main differences I see with your results are :
But the profiling has only one step.
I don't know how you manage to format comments so well; I was not successful in doing so. The detailed results are barely legible, so they cannot be posted.
PyTorch 1.10.0+cu113 DEBUG compiled w/ CUDA 11.3
Running with Python 3.8 and `pip3 list` truncated output:
numpy==1.19.4
torch==1.10.0+cu113
torchaudio==0.10.0+cu113
torchfile==0.1.0
torchvision==0.11.1+cu113
1846962 function calls (1806034 primitive calls) in 9.077 seconds
What should I look for / watch out in the results ?
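The "function calls ... in 9.077 seconds" line is the header of a Python profiler (cProfile) report, which `torch.utils.bottleneck` produces alongside the autograd profiler output. A common way to read such a report is to sort it by cumulative time and look at the first few rows. The sketch below shows this on a toy workload; `slow_part`, `fast_part` and `train_step` are stand-ins and not functions from the repo.

```python
import cProfile
import io
import pstats


def slow_part():
    # Stand-in for an expensive stage such as the backward pass.
    return sum(i * i for i in range(200_000))


def fast_part():
    # Stand-in for a cheap stage such as logging.
    return sum(range(1_000))


def train_step():
    slow_part()
    fast_part()


profiler = cProfile.Profile()
profiler.enable()
for _ in range(5):
    train_step()
profiler.disable()

# Sort by cumulative time and keep only the five most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

In the real report, the rows with the largest `cumtime` that belong to the training code (rather than to Python or PyTorch internals) are usually the first place to look.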
Does your motherboard support PCI Express 3.0? If it only supports PCIe 2.0 then that is likely the bottleneck.
Some motherboards are manufactured with slots that fit a x16 GPU, but don't include all 16 PCIe lanes for communication.
@MGSousa @StElysse @Ca-ressemble-a-du-fake
For now, I am going to assume that "slow training" is caused by hardware being old or having limitations that prevent the GPU from operating at its full potential.
If this is not the case, please provide hardware details including:
@blue-fish thanks for your support. You are right: according to `nvidia X server settings`, the `PCIe generation` is only `Gen2` (the i7 2600 CPU does not support PCIe 3.0). The `maximum PCIe link width` is nevertheless reported as `x16` (5.0 GT/s).
So you may be right that old hardware is the bottleneck for the GPU, although CPU usage stays around 12% while training, RAM usage is around 65%, and `PCIe Bandwidth Utilization` is reported as low as 1% by `nvidia X server settings`. So no hardware component seems to be working at its full potential.
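For reference, the theoretical link bandwidth of the two PCIe generations can be worked out from the per-lane transfer rate and the line encoding. This is a back-of-the-envelope sketch; the function name `pcie_bandwidth_gb_s` is illustrative, and real-world throughput is lower than these figures.

```python
# Per-lane transfer rate in GT/s and encoding efficiency for each generation.
PCIE_GENERATIONS = {
    2: (5.0, 8 / 10),  # PCIe 2.0 uses 8b/10b encoding
    3: (8.0, 128 / 130),  # PCIe 3.0 uses 128b/130b encoding
}


def pcie_bandwidth_gb_s(generation, lanes=16):
    """Theoretical one-direction bandwidth in GB/s (1 GB = 1e9 bytes)."""
    gt_per_s, efficiency = PCIE_GENERATIONS[generation]
    return gt_per_s * efficiency * lanes / 8  # 8 bits per byte


for gen in (2, 3):
    print(f"PCIe {gen}.0 x16: {pcie_bandwidth_gb_s(gen):.2f} GB/s")
```

A Gen2 x16 link therefore offers about half the bandwidth of Gen3 x16. Given the reported 1% PCIe bandwidth utilization and the roughly 1 ms "Data to cuda" time in the profiler output, the link does not obviously look saturated, so this alone may not explain the slow training.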
Just a follow-up on that topic. I tried to train the model with SIWIS dataset in French on Google Colab. Environment is as follows :
NVIDIA-SMI 495.44 Driver Version: 460.32.03 CUDA Version: 11.2
`Intel(R) Xeon(R) CPU @ 2.30GHz` with 2 processors and around 13GB of RAM. While training with batch size of 32 and r = 7, I get on
I am quite surprised it is not much faster on such a high-end GPU (it is around 3 times faster than on my old setup [except for the GPU]).
Is there a way to know whether the code is using 16-bit or 32-bit floating point operations? I read somewhere that one of the modes brings a performance boost. For now this is well beyond my current understanding of the topic!
@MGSousa @Rainer2465 do you have a trained model today? Can we talk?
Hi, I am training a dataset in Portuguese but the process is very slow using CUDA with default `hparams`. In this test I'm using batch_size = 8.
Expected: run 2~3 times faster.
Does anyone else have this problem on Windows with Anaconda?