TensorSpeech / TensorFlowTTS

TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supporting English, French, Korean, Chinese, and German, and easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Multi-band MelGAN training very slow. #723

Closed. mapledxf closed this issue 2 years ago.

mapledxf commented 2 years ago

Hi,

I am trying to train an MB-MelGAN on a 3080 Ti, but the training speed is really slow compared to training on a Titan RTX with the same code and config.

3080 Ti: tensorflow-gpu 2.7.0, CUDA 11.4, cuDNN 8.2.4, training speed: [train]: 1%| | 48772/4000000 [6:46:27<553:43:57, 1.98it/s]

Titan RTX: tensorflow-gpu 2.3.0, CUDA 10.1, cuDNN 7.6.5, training speed: [train]: 0%| | 325/4000000 [00:53<100:06:43, 11.10it/s]

Both are using the same training command:

```bash
CUDA_VISIBLE_DEVICES=0 python examples/multiband_melgan/train_multiband_melgan.py \
  --train-dir $out_dir/train/ \
  --dev-dir $out_dir/valid/ \
  --outdir $out_dir/train.multiband_melgan.v1/ \
  --config ./examples/multiband_melgan/conf/multiband_melgan.vwm.v1.yaml \
  --use-norm 1 \
  --generator_mixed_precision 1 \
  --resume ""
```

Both trainings use the same config (multiband_melgan.vwm.v1.yaml):

```yaml
# This is the hyperparameter configuration file for Multi-Band MelGAN.
# Please make sure this is adjusted for the Baker dataset. If you want to
# apply it to another dataset, you might need to carefully change some parameters.
# This configuration performs 1000k iters.

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
sampling_rate: 24000
hop_size: 300            # Hop size.
format: "npy"

###########################################################
#         GENERATOR NETWORK ARCHITECTURE SETTING          #
###########################################################
model_type: "multiband_melgan_generator"

multiband_melgan_generator_params:
    out_channels: 4               # Number of output channels (number of subbands).
    kernel_size: 7                # Kernel size of initial and final conv layers.
    filters: 384                  # Initial number of channels for conv layers.
    upsample_scales: [3, 5, 5]    # List of upsampling scales.
    stack_kernel_size: 3          # Kernel size of dilated conv layers in residual stack.
    stacks: 4                     # Number of stacks in a single residual stack module.
    is_weight_norm: false         # Use weight-norm or not.

###########################################################
#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
###########################################################
multiband_melgan_discriminator_params:
    out_channels: 1                        # Number of output channels.
    scales: 3                              # Number of multi-scales.
    downsample_pooling: "AveragePooling1D" # Pooling type for the input downsampling.
    downsample_pooling_params:             # Parameters of the above pooling function.
        pool_size: 4
        strides: 2
    kernel_sizes: [5, 3]                   # List of kernel sizes.
    filters: 16                            # Number of channels of the initial conv layer.
    max_downsample_filters: 512            # Maximum number of channels of downsampling layers.
    downsample_scales: [4, 4, 4]           # List of downsampling scales.
    nonlinear_activation: "LeakyReLU"      # Nonlinear activation function.
    nonlinear_activation_params:           # Parameters of nonlinear activation function.
        alpha: 0.2
    is_weight_norm: false                  # Use weight-norm or not.

###########################################################
#                    STFT LOSS SETTING                    #
###########################################################
stft_loss_params:
    fft_lengths: [1024, 2048, 512]   # List of FFT sizes for STFT-based loss.
    frame_steps: [120, 240, 50]      # List of hop sizes for STFT-based loss.
    frame_lengths: [600, 1200, 240]  # List of window lengths for STFT-based loss.

subband_stft_loss_params:
    fft_lengths: [384, 683, 171]     # List of FFT sizes for STFT-based loss.
    frame_steps: [30, 60, 10]        # List of hop sizes for STFT-based loss.
    frame_lengths: [150, 300, 60]    # List of window lengths for STFT-based loss.

###########################################################
#                 ADVERSARIAL LOSS SETTING                #
###########################################################
lambda_feat_match: 10.0  # Loss balancing coefficient for feature matching loss.
lambda_adv: 2.5          # Loss balancing coefficient for adversarial loss.

###########################################################
#                   DATA LOADER SETTING                   #
###########################################################
batch_size: 64               # Batch size for each GPU, assuming gradient_accumulation_steps == 1.
batch_max_steps: 9600        # Length of each audio in batch for training. Make sure it is divisible by hop_size.
batch_max_steps_valid: 48000 # Length of each audio for validation. Make sure it is divisible by hop_size.
remove_short_samples: true   # Whether to remove samples shorter than batch_max_steps.
allow_cache: true            # Whether to allow caching in the dataset. If true, it requires CPU memory.
is_shuffle: true             # Shuffle dataset after each epoch.

###########################################################
#              OPTIMIZER & SCHEDULER SETTING              #
###########################################################
generator_optimizer_params:
    lr_fn: "PiecewiseConstantDecay"
    lr_params:
        boundaries: [100000, 200000, 300000, 400000, 500000, 600000, 700000]
        values: [0.001, 0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
    amsgrad: false

discriminator_optimizer_params:
    lr_fn: "PiecewiseConstantDecay"
    lr_params:
        boundaries: [100000, 200000, 300000, 400000, 500000]
        values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
    amsgrad: false

gradient_accumulation_steps: 1

###########################################################
#                     INTERVAL SETTING                    #
###########################################################
discriminator_train_start_steps: 200000 # Step at which to begin training the discriminator.
train_max_steps: 4000000                # Number of training steps.
save_interval_steps: 20000              # Interval steps to save checkpoint.
eval_interval_steps: 5000               # Interval steps to evaluate the network.
log_interval_steps: 200                 # Interval steps to record the training log.

###########################################################
#                       OTHER SETTING                     #
###########################################################
num_save_intermediate_results: 1  # Number of batches to be saved as intermediate results.
```

I also notice that training is slower when using multiple GPUs:

with CUDA_VISIBLE_DEVICES=0,1: [train]: 0%| | 37/4000000 [00:42<127:02:17, 8.75it/s]
with CUDA_VISIBLE_DEVICES=0: [train]: 0%| | 325/4000000 [00:53<100:06:43, 11.10it/s]

@dathudeptrai Any idea why the training speed is so slow on the 3080 Ti? And why is it slower when using multiple GPUs?

dathudeptrai commented 2 years ago

@mapledxf Regarding multi-GPU: the real batch size is the batch_size in the config file multiplied by the number of GPUs, so a somewhat slower step rate is understandable because the GPUs need time to communicate with each other (for example, aggregating gradients). I used to train MB-MelGAN on a 3090 and a 3080 and it was fine; I don't know if there is a problem with the 3080 Ti. Maybe you should check the inference time first, and the IO traffic.
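For reference, a minimal sketch of the batch-size scaling described above, assuming the trainer distributes across GPUs with tf.distribute.MirroredStrategy (the usual TF 2 approach); the numbers are taken from this config, not measured:

```python
import tensorflow as tf

# Per-GPU batch size from the config (batch_size: 64 in multiband_melgan.vwm.v1.yaml).
per_replica_batch_size = 64

# MirroredStrategy replicates the model on every visible GPU and
# all-reduces gradients across them after each step.
strategy = tf.distribute.MirroredStrategy()
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

# With CUDA_VISIBLE_DEVICES=0,1 this should print a global batch of 128, so each
# step processes twice as much audio; fewer it/s does not necessarily mean lower throughput.
print("replicas:", strategy.num_replicas_in_sync, "global batch:", global_batch_size)
```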

mapledxf commented 2 years ago

I have checked the IO traffic and it does not show any issue. I also tried tacotron2 and the training speeds are similar on both machines.
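For anyone checking the same thing, a minimal profiling sketch to see whether time goes to GPU kernels or to the input pipeline; the dummy step and log directory below are placeholders, not the project's actual training loop:

```python
import tensorflow as tf

# A trivial stand-in for one training step, just to make the snippet runnable;
# in practice the profiled region would wrap the real train loop.
@tf.function
def dummy_step(x):
    return tf.reduce_sum(tf.nn.relu(x))

tf.profiler.experimental.start("./profile_logs")   # log dir name is arbitrary
for _ in range(200):
    dummy_step(tf.random.normal([64, 9600]))
tf.profiler.experimental.stop()

# Open TensorBoard on ./profile_logs and check the Profile tab to see whether
# time is dominated by GPU kernels or by the input pipeline / host-device copies.
```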

mapledxf commented 2 years ago

@dathudeptrai What is generator_mixed_precision used for? Can I add this option for the second phase of Multi-band MelGAN?

i.e., for the first phase:

```bash
python examples/multiband_melgan/train_multiband_melgan.py \
  --train-dir $out_dir/train/ \
  --dev-dir $out_dir/valid/ \
  --outdir $out_dir/train.multiband_melgan.v1/ \
  --config ./examples/multiband_melgan/conf/multiband_melgan.vwm.v1.yaml \
  --use-norm 1 \
  --generator_mixed_precision 1 \
  --resume ""
```

and for the second phase:

```bash
python examples/multiband_melgan/train_multiband_melgan.py \
  --train-dir $out_dir/train/ \
  --dev-dir $out_dir/valid/ \
  --outdir $out_dir/train.multiband_melgan.v1/ \
  --config ./examples/multiband_melgan/conf/multiband_melgan.vwm.v1.yaml \
  --use-norm 1 \
  --generator_mixed_precision 1 \
  --resume $out_dir/train.multiband_melgan.v1/checkpoints/ckpt-200000
```

dathudeptrai commented 2 years ago

> @dathudeptrai What is generator_mixed_precision used for? Can I add this option for the second phase of Multi-band MelGAN, i.e. run the first-phase command with `--generator_mixed_precision 1 --resume ""`, and the second-phase command with `--generator_mixed_precision 1 --resume $out_dir/train.multiband_melgan.v1/checkpoints/ckpt-200000`?

Yeah, you can try it, but I think it makes training slower for an unknown reason (maybe caused by the discriminator).
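For context, generator-side mixed precision in TensorFlow 2 generally means a float16 compute policy plus a loss-scaled optimizer. The snippet below is a generic illustration of that mechanism, not the exact code path behind the --generator_mixed_precision flag:

```python
import tensorflow as tf

# Mixed precision: compute in float16 on tensor cores, keep variables in
# float32 for numerical stability.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Wrap the generator optimizer so gradients are loss-scaled, which avoids
# float16 underflow; the discriminator optimizer would stay in float32 if
# only the generator is trained in mixed precision.
base_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
generator_optimizer = tf.keras.mixed_precision.LossScaleOptimizer(base_optimizer)
```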

ttsking commented 2 years ago

Maybe it is not caused by the 3080 Ti vs. the Titan RTX. I use NVIDIA Docker images to train MB-MelGAN. I found that when I use the NVIDIA Docker 21.09 image, based on CUDA 11.4, training becomes much slower than with the NVIDIA Docker 21.06 image, based on CUDA 11.3.

Maybe there are some differences between CUDA 11.4 and 11.3.
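A quick way to compare the two containers is to print the CUDA/cuDNN versions the installed TensorFlow wheel was built against; tf.sysconfig.get_build_info() exposes these keys in recent TF 2 releases:

```python
import tensorflow as tf

# Report the CUDA/cuDNN versions this TensorFlow build was compiled against,
# so they can be compared between the 21.06 (CUDA 11.3) and 21.09 (CUDA 11.4) containers.
info = tf.sysconfig.get_build_info()
print("TF:", tf.__version__)
print("built with CUDA:", info.get("cuda_version"))
print("built with cuDNN:", info.get("cudnn_version"))
print("GPUs visible:", tf.config.list_physical_devices("GPU"))
```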

mapledxf commented 2 years ago

> Maybe it is not caused by the 3080 Ti vs. the Titan RTX. I use NVIDIA Docker images to train MB-MelGAN. I found that when I use the NVIDIA Docker 21.09 image, based on CUDA 11.4, training becomes much slower than with the NVIDIA Docker 21.06 image, based on CUDA 11.3.
>
> Maybe there are some differences between CUDA 11.4 and 11.3.

Yeah, using the NVIDIA Docker image solved this problem, thanks!