facebookresearch / AudioDec

An Open-source Streaming High-fidelity Neural Audio Codec

Perplexity blows up at start of training #8

Closed · archie1993 closed 10 months ago

archie1993 commented 1 year ago

Hi authors,

I am training the AudioDec model from scratch on a 16 kHz dataset; each file in the dataset is around 20 seconds long. I modified the hyperparameters as mentioned in this thread. As soon as I start training, I observe that the perplexity starts increasing almost immediately. The vqloss is also steadily increasing, as can be seen from the following logs.

AutoEncoder Training
Configuration file=config/autoencoder/symAD_vctk_48000_hop300.yaml
2023-06-28 10:54:59,796 (train:47) INFO: device: gpu
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] sampling_rate = 16000
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] data = {'path': '/mnt/resource_nvme/segmented_20s', 'subset': {'train': 'clean_trainset_84spk_wav', 'valid': 'clean_validset_84spk_wav', 'test': 'clean_testset_wav'}}
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] model_type = symAudioDec
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] train_mode = autoencoder
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] paradigm = efficient
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] generator_params = {'input_channels': 1, 'output_channels': 1, 'encode_channels': 32, 'decode_channels': 32, 'code_dim': 64, 'codebook_num': 8, 'codebook_size': 1024, 'bias': True, 'enc_ratios': [2, 4, 8, 16], 'dec_ratios': [16, 8, 4, 2], 'enc_strides': [2, 4, 5, 5], 'dec_strides': [5, 5, 4, 2], 'mode': 'causal', 'codec': 'audiodec', 'projector': 'conv1d', 'quantier': 'residual_vq'}
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] discriminator_params = {'scales': 3, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'stride': 2, 'padding': 2}, 'scale_discriminator_params': {'in_channels': 1, 'out_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'channels': 128, 'max_downsample_channels': 1024, 'max_groups': 16, 'bias': True, 'downsample_scales': [4, 4, 4, 4, 1], 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}}, 'follow_official_norm': True, 'periods': [2, 3, 5, 7, 11], 'period_discriminator_params': {'in_channels': 1, 'out_channels': 1, 'kernel_sizes': [5, 3], 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'max_downsample_channels': 1024, 'bias': True, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'use_weight_norm': True, 'use_spectral_norm': False}}
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] use_mel_loss = True
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] mel_loss_params = {'fs': 16000, 'fft_sizes': [2048], 'hop_sizes': [200], 'win_lengths': [2048], 'window': 'hann_window', 'num_mels': 80, 'fmin': 0, 'fmax': 7600, 'log_base': None}
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] use_stft_loss = False
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] stft_loss_params = {'fft_sizes': [1024, 2048, 512], 'hop_sizes': [120, 240, 50], 'win_lengths': [600, 1200, 240], 'window': 'hann_window'}
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] use_shape_loss = False
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] shape_loss_params = {'winlen': [300]}
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] generator_adv_loss_params = {'average_by_discriminators': False}
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] discriminator_adv_loss_params = {'average_by_discriminators': False}
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] use_feat_match_loss = True
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] feat_match_loss_params = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': False}
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] lambda_adv = 1.0
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] lambda_feat_match = 2.0
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] lambda_vq_loss = 1.0
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] lambda_mel_loss = 45.0
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] lambda_stft_loss = 45.0
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] lambda_shape_loss = 45.0
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] batch_size = 2
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] batch_length = 64000
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] adv_batch_length = 9600
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] pin_memory = True
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] num_workers = 96
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] generator_optimizer_type = Adam
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] generator_optimizer_params = {'lr': 0.0001, 'betas': [0.5, 0.9], 'weight_decay': 0.0}
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] generator_scheduler_type = StepLR
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] generator_scheduler_params = {'step_size': 200000, 'gamma': 1.0}
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] generator_grad_norm = -1
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] discriminator_optimizer_type = Adam
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] discriminator_optimizer_params = {'lr': 0.0002, 'betas': [0.5, 0.9], 'weight_decay': 0.0}
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] discriminator_scheduler_type = MultiStepLR
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] discriminator_scheduler_params = {'gamma': 0.5, 'milestones': [200000, 400000, 600000, 800000]}
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] discriminator_grad_norm = -1
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] start_steps = {'generator': 0, 'discriminator': 200000}
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] train_max_steps = 200000
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] adv_train_max_steps = 700000
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] save_interval_steps = 100000
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] eval_interval_steps = 1000
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] log_interval_steps = 100
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] config = config/autoencoder/symAD_vctk_48000_hop300.yaml
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] tag = autoencoder/symAD_vctk_48000_hop300
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] exp_root = exp
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] resume = 
2023-06-28 10:54:59,838 (train:66) INFO: [TrainGAN] seed = 1337
2023-06-28 10:54:59,838 (train:66) INFO: [TrainGAN] disable_cudnn = False
2023-06-28 10:54:59,838 (train:66) INFO: [TrainGAN] outdir = exp/autoencoder/symAD_vctk_48000_hop300
2023-06-28 10:54:59,838 (codecTrain:49) INFO: Loading datasets... (batch_lenght: 64000)
2023-06-28 10:54:59,860 (codecTrain:62) INFO: The number of training files = 5638.
2023-06-28 10:54:59,860 (codecTrain:63) INFO: The number of validation files = 5638.
2023-06-28 10:55:02,164 (codecTrain:249) INFO: Train from scratch
2023-06-28 10:55:02,164 (train:108) INFO: The current training step: 0
[train]:   0%|                                                                                                  | 97/200000 [00:08<2:08:56, 25.84it/s]2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/ppl_0 = 1.0378.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/ppl_1 = 1.3916.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/ppl_2 = 2.0831.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/ppl_3 = 3.9071.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/ppl_4 = 5.6672.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/ppl_5 = 5.7495.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/ppl_6 = 6.8262.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/ppl_7 = 6.5784.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/vqloss = 1.1645.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/mel_loss = 69.3261.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/generator_loss = 70.4907.
[train]:   0%|                                                                                                 | 197/200000 [00:11<1:39:00, 33.63it/s]2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/ppl_0 = 12.0950.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/ppl_1 = 71.2452.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/ppl_2 = 117.6926.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/ppl_3 = 93.9345.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/ppl_4 = 72.0587.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/ppl_5 = 60.1578.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/ppl_6 = 57.6502.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/ppl_7 = 45.2416.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/vqloss = 0.0397.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/mel_loss = 60.2521.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/generator_loss = 60.2918.
[train]:   0%|▏                                                                                                | 297/200000 [00:14<1:37:34, 34.11it/s]2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/ppl_0 = 42.3088.
2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/ppl_1 = 101.7872.
2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/ppl_2 = 115.1603.
2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/ppl_3 = 39.3077.
2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/ppl_4 = 29.4465.
2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/ppl_5 = 25.6839.
2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/ppl_6 = 25.1760.
2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/ppl_7 = 23.8064.
2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/vqloss = 0.0471.
2023-06-28 10:55:16,953 (trainerGAN:333) INFO: (Steps: 300) train/mel_loss = 50.7779.
2023-06-28 10:55:16,953 (trainerGAN:333) INFO: (Steps: 300) train/generator_loss = 50.8251.
[train]:   0%|▏                                                                                                | 397/200000 [00:17<1:37:39, 34.07it/s]2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/ppl_0 = 65.0666.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/ppl_1 = 96.9628.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/ppl_2 = 102.8317.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/ppl_3 = 31.8549.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/ppl_4 = 29.1392.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/ppl_5 = 26.5810.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/ppl_6 = 24.8492.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/ppl_7 = 22.6607.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/vqloss = 0.0356.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/mel_loss = 48.2285.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/generator_loss = 48.2641.
[train]:   0%|▏                                                                                                | 497/200000 [00:20<1:38:37, 33.71it/s]2023-06-28 10:55:22,886 (trainerGAN:333) INFO: (Steps: 500) train/train/ppl_0 = 90.1403.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/train/ppl_1 = 111.8928.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/train/ppl_2 = 109.8676.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/train/ppl_3 = 31.8954.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/train/ppl_4 = 30.4388.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/train/ppl_5 = 26.5104.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/train/ppl_6 = 24.2304.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/train/ppl_7 = 22.0437.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/train/vqloss = 0.0557.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/mel_loss = 42.6582.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/generator_loss = 42.7138.

I tried reducing the batch size and learning rate, but that did not help. Do you have any idea why this may be happening?

bigpon commented 1 year ago

Hi, the perplexity is related to the utilization of the codebook: the higher the perplexity, the better the network utilizes the codebook. Therefore, an increasing perplexity during training is reasonable.

The vq_loss usually increases for a while and then fluctuates in a specific range.

During training, the most important value might be the mel_loss. If the mel_loss continually decreases while the model is trained with only the metric loss, the training is probably on track.
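
For intuition, codebook perplexity is usually computed as the exponential of the entropy of the empirical code-usage distribution over a batch. A minimal sketch of that computation (an illustration of the metric, not the exact function used in this repo):

```python
import torch
import torch.nn.functional as F

def codebook_perplexity(indices: torch.Tensor, codebook_size: int) -> torch.Tensor:
    """Perplexity of codebook usage for one quantizer.

    indices: integer tensor of selected code indices for a batch of frames.
    Returns a value in [1, codebook_size]: ~1 means nearly all frames map to a
    single code; values near codebook_size mean the codebook is fully utilized.
    """
    one_hot = F.one_hot(indices.flatten(), num_classes=codebook_size).float()
    probs = one_hot.mean(dim=0)                          # empirical usage distribution
    entropy = -(probs * torch.log(probs + 1e-10)).sum()  # small eps avoids log(0)
    return torch.exp(entropy)
```

So the rising ppl_0 ... ppl_7 values in the logs above simply mean each residual codebook is spreading its usage over more entries, which is the desired behavior early in training.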

archie1993 commented 1 year ago

Hi Yi, thanks for the clarification. Do you know what a good target for the train/eval mel_loss is, so that we can say we have trained a decent model?

Also, I had a couple of other questions:

  1. How much time did it take to train the LibriTTS model on 1 GPU?
  2. Let's say my dataset is 10x bigger than LibriTTS. Should we change the number of iterations for stage 1 and stage 2 (200k and 500k), or any other parameters, to account for the increased data?
bigpon commented 1 year ago

Hi,

  1. The final mel_loss varies from corpus to corpus. According to our results, the final mel_loss is around 17 for VCTK and around 22 for LibriTTS.
  2. The training time depends on the number of iterations and the batch size. With 0.2-second batch segments, 200k iterations of metric-only training plus another 500k iterations of metric + adversarial training take around 2 days on an A100 GPU.
  3. In our experience, the number of stage 1 iterations should be increased with the amount of training data, while it is fine to keep the same number of stage 2 iterations across corpora. We therefore empirically set the stage 1 iteration number to 200k for VCTK and 500k for LibriTTS.
archie1993 commented 1 year ago

Hey Yi

What are the most important metrics to focus on for stage 2 training? Can you also share the values you observed for them at convergence?

Thanks!

bigpon commented 1 year ago

Hi, in stage 2 the model sometimes suffers from mode collapse, and the vq_loss and mel_loss become much higher. If the model is on track, the mel_loss and vq_loss only increase slightly (by around 1-2), so it is better to monitor the mel_loss and vq_loss during stage 2 training. Furthermore, in a stable GAN model, the real_loss is usually similar to the fake_loss.
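
As a rough illustration of that monitoring advice, a hypothetical helper along these lines could flag a stage 2 run drifting toward collapse (the function name and thresholds are made up for illustration; they are not part of AudioDec):

```python
def stage2_looks_unhealthy(mel_loss, mel_loss_stage1, vq_loss, vq_loss_stage1,
                           real_loss, fake_loss, max_rise=2.0):
    """Heuristic health check during adversarial (stage 2) training.

    Flags the run if mel_loss/vq_loss have risen far beyond their stage 1
    values (the expected rise is only ~1-2), or if the discriminator's
    real_loss and fake_loss have drifted far apart.
    """
    mel_jump = (mel_loss - mel_loss_stage1) > max_rise
    vq_jump = (vq_loss - vq_loss_stage1) > max_rise
    adv_imbalance = abs(real_loss - fake_loss) > 0.5 * max(real_loss, fake_loss)
    return mel_jump or vq_jump or adv_imbalance
```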

lixinghe1999 commented 7 months ago

Hi, the perplexity is related to the utilization of the codebook: the higher the perplexity, the better the network utilizes the codebook. Therefore, an increasing perplexity during training is reasonable.

The vq_loss usually increases for a while and then fluctuates in a specific range.

During training, the most important value might be the mel_loss. If the mel_loss continually decreases while the model is trained with only the metric loss, the training is probably on track.

What if the perplexity first goes up and then goes down? Does that mean the network utilizes the codebook poorly, so we should reduce the number of codebooks (or the codebook size)?

bigpon commented 7 months ago

Yes, it might imply that the codebook usage is low. However, reducing the number of codebooks will result in marked quality degradation. A better way is to adopt some advanced techniques to improve the codebook usage.

In this repo, we didn't adopt any codebook-usage-improving techniques, but you may find some useful ones in other popular neural codec repos.
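
One widely used trick of that kind (again, not implemented in this repo) is to re-initialize rarely used codebook entries from encoder outputs of the current batch, so dead codes get a second chance. A generic sketch of the idea, with hypothetical names:

```python
import torch

@torch.no_grad()
def reinit_dead_codes(codebook: torch.Tensor, usage_counts: torch.Tensor,
                      encoder_outputs: torch.Tensor, min_usage: int = 1) -> None:
    """Replace under-used codebook vectors with latent vectors from the batch.

    codebook:        (codebook_size, code_dim) embedding table, updated in place.
    usage_counts:    (codebook_size,) recent selection counts per code.
    encoder_outputs: (num_frames, code_dim) encoder latents from the current batch.
    """
    dead = usage_counts < min_usage
    num_dead = int(dead.sum())
    if num_dead == 0:
        return
    # Sample replacement vectors uniformly from the batch (with replacement).
    idx = torch.randint(0, encoder_outputs.size(0), (num_dead,))
    codebook[dead] = encoder_outputs[idx]
```

Variants of this appear in several open-source codec implementations, often combined with EMA codebook updates or k-means re-initialization.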

a897456 commented 6 months ago

Hi, in stage 2 the model sometimes suffers from mode collapse, and the vq_loss and mel_loss become much higher. If the model is on track, the mel_loss and vq_loss only increase slightly (by around 1-2), so it is better to monitor the mel_loss and vq_loss during stage 2 training. Furthermore, in a stable GAN model, the real_loss is usually similar to the fake_loss.

I looked at the paper and found that stage 1 seems to give only a 0.01 improvement, so I want to skip stage 1 and go straight to stage 2 (+ HiFi-GAN's discriminator). In the .yaml file, I change discriminator: 500000 to discriminator: 0, right?

```yaml
###########################################################
#                     INTERVAL SETTING                     #
###########################################################
start_steps:                   # Number of steps to start training
    generator: 0
    discriminator: 0
train_max_steps: 500000        # Number of training steps. (w/o adv)
adv_train_max_steps: 1000000   # Number of training steps. (w/ adv)
save_interval_steps: 100000    # Interval steps to save checkpoint.
eval_interval_steps: 1000      # Interval steps to evaluate the network.
log_interval_steps: 100        # Interval steps to record the training log.
```

a897456 commented 6 months ago

Hi,

  1. The final mel_loss varies from corpus to corpus. According to our results, the final mel_loss is around 17 for VCTK and around 22 for LibriTTS.
  2. The training time depends on the number of iterations and the batch size. With 0.2-second batch segments, 200k iterations of metric-only training plus another 500k iterations of metric + adversarial training take around 2 days on an A100 GPU.
  3. In our experience, the number of stage 1 iterations should be increased with the amount of training data, while it is fine to keep the same number of stage 2 iterations across corpora. We therefore empirically set the stage 1 iteration number to 200k for VCTK and 500k for LibriTTS.

500k for LibriTTS? Does that mean stage 1 for LibriTTS should look like this?

```yaml
start_steps:
    generator: 0
    discriminator: 500000
train_max_steps: 1000000
```