Hi, the perplexity is related to the utilization of the codebook: the higher the perplexity, the better the network utilizes the codebook, so an increasing perplexity during training is reasonable.
The vq_loss usually increases for a while and then fluctuates within a specific range.
During training, the most important value is probably the mel_loss. If the mel_loss keeps decreasing while the model is trained with only the metric loss, the training is likely on track.
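For reference, the perplexity logged by VQ-VAE style codecs is usually the exponential of the entropy of the codebook-usage distribution over a batch. Below is a minimal sketch of that computation, assuming a quantizer that exposes one-hot code assignments; the function name and tensor shapes are illustrative, not this repo's API.

import torch
import torch.nn.functional as F

def codebook_perplexity(encodings: torch.Tensor) -> torch.Tensor:
    """Perplexity of codebook usage for one batch.

    encodings: (N, codebook_size) one-hot code assignments of N encoder frames.
    A value near codebook_size means the codes are used roughly uniformly;
    a value near 1 means the encoder has collapsed onto a few codes.
    """
    avg_probs = encodings.float().mean(dim=0)                    # empirical usage distribution
    entropy = -(avg_probs * torch.log(avg_probs + 1e-10)).sum()  # Shannon entropy of usage
    return torch.exp(entropy)                                    # perplexity = exp(entropy)

# toy example: 1024 frames quantized against a 64-entry codebook
codes = torch.randint(0, 64, (1024,))
print(codebook_perplexity(F.one_hot(codes, num_classes=64)))     # close to 64 for uniform usage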
Hi Yi, thanks for the clarification. Do you know what a good target for the train/eval mel_loss would be, so we can say that we have trained a decent model?
Also, I had a couple of other questions:
Hey Yi
What are the most important metrics to focus on for stage 2 training? Can you also share the values you observed for them at convergence?
Thanks!
Hi, in stage 2 the model sometimes suffers from mode collapse, and the vq_loss and mel_loss become much higher. If the model is on track, the mel_loss and vq_loss should only increase slightly (by around 1-2), so it is better to monitor the mel_loss and vq_loss during stage 2 training. Furthermore, in most cases the real_loss is similar to the fake_loss in a stable GAN model.
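As a rough illustration of that 1-2 rule of thumb (this is not code from the repo; the margin and the example numbers are placeholders, and the vq_loss baseline is purely hypothetical), a simple check against the end-of-stage-1 values could look like this:

def looks_collapsed(mel_loss: float, vq_loss: float,
                    stage1_mel: float, stage1_vq: float,
                    margin: float = 2.0) -> bool:
    """Heuristic: in a healthy stage 2 run, mel_loss and vq_loss should sit
    at most ~1-2 above their end-of-stage-1 values; a much larger jump is a
    hint of mode collapse and the run should be inspected."""
    return (mel_loss - stage1_mel) > margin or (vq_loss - stage1_vq) > margin

# example: stage 1 ended with mel_loss ~17 (VCTK) and a hypothetical vq_loss baseline of 1.0
print(looks_collapsed(mel_loss=18.2, vq_loss=1.5, stage1_mel=17.0, stage1_vq=1.0))  # False, still on track
print(looks_collapsed(mel_loss=26.0, vq_loss=6.0, stage1_mel=17.0, stage1_vq=1.0))  # True, likely collapsed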
What if the perplexity first goes higher and then goes lower? Does that mean the network utilizes the codebook poorly, so we should reduce the codebook size?
Yes, it might imply that the codebook usage is low. However, reducing the number of codebooks will result in a marked quality degradation; the better approach is to adopt advanced techniques that improve codebook usage.
In this repo we didn't adopt any codebook-usage-improving techniques, but you may find some useful ones in other popular neural codec repos.
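For example, one common trick in other neural codec codebases is to re-initialize rarely used codes with encoder outputs from the current batch. The sketch below is only a hedged illustration of that idea, not something implemented in this repo; all names and shapes are assumptions.

import torch

@torch.no_grad()
def reset_dead_codes(codebook: torch.Tensor, usage_count: torch.Tensor,
                     encoder_outputs: torch.Tensor, min_usage: int = 1) -> None:
    """Replace (almost) unused codebook entries with random encoder frames.

    codebook:        (codebook_size, dim) embedding table, updated in place.
    usage_count:     (codebook_size,) recent selection counts per code.
    encoder_outputs: (N, dim) encoder frames from the current batch.
    """
    dead = usage_count < min_usage          # codes that were (almost) never selected
    n_dead = int(dead.sum())
    if n_dead == 0:
        return
    # sample replacement vectors uniformly from the current encoder outputs
    idx = torch.randint(0, encoder_outputs.size(0), (n_dead,))
    codebook[dead] = encoder_outputs[idx]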
I looked at the paper and found that stage 1 seems to give only a 0.01 improvement, so I want to skip stage 1 and go straight to stage 2 (+ HiFi-GAN's discriminator).
In the .yaml I change discriminator: 500000 to discriminator: 0, right?
###########################################################
###########################################################
start_steps:                  # Number of steps to start training
    generator: 0
    discriminator: 0
train_max_steps: 500000 # Number of training steps. (w/o adv)
adv_train_max_steps: 1000000 # Number of training steps. (w/ adv)
save_interval_steps: 100000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
log_interval_steps: 100 # Interval steps to record the training log.
Hi,
- The final mel_loss varies from corpus to corpus. According to our results, the final mel_loss is around 17 for VCTK and around 22 for LibriTTS.
- The training time depends on the number of iterations and the batch size. With a batch length of 0.2 seconds, 200k iterations of metric-only training plus another 500k iterations of metric + adversarial training take around 2 days on an A100 GPU.
- In our experience, the number of stage 1 iterations should be increased with the amount of training data, while the number of stage 2 iterations can stay the same across corpora. Therefore, we empirically set the stage 1 iteration number to 200k for VCTK and 500k for LibriTTS.
Regarding "500k for LibriTTS": does that mean the stage 1 config should look like this:
start_steps:
    generator: 0
    discriminator: 500000
train_max_steps: 1000000
Hi authors,
I am training the AudioDec model from scratch on a 16 kHz dataset; each file is around 20 seconds long. I modified the hyperparameters as mentioned in this thread. As I start training, I observe that the perplexity starts increasing almost immediately, and the vq_loss is also steadily increasing, as can be seen from the following logs.
I tried reducing the batch size and the learning rate, but that did not help. Do you have any idea why this may be happening?