Hi, the vq_loss becoming higher during training is normal, since the encoder usually outputs white-noise-like latents in the beginning. Once the encoder starts to learn something meaningful, the latents become harder for the quantizer to reconstruct, resulting in a higher vq_loss.
The mel_loss will also become higher during GAN training, since the objective of GAN training is to fool the discriminator, not to reduce the mel loss.
However, if the vq_loss or mel_loss does not converge, it is a problem. According to your settings, I think the temporal-resolution downsampling ratio might be too high (enc_strides: [3, 4, 5, 5] and dec_strides: [5, 5, 4, 3] make the downsampling ratio 3×4×5×5 = 300).
Taking a smaller temporal-resolution downsampling ratio may ease the problem (for example, enc_strides: [2, 3, 4, 5], dec_strides: [5, 4, 3, 2]).
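For reference, a minimal sketch of how the temporal downsampling ratio follows from the encoder strides (the stride values come from the comment above; this is just illustrative Python, not code from the repo):

```python
import math

# Each strided conv stage divides the time axis by its stride,
# so the overall temporal downsampling ratio is the product of the strides.
original_strides = [3, 4, 5, 5]   # 3*4*5*5 = 300
smaller_strides  = [2, 3, 4, 5]   # 2*3*4*5 = 120

for strides in (original_strides, smaller_strides):
    ratio = math.prod(strides)
    print(f"enc_strides={strides} -> downsampling ratio = {ratio}")
```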
I don't think so, because [2, 3, 4, 5] means the downsampling ratio is 120, and 9600/120 = 80 > 64 (codebook_dim).
Same question as https://github.com/facebookresearch/AudioDec/issues/19. I would like to know how to adjust the parameters in the config to achieve the best output for 16kHz input data. How did you finally adjust it? @lixinghe1999
> because [2,3,4,5] means the downsampling ratio=120, 9600/120=80 > 64(codebook_dim)
Hi, the downsampling is along the temporal axis, so it should be 48000 (48 kHz)/120 = 400 Hz for the codes, which is different from the code dimension 64. That is, for each second, you will get 400 × 64 × (number of RVQ codebooks, here 8).
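To make those numbers concrete, here is a small sketch (not AudioDec's actual code) of the shapes you would expect for one second of 48 kHz audio with downsampling ratio 120, codebook_dim 64, and 8 RVQ codebooks, as quoted above:

```python
sample_rate = 48000      # 48 kHz input
downsample_ratio = 120   # product of enc_strides [2, 3, 4, 5]
codebook_dim = 64        # dimension of each latent frame
num_rvq = 8              # number of residual VQ codebooks

frames_per_second = sample_rate // downsample_ratio   # 400 frames/s
# Continuous latent per second: (frames, codebook_dim) = (400, 64)
# Discrete codes per second:    (frames, num_rvq)      = (400, 8) indices
print(frames_per_second)                  # 400
print((frames_per_second, codebook_dim))  # (400, 64)
print((frames_per_second, num_rvq))       # (400, 8)
```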
> batch_length: 9600
Yes, but with batch_length: 9600, 9600/120 = 80, so I think the strides should be changed along with batch_length.
From my understanding, the batch_length only influences the GPU memory consumption, so normally we don't need to worry about it (as long as it is divisible by the downsample rate). The codebook dim you mentioned seems to apply only to a single time frame and is not related to batch_length. Please correct me if I am wrong.
Yes, the batch_length is more related to GPU usage, and the only requirement is that it is divisible by the downsample rate.
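A minimal sketch of the divisibility check being discussed (the helper and its assert are illustrative, not taken from the repo):

```python
import math

def check_batch_length(batch_length: int, enc_strides: list[int]) -> int:
    """Return the number of latent frames per training segment,
    asserting that batch_length is divisible by the downsample rate."""
    downsample_rate = math.prod(enc_strides)
    assert batch_length % downsample_rate == 0, (
        f"batch_length={batch_length} is not divisible by "
        f"downsample rate {downsample_rate}"
    )
    return batch_length // downsample_rate

print(check_batch_length(9600, [2, 3, 4, 5]))   # 9600 / 120 = 80 frames
print(check_batch_length(96000, [2, 3, 4, 5]))  # 96000 / 120 = 800 frames
```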
I actually found that the longer the batch_length, the better the performance, which is straightforward, but a longer batch_length results in much longer training time in the second stage (with GAN training).
However, a longer batch_length does not significantly increase the training time in the first stage, so I use 96000 in the 1st stage and 9600 in the 2nd stage in my latest settings.
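Sketched below as a small Python illustration of that per-stage choice (the dict layout and stage names are just for illustration, not the real config schema; both values remain divisible by the downsample rate):

```python
# Per-stage batch_length settings reflecting the comment above.
downsample_rate = 120  # product of enc_strides [2, 3, 4, 5]

stage_batch_lengths = {
    "stage1_autoencoder": 96000,  # longer segments: little extra cost without the discriminator
    "stage2_gan": 9600,           # shorter segments: GAN training time grows quickly with length
}

for stage, length in stage_batch_lengths.items():
    assert length % downsample_rate == 0  # both still divisible by the downsample rate
    print(stage, length, "->", length // downsample_rate, "latent frames per segment")
```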
I am working on my own dataset, which has 2 channels and a sampling rate of 16000. I paste my config file below; the major changes I made are: 1) sample_rate, 2) data path, 3) input/output channels.
1) In the stage 1 training (<500k steps), the mel_loss seems reasonable, but the vq_loss gets larger and larger, which seems weird.
2) In the stage 2 training, my mel_loss goes much higher. Is the reason that 1) I set the wrong lambda_adv, or 2) is it caused by the bad vq_loss? What is the recommended way to work on it?
Thank you in advance!
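For reference, a hedged sketch of the kind of changes described in the question (the key names `sample_rate`, `input_channels`, `output_channels`, the nesting, and the path are assumptions for illustration only; they are not the poster's actual config and should be checked against the real AudioDec config files):

```python
# Hypothetical fragment illustrating the three changes mentioned above
# (16 kHz, stereo, custom data path); key names are assumptions, not the
# verbatim AudioDec config schema.
config_changes = {
    "sample_rate": 16000,               # changed from the 48 kHz default setup
    "data": {
        "path": "/path/to/my_dataset",  # placeholder path
    },
    "generator_params": {
        "input_channels": 2,            # stereo input
        "output_channels": 2,           # stereo output
    },
}
```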