facebookresearch / AudioDec

An Open-source Streaming High-fidelity Neural Audio Codec

feature matching loss and adversarial loss rise steadily? #32

Closed Chengbin-Liang closed 1 month ago

Chengbin-Liang commented 1 month ago

Hi authors,

I am trying to train low-bit-rate codecs for 8 kHz audio, specifically targeting bit rates of 3.2 kbps, 1.8 kbps, and 1.2 kbps. I set the hop_size to 240 (which factors as 2 × 3 × 4 × 6), set the mel_loss n_fft to 512, and increased the batch_length to 19200. I trained on the LibriTTS dataset after downsampling, with the same dataset selection you mentioned in #10.
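For reference, the achievable bit rate follows directly from the frame rate and the codebook setup. A minimal sketch of the arithmetic; the quantizer counts and 1024-entry codebooks below are illustrative assumptions, not the actual settings used in this thread:

```python
import math

# Back-of-the-envelope bitrate for a residual-VQ codec:
# bitrate = frames/sec * num_quantizers * bits per codebook index.
SAMPLE_RATE = 8000
HOP_SIZE = 240
frame_rate = SAMPLE_RATE / HOP_SIZE  # ~33.3 frames/sec

def bitrate_kbps(n_quantizers: int, codebook_size: int = 1024) -> float:
    return frame_rate * n_quantizers * math.log2(codebook_size) / 1000

for nq in (4, 6, 10):
    print(f"{nq} quantizers -> {bitrate_kbps(nq):.2f} kbps")
# 4 -> 1.33, 6 -> 2.00, 10 -> 3.33 (close to the 1.2/1.8/3.2 kbps targets)
```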

As shown in the figure, my mel_loss and vq_loss in the second stage were only slightly higher (by 1 to 2) than when I first entered the second stage. However, I observed the following phenomena:

1. As GAN training progresses, the feature matching loss and adversarial loss rise steadily (from 5 to 27). They may eventually stabilize at some value.
2. real_loss and fake_loss gradually decrease from a value greater than 1, and their magnitudes are almost identical.

I am unsure whether my model is on the right track under these circumstances. I would be grateful if you could spare some time to provide feedback.

*(screenshot: training loss curves)*

Chengbin-Liang commented 1 month ago

*(screenshot: updated training loss curves)*

Now my plot looks like this, and I don't know what the problem is. Your reply means a lot to me. Thanks.

bigpon commented 1 month ago

Hi, overall, I am not aware of any problems with the tendency of your training losses. The feature matching loss usually goes higher during GAN training, and the mel loss also goes slightly higher during GAN training.
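For intuition, the feature-matching term is usually an L1 distance between the discriminator's intermediate feature maps on real and generated audio, so its scale tracks the discriminator's activations (which grow during GAN training) rather than audio quality directly. A generic sketch, not necessarily AudioDec's exact formulation:

```python
import torch

def feature_matching_loss(real_feats, fake_feats):
    """Mean L1 distance between discriminator feature maps of real and
    generated audio, averaged over layers. Generic sketch; the actual
    weighting/normalization in AudioDec may differ."""
    loss = 0.0
    for r, f in zip(real_feats, fake_feats):
        loss = loss + torch.mean(torch.abs(r.detach() - f))
    return loss / len(real_feats)
```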

The only problem is that your mel loss is quite high even without GAN training, which is caused by the very low-bitrate requirement. Reducing the hop_size or increasing the number of codebooks will improve it.

Chengbin-Liang commented 1 month ago

Thank you very much. The results were acceptable at low bit rates, but I ran into a problem training the denoising version on the VCTK dataset. After spending 200000 steps on warm-up, I did the following two things simultaneously:

① Denoise training: the weight file "\exp\autoencoder\symAD_vctk_8000_hop240\checkpoint-200000steps.pkl" was used to initialize denoise training. I trained for 200000 steps on noisy-clean speech pairs with submit_denoise.sh to obtain "\exp\denoise\symAD_vctk_8000_hop240\checkpoint-200000steps.pkl".

② GAN training: I loaded the same warmed-up weights "\exp\autoencoder\symAD_vctk_8000_hop240\checkpoint-200000steps.pkl" (not the denoise-trained version) and ran GAN training on clean speech for 500000 steps to obtain "\exp\autoencoder\symAD_vctk_8000_hop240\checkpoint-700000steps.pkl".

But when I test noisy speech using "\exp\denoise\symAD_vctk_8000_hop240\checkpoint-200000steps.pkl" and "\exp\autoencoder\symAD_vctk_8000_hop240\checkpoint-700000steps.pkl", I can hardly hear any useful signal, as if the two sides had picked up the wrong codebook.

I didn't use a vocoder.

I'm wondering if the communication between the encoder and decoder broke down after I updated the encoder separately. Should I continue GAN training with the updated encoder and the clean dataset?

bigpon commented 1 month ago

Hi, for the denoise training, we only have to update the encoder, so the 2-stage training is unnecessary. Since you trained the model w/o GAN for 500k iterations, you should take "\exp\autoencoder\symAD_vctk_8000_hop240\checkpoint-500000steps.pkl" as the initialization and train the encoder for another 500k iterations with denoise training.
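In code terms, "update only the encoder" amounts to loading the 500k-step checkpoint and freezing everything else. A minimal sketch, where the checkpoint key layout ("model" → "generator") and the `encoder` attribute are assumptions rather than verified AudioDec internals:

```python
import torch

def prepare_denoise_finetune(model, ckpt_path, lr=1e-4):
    """Load a pretrained autoencoder and freeze all but the encoder.

    `model` is an instantiated AudioDec generator; the checkpoint keys
    and the `encoder` attribute name are illustrative assumptions.
    """
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state["model"]["generator"])
    for p in model.parameters():
        p.requires_grad = False   # keep quantizer/decoder fixed
    for p in model.encoder.parameters():
        p.requires_grad = True    # only the encoder learns from noisy input
    return torch.optim.Adam(model.encoder.parameters(), lr=lr)
```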

Chengbin-Liang commented 1 month ago

Thanks for your advice; the quality is much better after I made the change! This is excellent work!

In addition, I noticed that the pretrained weights you released are about 35 MB, while the weights of the models I trained myself are 300 MB and 900 MB. Have you applied any model-lightweighting method to the released weights?

bigpon commented 1 month ago

Hi, I released only the generator, which is about 35 MB. The default saved checkpoint includes the generator and several discriminators, so the total size is much larger.
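A sketch of the difference: stripping a full training checkpoint down to the generator for release. The key layout below is an assumption about the checkpoint schema, not verified against the AudioDec trainer:

```python
import torch

def export_generator_only(full_ckpt_path, out_path):
    """Drop the discriminators (and any optimizer state, if present) from
    a training checkpoint, keeping only the generator weights for release.
    The "model"/"generator" key names are illustrative assumptions."""
    ckpt = torch.load(full_ckpt_path, map_location="cpu")
    torch.save({"model": {"generator": ckpt["model"]["generator"]}}, out_path)
```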