facebookresearch / AudioDec

An Open-source Streaming High-fidelity Neural Audio Codec

How to execute denoising? #35

Open a897456 opened 4 weeks ago

a897456 commented 4 weeks ago

@bigpon Hi, I'm trying to reproduce the denoising code. https://github.com/facebookresearch/AudioDec?tab=readme-ov-file#bonus-track-denoising In the paragraph "Prepare the noisy-clean corpus and follow the usage instructions in submit_denoise.sh to run the training and testing", you mention following the requirements in submit_denoise.sh, but the execution code below it is submit_autoencoder.sh. May I ask what should be done?

a897456 commented 4 weeks ago

https://github.com/facebookresearch/AudioDec/blob/5ec3ab9d53cff2f4d92163c0624a277d173703a5/trainer/denoise.py#L60 Is the denoising process the same as that of the autoencoder? Does it require training with the metric_loss first and then fixing those weights to continue training?

a897456 commented 4 weeks ago

Hi @bigpon I completed 20,000 training steps according to Stage 0 of submit_denoise.sh. However, when I started to execute Stage 1, there seemed to be no response at all. https://github.com/facebookresearch/AudioDec/blob/5ec3ab9d53cff2f4d92163c0624a277d173703a5/bin/train.py#L106-L118 I suspect that the denoising process doesn't use adversarial parameters such as adv_train_max_steps or adv_batch_length, because I didn't find them in configuration files like config/denoise/symAD_vctk_48000_hop300.yaml. https://github.com/facebookresearch/AudioDec/blob/5ec3ab9d53cff2f4d92163c0624a277d173703a5/config/denoise/symAD_vctk_48000_hop300.yaml#L174-L180
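To illustrate the suspicion above: a training stage can silently become a no-op when its step budget is missing from the config. This is only a hypothetical sketch of such a guard, not the actual bin/train.py code:

```python
def adversarial_steps_to_run(config, start_steps=0):
    """How many adversarial steps a loop would run; a missing config key
    yields 0, so the stage finishes immediately with no visible output."""
    max_steps = config.get("adv_train_max_steps", 0)
    return max(0, max_steps - start_steps)

# With no adversarial keys in the YAML, stage 1 would exit at once:
print(adversarial_steps_to_run({}))                                       # 0
print(adversarial_steps_to_run({"adv_train_max_steps": 500000}, 200000))  # 300000
```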

bigpon commented 3 weeks ago

Hi, there was a typo. To run the denoising process, you first have to update the encoder while keeping the codebook and decoder fixed. I have updated the README; please follow the steps there.
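For readers following along, the update rule above (train the encoder, freeze the codebook and decoder) can be sketched as follows. The module names encoder/quantizer/decoder and the toy model are assumptions for illustration, not the repo's actual code:

```python
import torch
import torch.nn as nn

class ToyCodec(nn.Module):
    """Stand-in model; the real AudioDec modules differ."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 4)
        self.quantizer = nn.Linear(4, 4)  # stands in for the codebook
        self.decoder = nn.Linear(4, 8)

def encoder_only_params(model):
    # Freeze the codebook and decoder; leave only the encoder trainable.
    for p in model.quantizer.parameters():
        p.requires_grad = False
    for p in model.decoder.parameters():
        p.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]

model = ToyCodec()
optimizer = torch.optim.Adam(encoder_only_params(model), lr=1e-4)
```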

a897456 commented 3 weeks ago

https://github.com/facebookresearch/AudioDec/blob/9cc4e58aa684c96a1a299f38a7085e2426d20ad4/submit_denoise.sh#L44-L54 https://github.com/facebookresearch/AudioDec/blob/9cc4e58aa684c96a1a299f38a7085e2426d20ad4/config/denoise/symAD_vctk_48000_hop300.yaml#L27-L29 https://github.com/facebookresearch/AudioDec/blob/9cc4e58aa684c96a1a299f38a7085e2426d20ad4/codecTrain.py#L239-L255 I executed Stage 0 according to submit_denoise.sh. However, I found that the configuration file loads exp/autoencoder/symAD_vctk_48000_hop300/checkpoint-200000steps.pkl as the initial checkpoint during Stage 0. Do I need to train this checkpoint in advance (for the new dataset)?

a897456 commented 3 weeks ago

Hi @bigpon Could you help me check whether my understanding is correct? Thanks.

  1. First, following config/autoencoder/symAD_vctk_48000_hop300.yaml, train the autoencoder on the clean speech for 200k steps to obtain exp/autoencoder/symAD_vctk_48000_hop300/checkpoint-200000steps.pkl.
  2. Then, following config/denoise/symAD_vctk_48000_hop300.yaml, use the checkpoint from step 1 as the initial model, and run 200k steps of denoise training on both the clean and the noisy speech to obtain exp/denoise/symAD_vctk_48000_hop300/checkpoint-200000steps.pkl. At this point, the denoising training is complete.
  3. For testing: in codecTest.py, set both the encoder and the decoder to exp/denoise/symAD_vctk_48000_hop300/checkpoint-200000steps.pkl and run the test.

bigpon commented 3 weeks ago

Hi, in the first step, you also have to train the decoder for another 500k iterations with GAN.

In the final step, you should take the decoder from the one trained with GAN.

a897456 commented 3 weeks ago

Hi @bigpon I carried out the denoising process as you suggested. However, when I tested the PESQ score of the output audio, it was only 1.6. I also listened to it, and subjectively it was just so-so. The following is the denoising process. Do you have any ways to improve the results? Thank you.

a897456 commented 3 weeks ago

Hi @bigpon My idea was to add discriminator training in denoise.py, imitating the method in autoencoder.py. I actually did it this way, but the results still didn't improve.
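For reference, the kind of least-squares adversarial objective commonly used in GAN vocoder training could be sketched like this. This is a generic LSGAN formulation in numpy, not the repo's actual loss code:

```python
import numpy as np

def discriminator_loss(real_logits, fake_logits):
    # LSGAN: push real logits toward 1 and fake logits toward 0.
    return np.mean((real_logits - 1.0) ** 2) + np.mean(fake_logits ** 2)

def generator_adv_loss(fake_logits):
    # The generator tries to make the discriminator output 1 on fakes.
    return np.mean((fake_logits - 1.0) ** 2)

# A perfect discriminator incurs zero loss:
print(discriminator_loss(np.ones(4), np.zeros(4)))  # 0.0
```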

bigpon commented 2 weeks ago

Because of the phase misalignment issue (you can check our ScoreDec paper), AudioDec usually achieves a low PESQ score even when the input is clean speech. Using a multi-resolution mel loss can improve the PESQ, but it still cannot reach a very high score.

For perceptual quality, although the PESQ score is low, the quality should be OK.

However, since updating only the encoder is just a simple approach, it achieves only OK performance, which still falls behind the SOTA speech enhancement methods.
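As a rough illustration of the multi-resolution idea mentioned above, here is a minimal numpy sketch of a multi-resolution magnitude-spectrogram L1 loss. The repo's actual mel loss additionally applies mel filterbanks, which are omitted here, and the resolutions are made-up values:

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    # Magnitude STFT via framed, Hann-windowed rFFTs.
    frames = [x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames) * np.hanning(n_fft), axis=-1))

def multi_res_spectral_loss(x, y,
                            resolutions=((256, 64), (512, 128), (1024, 256))):
    # Sum the L1 spectral distance over several FFT-size/hop settings.
    return sum(np.mean(np.abs(stft_mag(x, n, h) - stft_mag(y, n, h)))
               for n, h in resolutions)

x = np.random.randn(2048)
print(multi_res_spectral_loss(x, x))  # 0.0 for identical signals
```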

a897456 commented 2 weeks ago

Because of the phase misaligned issue (you can check our paper ScoreDec), AudioDec usually achieves low PESQ even when the input is clean speech. Using multi-resolution mel-loss can improve the PESQ but it still cannot achieve a very high PESQ score.

Hi @bigpon

  1. When is ScoreDec expected to be open-sourced?
  2. Can the phase problem be compensated for by setting use_shape_loss=true? I see that this value is always false in the configuration file.

bigpon commented 2 weeks ago

Hi,

  1. We don't have any plan to release ScoreDec, since people can easily train the post-filter model from this repo: https://github.com/sp-uhh/sgmse . That is, once you prepare AudioDec-coded and natural speech pairs as the noisy and clean pairs, you can train an sgmse-based postfilter. Actually, I also used sgmse for denoising, and it works well. Therefore, I recommend you use your currently trained AudioDec (w/o the GAN training part, i.e. only the 1st stage) to prepare noisy-clean speech pairs, and then train an sgmse model with these pairs. After that, you can get a high-quality denoising codec (the phase is also aligned well). The only problem is that inference is very slow because of the sgmse model.

  2. No. The shape loss mostly improves the loudness modeling, and it cannot improve the phase modeling.
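Schematically, the resulting system described in point 1 is a two-stage inference chain. Both callables below are hypothetical stand-ins so the sketch runs, not real APIs of either repo:

```python
def enhance(wav, audiodec_first_stage, postfilter):
    """Denoising codec: 1st-stage AudioDec, then a score-based postfilter
    (e.g. an sgmse model) trained to undo the remaining codec distortion."""
    coded = audiodec_first_stage(wav)  # partial denoising + codec distortion
    return postfilter(coded)           # restores quality, aligns phase

# Toy stand-ins: halve then double each sample (an exact round trip).
out = enhance([0.25, -0.5],
              audiodec_first_stage=lambda w: [v * 0.5 for v in w],
              postfilter=lambda w: [v * 2.0 for v in w])
print(out)  # [0.25, -0.5]
```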

a897456 commented 2 weeks ago

Therefore, I recommend you use your current trained AudioDec (w/o the GAN training part, i.e. only the 1st stage) to prepare noisy-clean speech pairs, and then train a sgmse model with these pairs.

Hi @bigpon When preparing the noisy-clean speech pairs, should the new noisy speech obtained by passing the original noisy speech through AudioDec (w/o the GAN training part, i.e. only the 1st stage) be paired with the original clean speech? Or do both the original noisy speech and the original clean speech need to go through AudioDec?

bigpon commented 2 weeks ago

Hi, in this case, we want the postfilter to do two things.

  1. remove the noise
  2. compensate the codec distortion

Therefore, the target speech is the clean speech without any processing (i.e. the ground truth). The noisy/input speech can be:

Type I. noisy speech processed by 1st-stage AudioDec (suffering from both noise and codec distortions)
Type II. clean speech processed by 1st-stage AudioDec (suffering from only codec distortions)

I have tried using only Type I, or Type I + Type II, to train the postfilter. For noisy speech, their performances are similar. For clean speech, the model trained with I + II is better.

Therefore, I suggest you prepare both (Type-I, clean_speech) and (Type-II, clean_speech) pairs to train the postfilter.
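The pairing scheme above can be sketched as follows; `coded` is a hypothetical stand-in for running 1st-stage AudioDec analysis-synthesis over a file, and the function and file names are made up for illustration:

```python
def build_postfilter_pairs(noisy_files, clean_files, coded):
    """(input, target) pairs for the postfilter: the target is always the
    unprocessed clean speech; inputs are Type-I and Type-II coded speech."""
    pairs = []
    for noisy, clean in zip(noisy_files, clean_files):
        pairs.append((coded(noisy), clean))  # Type I: coded noisy speech
        pairs.append((coded(clean), clean))  # Type II: coded clean speech
    return pairs

demo = build_postfilter_pairs(["p1_noisy.wav"], ["p1_clean.wav"],
                              coded=lambda f: "coded_" + f)
print(demo)
# [('coded_p1_noisy.wav', 'p1_clean.wav'), ('coded_p1_clean.wav', 'p1_clean.wav')]
```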

a897456 commented 2 weeks ago

Hi @bigpon I reorganized the dataset according to your suggestions. Then, under all default settings, I trained SGMSE. The purple PESQ curve represents the unprocessed dataset, while the green curve represents the dataset processed by AudioDec (including clean speech and noisy speech). However, the upward trend of PESQ has become sluggish. I guess SGMSE might require some specific settings, but I have been using the defaults throughout. I will update the results here again. Meanwhile, if you can identify where the problem lies, please let me know.

a897456 commented 1 week ago

Hi @bigpon Is SGMSE already obsolete? I see that the PESQ scores of many speech enhancement models have already reached 3.6.

a897456 commented 1 week ago

Hi @bigpon The PESQ curve is still quite poor. I think there are problems with my settings, but I still haven't managed to find the correct ones. Could you please provide the parameters you used at the time, including the backbone and SDE settings? I would be extremely grateful.

a897456 commented 1 week ago

Hi @bigpon Are you using the settings of M6, or something else? Could you disclose it?

bigpon commented 1 week ago

Hi, I used SGMSE+ since the work was done in 2023. If you find a more advanced SE model, you should try it. Besides, please note that the PESQ number in our ScoreDec paper is for clean speech, which means the score-based postfilter only tackles the codec distortions.

We didn't include the SE results of the noisy speech in the ScoreDec paper.

For the PESQ number of the SE results of ScoreDec, I don't remember the exact number, but it was worse than SGMSE+'s, which is also reasonable since it suffers not only from noise distortion but also from codec distortions.