a897456 opened 4 weeks ago
https://github.com/facebookresearch/AudioDec/blob/5ec3ab9d53cff2f4d92163c0624a277d173703a5/trainer/denoise.py#L60
Is the denoising process the same as that of the autoencoder? Does it require training with the metric_loss first and then fixing those weights to continue training?
Hi @bigpon
I completed the 20,000 training steps of Stage 0 in submit_denoise.sh. However, when I started to execute Stage 1, there was no response at all.
https://github.com/facebookresearch/AudioDec/blob/5ec3ab9d53cff2f4d92163c0624a277d173703a5/bin/train.py#L106-L118
I suspect that the denoising process doesn't require adversarial parameters such as adv_train_max_steps or adv_batch_length, because I couldn't find them in configuration files such as config/denoise/symAD_vctk_48000_hop300.yaml.
https://github.com/facebookresearch/AudioDec/blob/5ec3ab9d53cff2f4d92163c0624a277d173703a5/config/denoise/symAD_vctk_48000_hop300.yaml#L174-L180
Hi, there is a typo. To run the denoising process, you first have to update the encoder while keeping the codebook and decoder fixed. I have updated the README; please follow the steps there.
https://github.com/facebookresearch/AudioDec/blob/9cc4e58aa684c96a1a299f38a7085e2426d20ad4/submit_denoise.sh#L44-L54
https://github.com/facebookresearch/AudioDec/blob/9cc4e58aa684c96a1a299f38a7085e2426d20ad4/config/denoise/symAD_vctk_48000_hop300.yaml#L27-L29
https://github.com/facebookresearch/AudioDec/blob/9cc4e58aa684c96a1a299f38a7085e2426d20ad4/codecTrain.py#L239-L255
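The setup described above (update only the encoder, keep the codebook and decoder fixed) can be sketched in PyTorch as follows. This is a minimal illustration, not the repo's actual code; the TinyCodec module and its attribute names are stand-ins for the real AudioDec model.

```python
import torch
import torch.nn as nn

class TinyCodec(nn.Module):
    # Stand-in for the real AudioDec generator; names are illustrative only.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv1d(1, 8, 3, padding=1)
        self.codebook = nn.Embedding(16, 8)
        self.decoder = nn.Conv1d(8, 1, 3, padding=1)

model = TinyCodec()

# Freeze the codebook and decoder; only the encoder stays trainable.
for module in (model.codebook, model.decoder):
    for p in module.parameters():
        p.requires_grad = False

# Hand only the trainable (encoder) parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```

With this setup, gradient updates only move the encoder, so the frozen codebook and decoder keep the representation learned during autoencoder training.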
I executed Stage 0 according to submit_denoise.sh. However, I found that during Stage 0 the configuration loads exp/autoencoder/symAD_vctk_48000_hop300/checkpoint-200000steps.pkl as the initial checkpoint. Do I need to train this checkpoint in advance (for the new dataset)?
Hi @bigpon Could you help me check whether my understanding is correct? Thanks.
1. With config/autoencoder/symAD_vctk_48000_hop300.yaml, train the autoencoder on the clean speech for 200k steps to obtain exp/autoencoder/symAD_vctk_48000_hop300/checkpoint-200000steps.pkl.
2. With config/denoise/symAD_vctk_48000_hop300.yaml, use the checkpoint from step 1 as the initial weights, and run 200k steps of denoise training on both the clean and the noisy speech to obtain exp/denoise/symAD_vctk_48000_hop300/checkpoint-200000steps.pkl.
Up to this point, the denoising training is complete. The testing process is as follows:
1. In codecTest.py, set both the encoder and the decoder to exp/denoise/symAD_vctk_48000_hop300/checkpoint-200000steps.pkl to run the test.

Hi, in the first step, you also have to train the decoder for another 500k iterations with GAN.
In the final step, you should take the decoder from the one trained with GAN.
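If I read the two replies above correctly, the final model mixes components from two runs: the encoder from the denoise training and the decoder from the extra 500k GAN steps. A hypothetical sketch of that assembly (module names and checkpoint handling are illustrative, not the repo's actual keys):

```python
import torch
import torch.nn as nn

class TinyCodec(nn.Module):
    # Stand-in for the real AudioDec generator; names are illustrative only.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.decoder = nn.Linear(4, 4)

# Simulate the two checkpoints described above.
denoise_ckpt = TinyCodec().state_dict()  # encoder updated by denoise training
gan_ckpt = TinyCodec().state_dict()      # decoder refined by the 500k GAN steps

# Assemble the final model: encoder from the denoise run, decoder from the GAN run.
final = TinyCodec()
state = final.state_dict()
state.update({k: v for k, v in denoise_ckpt.items() if k.startswith("encoder.")})
state.update({k: v for k, v in gan_ckpt.items() if k.startswith("decoder.")})
final.load_state_dict(state)
```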
Hi @bigpon I carried out the denoising process as you suggested. However, when I tested the PESQ score of the output audio, it was only 1.6. Listening to it, I also subjectively felt the quality was mediocre. The following is the denoising process. Do you have any suggestions to improve the result? Thank you.
Hi bigpon,
My idea was to add discriminator training to denoise.py, imitating the approach in autoencoder.py. I actually did it this way, but the results still didn't improve.
Because of the phase misalignment issue (you can check our paper ScoreDec), AudioDec usually achieves a low PESQ even when the input is clean speech. Using a multi-resolution mel loss can improve PESQ, but it still cannot reach a very high score.
For perceptual quality, although the PESQ score is low, the quality should be OK.
However, since updating only the encoder is a simple approach, it achieves only OK performance, which still falls behind the SOTA speech enhancement methods.
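As a rough illustration of the multi-resolution spectral loss idea mentioned above, the sketch below computes an L1 distance between log-magnitude spectrograms at several FFT resolutions. This is a simplified STFT-magnitude version for illustration, not the repo's mel-based implementation, and the resolution settings are arbitrary.

```python
import torch
import torch.nn.functional as F

def log_mag(x, n_fft, hop):
    # Log-magnitude spectrogram at one STFT resolution.
    window = torch.hann_window(n_fft)
    spec = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)
    return torch.log(spec.abs() + 1e-7)

def multires_spectral_loss(pred, target,
                           resolutions=((512, 128), (1024, 256), (2048, 512))):
    # Sum L1 log-magnitude distances over several (n_fft, hop) settings.
    loss = 0.0
    for n_fft, hop in resolutions:
        loss = loss + F.l1_loss(log_mag(pred, n_fft, hop),
                                log_mag(target, n_fft, hop))
    return loss

pred = torch.randn(2, 48000)    # a batch of 1-second 48 kHz waveforms
target = torch.randn(2, 48000)
loss = multires_spectral_loss(pred, target)
```

Combining several resolutions penalizes spectral errors at both fine frequency and fine time scales, which a single-resolution loss cannot do.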
Hi @bigpon
1. When is ScoreDec expected to be open-sourced?
2. Can the phase problem be compensated by setting use_shape_loss=true? I see that this value is always false in the configuration files.
Hi,
We don't have any plan to release ScoreDec, since people can easily train the post-filter model from this repo: https://github.com/sp-uhh/sgmse . That is, once you prepare AudioDec-coded and natural speech pairs as the noisy and clean pairs, you can train an sgmse-based postfilter. Actually, I also used sgmse for denoising, and it works well. Therefore, I recommend you use your currently trained AudioDec (w/o the GAN training part, i.e. only the 1st stage) to prepare noisy-clean speech pairs, and then train an sgmse model with these pairs. After that, you get a high-quality denoising codec (the phase is also aligned well). The only problem is that inference is very slow because of the sgmse model.
No. The shape loss mostly improves the loudness modeling, and it cannot improve the phase modeling.
Hi @bigpon When it comes to preparing noisy-clean speech pairs, does it mean that the new noisy speech obtained after the original noisy speech goes through AudioDec (w/o the GAN training part, i.e. only the 1st stage) should be grouped with the original clean speech? Or do both the original noisy speech and the original clean speech need to go through AudioDec?
Hi, in this case, we want the postfilter to do two things: remove the noise and remove the codec distortions.
Therefore, the target speech is the clean speech without any processing (i.e. the ground truth). The noisy/input speech can be:
Type I: noisy speech processed by 1st-stage AudioDec (suffering from both noise and codec distortions)
Type II: clean speech processed by 1st-stage AudioDec (suffering from only codec distortions)
I have tried using only Type I, or Type I + Type II, to train the postfilter. For noisy speech, their performances are similar. For clean speech, the model trained with I + II is better.
Therefore, I suggest you prepare both (Type-I, clean_speech) and (Type-II, clean_speech) pairs to train the postfilter.
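The pair construction described above can be sketched as follows. Here audiodec_stage1 is a placeholder for running 1st-stage AudioDec analysis/synthesis; for this self-contained sketch it just adds a small perturbation to mimic codec distortion.

```python
import numpy as np

def audiodec_stage1(wav):
    # Placeholder for 1st-stage AudioDec inference; a tiny perturbation
    # stands in for codec distortion here.
    rng = np.random.default_rng(0)
    return wav + 0.01 * rng.standard_normal(wav.shape)

def build_pairs(noisy_utts, clean_utts):
    # Every (input, target) pair targets the unprocessed clean speech.
    pairs = []
    for noisy, clean in zip(noisy_utts, clean_utts):
        pairs.append((audiodec_stage1(noisy), clean))  # Type I: noise + codec distortion
        pairs.append((audiodec_stage1(clean), clean))  # Type II: codec distortion only
    return pairs

pairs = build_pairs([np.zeros(8)], [np.ones(8)])
```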
Hi @bigpon I reorganized the dataset according to your suggestions and then trained SGMSE with all the default settings. The purple PESQ curve is for the unprocessed dataset, while the green PESQ curve is for the dataset processed by AudioDec (including clean speech and noisy speech). However, the upward trend of PESQ seems to have stalled. I guess SGMSE might require some specific settings, but I have been using the defaults throughout. I will update the results here. Meanwhile, if you can spot where the problem lies, please let me know.
Hi @bigpon Is SGMSE already obsolete? I see that the PESQ scores of many speech enhancement models have already reached 3.6.
Hi @bigpon
The PESQ curve is still quite poor. I think there is a problem with my settings, but I still haven't found the correct ones. Could you please provide the parameters you used at the time, including the backbone and the SDE? I would be extremely grateful.
Hi @bigpon Are you using the M6 settings, or something else? Could you share them?
Hi, I used SGMSE+ since the work was done in 2023. If you find a more advanced SE model, you should try it. Besides, please note that the PESQ number in our ScoreDec paper is for clean speech, which means the score-based postfilter only tackles the codec distortions.
We didn't include the SE results on noisy speech in the ScoreDec paper.
I don't remember the exact PESQ number of ScoreDec's SE results, but it was worse than SGMSE+, which is reasonable since it suffers not only from noise distortion but also from codec distortions.
@bigpon Hi, I'm trying to reproduce the denoising code. https://github.com/facebookresearch/AudioDec?tab=readme-ov-file#bonus-track-denoising In this paragraph you say "Prepare the noisy-clean corpus and follow the usage instructions in submit_denoise.sh to run the training and testing", but the execution code below it is submit_autoencoder.sh. What should be done?