This work is about training a speech quality estimator and an enhancement model WITHOUT any labeled (paired) data. Specifically, only CLEAN speech is needed during training.
CUDA version: 12.2
Python: 3.8
If you want to train from scratch, please download the datasets to the corresponding paths specified in the .csv and .pickle files (see the inspection sketch after the lists below).
Speech enhancement:
=> Training: clean speech of the VoiceBank-DEMAND trainset (its original sampling rate is 48 kHz; you have to down-sample it to 16 kHz, e.g., as in the sketch after this list)
=> Validation: as in MetricGAN-U, noisy speech (speakers p226 and p287) of the VoiceBank-DEMAND trainset
=> Evaluation: noisy speech of the VoiceBank-DEMAND testset, DNS1, and DNS3
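The down-sampling step can be done with standard audio tools; below is a minimal sketch using librosa and soundfile. The input/output directory names are placeholders, not paths from this repo:

```python
# Minimal sketch: down-sample the VoiceBank-DEMAND clean trainset
# from 48 kHz to 16 kHz.
# "clean_trainset_48k" / "clean_trainset_16k" are placeholder directory names.
from pathlib import Path

import librosa
import soundfile as sf

in_dir = Path("clean_trainset_48k")
out_dir = Path("clean_trainset_16k")
out_dir.mkdir(exist_ok=True)

for wav_path in in_dir.glob("*.wav"):
    # librosa resamples to the target rate while loading
    audio, sr = librosa.load(wav_path, sr=16000)
    sf.write(out_dir / wav_path.name, audio, sr)
```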
Quality estimation (VQScore):
=> Training: LibriSpeech clean-460 hours (train-clean-100 + train-clean-360)
=> Validation: noisy speech of the VoiceBank-DEMAND testset
=> Evaluation: Tencent and IUB corpora
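To check which dataset paths the training lists expect, a quick generic way to inspect the .csv and .pickle files is sketched below. The file names are placeholders; use the actual files shipped in this repo:

```python
# Sketch for inspecting the dataset paths stored in the training lists.
# "train_list.csv" and "train_list.pickle" are placeholder names.
import pickle

import pandas as pd

df = pd.read_csv("train_list.csv")
print(df.head())  # shows the path columns the data loader will read

with open("train_list.pickle", "rb") as f:
    data = pickle.load(f)
print(type(data), str(data)[:200])  # peek at the stored path structure
```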
To train our speech enhancement model (using only clean speech), use a command like the following:
python trainVQVAE.py \
-c config/SE_cbook_4096_1_128_lr_1m5_1m5_github.yaml \
--tag SE_cbook_4096_1_128_lr_1m5_1m5_github
To train our speech quality estimator, VQScore, use a command like the following:
python trainVQVAE.py \
-c config/QE_cbook_size_2048_1_32_IN_input_encoder_z_Librispeech_clean_github.yaml \
--tag QE_cbook_size_2048_1_32_IN_input_encoder_z_Librispeech_clean_github
Below are example commands for generating enhanced speech / estimated quality scores from the model, where '-c' is the path of the config file, '-m' is the path of the pre-trained model, and '-i' is the path of the input wav file. (To batch-process a whole folder, see the sketch after these examples.)
python inference.py \
-c ./config/SE_cbook_4096_1_128_lr_1m5_1m5_github.yaml \
-m ./exp/SE_cbook_4096_1_128_lr_1m5_1m5_github/checkpoint-dnsmos_ovr=2.761_AT.pkl \
-i ./noisy_p232_005.wav
python inference.py \
-c ./config/QE_cbook_size_2048_1_32_IN_input_encoder_z_Librispeech_clean_github.yaml \
-m ./exp/QE_cbook_size_2048_1_32_IN_input_encoder_z_Librispeech_clean_github/checkpoint-dnsmos_ovr_CC=0.835.pkl \
-i ./noisy_p232_005.wav
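Since '-i' takes a single wav in the examples above, one way to process a whole folder is a small wrapper like the sketch below. The "noisy_wavs" directory is a placeholder, and the sketch assumes inference.py's interface is exactly as shown above:

```python
# Sketch: run inference.py over every wav in a folder via subprocess.
# "noisy_wavs/" is a placeholder directory; the config/checkpoint paths
# are the speech enhancement example paths from above.
import subprocess
from pathlib import Path

config = "./config/SE_cbook_4096_1_128_lr_1m5_1m5_github.yaml"
model = "./exp/SE_cbook_4096_1_128_lr_1m5_1m5_github/checkpoint-dnsmos_ovr=2.761_AT.pkl"

for wav in sorted(Path("noisy_wavs").glob("*.wav")):
    subprocess.run(
        ["python", "inference.py", "-c", config, "-m", model, "-i", str(wav)],
        check=True,
    )
```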
We provide the checkpoints of the trained models in the corresponding ./exp/<config_name> folders.
As shown in the following spectrogram, the applied adversarial noise does not have a fixed pattern the way Gaussian noise does, so it may be a good choice for training a robust speech enhancement model.
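For intuition only, gradient-based adversarial perturbation of a waveform looks roughly like the FGSM-style PyTorch sketch below. This is a generic illustration under assumed names (`model`, `loss_fn` are hypothetical stand-ins), not the exact adversarial-training scheme used in this work:

```python
# Generic FGSM-style sketch: craft signal-dependent adversarial noise on a
# waveform. `model` and `loss_fn` are hypothetical stand-ins, NOT this
# repo's training code.
import torch

def adversarial_noise(model, loss_fn, wav, epsilon=1e-3):
    """Return a perturbation whose pattern depends on the model's gradients."""
    wav = wav.clone().requires_grad_(True)
    loss = loss_fn(model(wav), wav)  # e.g., a reconstruction loss
    loss.backward()
    # Sign of the gradient: structured noise, unlike i.i.d. Gaussian noise
    return epsilon * wav.grad.sign()

# Usage: noisy_wav = wav + adversarial_noise(model, loss_fn, wav)
```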
I'm open to collaboration! If you find this Self-Supervised SE/QE topic interesting, please let me know (e-mail: szuweif@nvidia.com).
If you find the code useful in your research, please cite our ICLR paper :)