JasonSWFu / VQscore


Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech (ICLR 2024)

Szu-Wei Fu, Kuo-Hsuan Hung, Yu Tsao, Yu-Chiang Frank Wang

Introduction

This work trains a speech quality estimator and a speech enhancement model WITHOUT any labeled (paired) data. Specifically, only CLEAN speech is needed during training.

Environment

CUDA Version: 12.2

python: 3.8

Dataset used in the paper/code

If you want to train from scratch, please download each dataset to the corresponding path specified in the .csv and .pickle files.

Speech enhancement:

=> Training: clean speech of the VoiceBank-DEMAND trainset (its original sampling rate is 48 kHz; you have to down-sample it to 16 kHz)
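The required down-sampling can be done in a few lines. Below is a minimal sketch using `scipy.signal.resample_poly` (one of several reasonable tools; the repo may use a different resampler), exploiting the exact 3:1 ratio between 48 kHz and 16 kHz:

```python
# Sketch: down-sample a 48 kHz waveform to 16 kHz with a polyphase filter.
# Assumes numpy and scipy are installed; file I/O is left to your audio library.
import numpy as np
from scipy.signal import resample_poly

def downsample_48k_to_16k(wav_48k: np.ndarray) -> np.ndarray:
    """Resample a 48 kHz waveform to 16 kHz (ratio 1:3)."""
    return resample_poly(wav_48k, up=1, down=3)

# Example: a 1-second 48 kHz signal becomes 16000 samples.
one_second = np.zeros(48000, dtype=np.float32)
print(len(downsample_48k_to_16k(one_second)))  # 16000
```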

=> Validation: as in MetricGAN-U, noisy speech (speakers p226 and p287) of the VoiceBank-DEMAND trainset

=> Evaluation: noisy speech of the VoiceBank-DEMAND testset, DNS1, and DNS3

Quality estimation (VQScore):

=> Training: LibriSpeech clean-460 hours

=> Validation: noisy speech of the VoiceBank-DEMAND testset

=> Evaluation: Tencent and IUB

Training

To train our speech enhancement model (using only clean speech), run the following example command:

python trainVQVAE.py \
-c config/SE_cbook_4096_1_128_lr_1m5_1m5_github.yaml \
--tag SE_cbook_4096_1_128_lr_1m5_1m5_github

To train our speech quality estimator, VQScore, run the following example command:

python trainVQVAE.py \
-c config/QE_cbook_size_2048_1_32_IN_input_encoder_z_Librispeech_clean_github.yaml \
--tag QE_cbook_size_2048_1_32_IN_input_encoder_z_Librispeech_clean_github

Inference

Below are example commands for generating enhanced speech / estimated quality scores from the model, where '-c' is the path of the config file, '-m' is the path of the pre-trained model, and '-i' is the path of the input wav file.

python inference.py \
-c ./config/SE_cbook_4096_1_128_lr_1m5_1m5_github.yaml \
-m ./exp/SE_cbook_4096_1_128_lr_1m5_1m5_github/checkpoint-dnsmos_ovr=2.761_AT.pkl \
-i ./noisy_p232_005.wav
python inference.py \
-c ./config/QE_cbook_size_2048_1_32_IN_input_encoder_z_Librispeech_clean_github.yaml \
-m ./exp/QE_cbook_size_2048_1_32_IN_input_encoder_z_Librispeech_clean_github/checkpoint-dnsmos_ovr_CC=0.835.pkl \
-i ./noisy_p232_005.wav
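To process a whole folder of wav files rather than a single input, one option is to build an inference.py command per file and run each with subprocess. The helper below is a hypothetical sketch (the directory layout and the single-file `-i` interface are assumptions based on the commands above); it only constructs the argv lists:

```python
# Sketch: build one inference.py command per .wav file in a directory,
# reusing the SE config/checkpoint paths shown above. Adjust to your setup.
from pathlib import Path

CONFIG = "./config/SE_cbook_4096_1_128_lr_1m5_1m5_github.yaml"
MODEL = "./exp/SE_cbook_4096_1_128_lr_1m5_1m5_github/checkpoint-dnsmos_ovr=2.761_AT.pkl"

def build_commands(wav_dir: str):
    """Return one inference.py command (as an argv list) per .wav file."""
    return [
        ["python", "inference.py", "-c", CONFIG, "-m", MODEL, "-i", str(wav)]
        for wav in sorted(Path(wav_dir).glob("*.wav"))
    ]
```

Each returned list can then be executed with `subprocess.run(cmd, check=True)`.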

Pretrained Models

We provide the checkpoints of trained models in the corresponding ./exp/config_name folder.

Adversarial noise

As shown in the following spectrogram, the applied adversarial noise, unlike Gaussian noise, does not have a fixed pattern, so it may be a good choice for training a robust speech enhancement model.

Collaboration

I'm open to collaboration! If you find this Self-Supervised SE/QE topic interesting, please let me know (e-mail: szuweif@nvidia.com).

Citation

If you find the code useful in your research, please cite our ICLR paper :)

References