DOSE: Diffusion Dropout with Adaptive Prior for Speech Enhancement

Code for the paper "DOSE: Diffusion Dropout with Adaptive Prior for Speech Enhancement", Conference on Neural Information Processing Systems (NeurIPS), 2023.

News

We have uploaded a pre-trained model, trained on VoiceBank-DEMAND (VB) with a dropout ratio of 0.5.

With this checkpoint we obtained:

Dataset   CSIG    CBAK    COVL    PESQ    SSNR    STOI    Sampling steps
VB        3.8357  3.2350  3.1840  2.5430  8.9398  0.9335  step 1 = 40, step 2 = 15
CHIME-4   2.8673  2.1805  2.1647  1.5709  1.6121  0.8673  step 1 = 35, step 2 = 0

Brief

DOSE employs two efficient condition-augmentation techniques, diffusion dropout and an adaptive prior, to address the challenge of incorporating condition information into DDPMs for speech enhancement (SE).

Experiments demonstrate that our approach yields substantial improvements in speech quality and generation stability, consistency with the condition factor, and sampling efficiency.
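
To make the dropout idea concrete, here is a minimal sketch of condition dropout in DDPM training. It is illustrative only: the model interface, the diffusion helpers, and the tensor shapes are assumptions, not the repository's actual API.

import torch
import torch.nn.functional as F

def training_step(model, x0, condition, diffusion, p_drop=0.5):
    """One DDPM training step where the condition (noisy speech) is dropped
    with probability p_drop, so the model also learns to denoise without it.
    `diffusion.q_sample` and `diffusion.num_steps` are assumed helpers."""
    b = x0.shape[0]
    t = torch.randint(0, diffusion.num_steps, (b,), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = diffusion.q_sample(x0, t, noise)  # forward process: corrupt x0 with noise

    # Diffusion dropout: zero out the condition for a random subset of the batch.
    keep = (torch.rand(b, device=x0.device) >= p_drop).float().view(b, 1)
    cond = condition * keep

    pred_noise = model(x_t, t, cond)        # network predicts the injected noise
    return F.mse_loss(pred_noise, noise)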

Environment Requirements

Note: be careful with the versions of the dependency packages, especially pesq.

We run the code on a machine with an RTX 3090 GPU, an Intel i7-13700KF CPU, and 128 GB of memory. The code was tested with Python 3.8.13, PyTorch 1.13.1, and cudatoolkit 11.7.0. Install the dependencies via Anaconda:

# create virtual environment
conda create --name DOSE python=3.8.13

# activate environment
conda activate DOSE

# install pytorch & cudatoolkit
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

# install speech metric packages
pip install https://github.com/ludlows/python-pesq/archive/master.zip
pip install pystoi
pip install librosa

# install utils (we use ``tensorboard`` for logging)
pip install tensorboard
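
After installation, you can run a quick sanity check (a short script of our own, not part of the repository) to confirm that the packages import and that PyTorch sees the GPU:

# sanity_check.py
import librosa
import pesq
import pystoi
import torch

print(torch.__version__)           # expect 1.13.1
print(torch.cuda.is_available())   # expect True with pytorch-cuda=11.7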

How to train

Before you start training, you'll need to prepare a training dataset. The default dataset is VoiceBank-DEMAND; you can download it from the VOICEBANK-DEMAND release and resample all audio to 16 kHz. By default, this implementation assumes 35 and 15 sampling steps and a sample rate of 16 kHz. If you need to change these values, edit params.py.
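
Since the corpus is distributed at 48 kHz, one way to resample it is with librosa (a sketch of our own; the paths are placeholders):

import os
import librosa
import soundfile as sf  # installed as a dependency of librosa

def resample_dir(in_dir, out_dir, sr=16000):
    """Resample every .wav file in in_dir to sr and write it to out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for name in os.listdir(in_dir):
        if name.endswith(".wav"):
            audio, _ = librosa.load(os.path.join(in_dir, name), sr=sr)  # load + resample
            sf.write(os.path.join(out_dir, name), audio, sr)

resample_dir("/path/to/VOICEBANK-DEMAND/noisy", "/path/to/noisy_16k")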

Train the model by running:

python src/DOSE/__main__.py /path/to/model
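
Since we log with ``tensorboard``, you can monitor training progress with the command below (assuming, as in DiffWave-derived code, that event files are written under the model directory):

# monitor training progress
tensorboard --logdir /path/to/model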

How to inference

Generate the enhanced audio by running:

python src/DOSE/inference.py /path/to/model /path/to/condition /path/to/output_dir

How to evaluate

Evaluate the generated samples by running:

python src/DOSE/metric.py /path/to/clean_speech /path/to/output_dir
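
The reported numbers can also be checked for a single file pair with the metric packages installed above (a sketch of our own, not the repository's metric.py; the paths are placeholders):

import librosa
from pesq import pesq
from pystoi import stoi

fs = 16000
clean, _ = librosa.load("/path/to/clean_speech/sample.wav", sr=fs)
enhanced, _ = librosa.load("/path/to/output_dir/sample.wav", sr=fs)

# trim to a common length in case generation padded or cropped the signal
n = min(len(clean), len(enhanced))
clean, enhanced = clean[:n], enhanced[:n]

print("pesq:", pesq(fs, clean, enhanced, "wb"))            # wide-band PESQ
print("stoi:", stoi(clean, enhanced, fs, extended=False))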

Folder Structure

└── DOSE
    ├── src
    │   └── DOSE
    │       ├── __init__.py
    │       ├── __main__.py    # entry point: run the model for training
    │       ├── dataset.py     # preprocess the dataset and pad/crop speech for the model
    │       ├── inference.py   # run the model to enhance speech; adjust inference steps
    │       ├── learner.py     # load model params for training/inference and save checkpoints
    │       ├── metric.py      # compute the evaluation metrics
    │       ├── model.py       # the neural network of the proposed DOSE
    │       └── params.py      # the diffusion, model, and speech params
    └── README.md

The code of DOSE is developed based on the code of DiffWave.