We upload a pre-trained model, trained on VOICEBANK-DEMAND (VB) with a dropout ratio of 0.5. It achieves:
| Dataset | Steps (step 1 / step 2) | CSIG | CBAK | COVL | PESQ | SSNR | STOI |
|---|---|---|---|---|---|---|---|
| VB | 40 / 15 | 3.8357 | 3.2350 | 3.1840 | 2.5430 | 8.9398 | 0.9335 |
| CHIME-4 | 35 / 0 | 2.8673 | 2.1805 | 2.1647 | 1.5709 | 1.6121 | 0.8673 |
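For reference, the `ssnr` numbers reported above are a segmental SNR averaged over fixed-length frames. A minimal numpy sketch of that metric (the 512-sample frame length and the [-10, 35] dB clipping range are common conventions, not values taken from this repo):

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=512, eps=1e-10,
                  min_db=-10.0, max_db=35.0):
    """Average per-frame SNR in dB, with each frame clipped to [min_db, max_db]."""
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        noise = c - enhanced[start:start + frame_len]
        snr = 10.0 * np.log10(np.sum(c ** 2) / (np.sum(noise ** 2) + eps) + eps)
        snrs.append(np.clip(snr, min_db, max_db))
    return float(np.mean(snrs))
```

The clipping keeps silent frames from dominating the average; identical clean and enhanced signals saturate at the 35 dB ceiling.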
DOSE employs two efficient condition-augmentation techniques to address the challenge of incorporating condition information into DDPMs for SE, based on two key insights. Experiments demonstrate that our approach yields substantial improvements in high-quality, stable speech generation, in consistency with the condition factor, and in efficiency.
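To make the conditioning setup concrete, here is a generic conditional DDPM ancestral-sampling step in numpy. This is the standard reverse update with the condition signal passed to the noise predictor; it is an illustrative sketch, not DOSE's actual sampler, and `eps_model` is a stand-in for the trained network:

```python
import numpy as np

def ddpm_reverse_step(x_t, t, cond, eps_model, betas, rng):
    """One standard DDPM reverse (ancestral sampling) step.

    x_t:       current noisy sample, shape (n,)
    cond:      conditioning signal (e.g. the noisy speech), given to the model
    eps_model: callable (x_t, cond, t) -> predicted noise, shape (n,)
    betas:     noise schedule, shape (T,)
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    eps = eps_model(x_t, cond, t)  # the condition enters only through the model
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # the final step is deterministic
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
```

Loosely, sampling-step settings like the ones quoted above control how many reverse updates of this kind are executed.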
Note: be careful with the versions of the metric packages, especially pesq.
We run the code on a computer with an RTX-3090 GPU, an i7-13700KF CPU, and 128 GB of memory. The code was tested with Python 3.8.13, PyTorch 1.13.1, and cudatoolkit 11.7.0. Install the dependencies via Anaconda:
# create virtual environment
conda create --name DOSE python=3.8.13
# activate environment
conda activate DOSE
# install pytorch & cudatoolkit
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
# install speech metrics repo:
pip install https://github.com/ludlows/python-pesq/archive/master.zip
pip install pystoi
pip install librosa
# install utils (we use ``tensorboard`` for logging)
pip install tensorboard
Before you start training, you'll need to prepare a training dataset. The default is the VOICEBANK-DEMAND dataset; download it and resample the audio to 16 kHz. By default, this implementation assumes sampling steps of 35 and 15 for the two stages and a sample rate of 16 kHz. If you need to change these values, edit params.py.
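For orientation, the relevant entries in params.py typically look something like the fragment below. The field names here are illustrative, not taken from the actual file:

```python
# Hypothetical excerpt -- check src/DOSE/params.py for the real field names.
params = dict(
    sample_rate=16000,         # audio sample rate (Hz)
    inference_steps=(35, 15),  # sampling steps for stage 1 / stage 2
)
```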
We train the model by running:
python src/DOSE/__main__.py /path/to/model
We generate the audio by running:
python src/DOSE/inference.py /path/to/model /path/to/condition /path/to/output_dir
We evaluate the generated samples by running:
python src/DOSE/metric.py /path/to/clean_speech /path/to/output_dir
└── DOSE
    ├── src
    │   ├── __init__.py
    │   ├── __main__.py   # Run the model for training
    │   ├── dataset.py    # Preprocess the dataset and fill/crop the speech for the model
    │   ├── inference.py  # Run the model for inference and adjust the inference steps
    │   ├── learner.py    # Load model params for training/inference and save checkpoints
    │   ├── model.py      # The neural network code of the proposed DOSE
    │   └── params.py     # The diffusion, model, and speech params
    └── README.md
The code of DOSE is developed based on the code of DiffWave.