Source Separation is a repository for extracting speech from various recorded sounds. It focuses on adapting more realistic datasets for training models.
The latest model in this repository is built on spectrogram-based models. Mainly, Phase-aware Speech Enhancement with Deep Complex U-Net is implemented with modifications.
In addition, the following are adopted to make inference more stable in real-world cases.
The dataset source is available in audioset_augmentor. See this link for an explanation of AudioSet. This repo uses the balanced train label dataset (label balanced, non-human classes, 18055 samples).
This is not an official implementation by the authors of the paper.
Singing Voice Separation with the DSD100 dataset! This model is larger and trained at a higher sample rate (44.1 kHz), so it gives more stable, higher-quality audio. Check out the YouTube playlist with samples of my favorites!
You can use pre-defined preprocessing and dataset sources on https://github.com/Appleholic/pytorch_sound
This repository depends on three external repositories. Setup will be updated to use either a recursive clone or internal code.
It is built using pytorch_sound, a modeling toolkit that allows engineers to train custom models for sound-related tasks. Many of the sources in this repository are based on the pytorch_sound template.
Explained in the section above. link
For evaluation, a PESQ Python wrapper repository is added.
General Voice Source Separation
Singing Voice Separation
Current Tag : v0.1.1
General Voice Source Separation
Validation: 10 random samples
Link : Google Drive
Test Samples :
Link : Google Drive
Singing Voice Separation
Before preparing the dataset and training separation models, first read the README.md files of audioset_augmentor and pytorch_sound.
$ pip install git+https://github.com/Appleholic/audioset_augmentor
$ pip install git+https://github.com/Appleholic/pytorch_sound@v0.0.3
$ pip install git+https://github.com/ludlows/python-pesq # for evaluation code
$ pip install -e .
$ python source_separation/train.py [YOUR_META_DIR] [SAVE_DIR] [MODEL NAME, see settings.py] [SAVE_PREFIX] [[OTHER OPTIONS...]]
$ python source_separation/train_jointly.py [YOUR_VOICE_BANK_META_DIR] [YOUR_DSD100_META_DIR] [SAVE_DIR] [MODEL NAME, see settings.py] [SAVE_PREFIX] [[OTHER OPTIONS...]]
Single sample
$ python source_separation/synthesize.py separate [INPUT_PATH] [OUTPUT_PATH] [MODEL NAME] [PRETRAINED_PATH] [[OTHER OPTIONS...]]
All validation samples (with evaluation)
$ python source_separation/synthesize.py validate [YOUR_META_DIR] [MODEL NAME] [PRETRAINED_PATH] [[OTHER OPTIONS...]]
All samples in a given directory
$ python source_separation/synthesize.py test-dir [INPUT_DIR] [OUTPUT_DIR] [MODEL NAME] [PRETRAINED_PATH] [[OTHER OPTIONS...]]
General Voice Separation
Singing Voice Separation
It is tuned to find a good validation wSDR loss.
The PESQ score is evaluated on the whole validation dataset, but the wSDR loss is the best loss on a small subset, picked while training is in progress.
Results may vary slightly depending on the meta file and the random state.
- The validation results tend to be different from the test results.
- The original sample rate is 22050 Hz, but PESQ requires 16 kHz, so audio is resampled before calculating PESQ.
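
The resampling step before PESQ can be sketched roughly as follows. This is a minimal sketch, not the evaluation code of this repository: the helper names `resample_factors` and `pesq_at_16k` are hypothetical, SciPy's `resample_poly` is an assumed choice of resampler, and the wide-band mode (`"wb"`) is an assumption, since the mode used for the scores here is not stated.

```python
from fractions import Fraction


def resample_factors(orig_sr, target_sr):
    """Return integer (up, down) factors so that orig_sr * up / down == target_sr."""
    ratio = Fraction(target_sr, orig_sr)
    return ratio.numerator, ratio.denominator


def pesq_at_16k(reference, degraded, orig_sr=22050, mode="wb"):
    """Resample both 1-D signals to 16 kHz, then score them with python-pesq.

    Hypothetical helper: `mode` ("wb" or "nb") is an assumption, not a
    setting taken from this repository.
    """
    # Third-party imports are kept local so resample_factors works without them.
    from scipy.signal import resample_poly
    from pesq import pesq

    up, down = resample_factors(orig_sr, 16000)
    ref_16k = resample_poly(reference, up, down)
    deg_16k = resample_poly(degraded, up, down)
    return pesq(16000, ref_16k, deg_16k, mode)


# 22050 Hz -> 16000 Hz reduces to a rational factor usable by a polyphase resampler.
up, down = resample_factors(22050, 16000)
print(up, down)
```

`pesq_at_16k` expects two 1-D NumPy arrays holding the clean reference and the separated output at the original sample rate.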
training type | score name | value |
---|---|---|
without audioset | PESQ | 2.346 |
without audioset | wsdr loss | -0.9389 |
with audioset | PESQ | 2.375 |
with audioset | wsdr loss | -0.9078 |
training type | value |
---|---|
dsd only | -0.9593 |
joint with voice bank | -0.9325 |
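
For reading the wSDR loss values in the tables above, the sketch below follows the weighted SDR loss as defined in the Deep Complex U-Net paper: a negative cosine similarity on the speech estimate and on the implied noise estimate, mixed by the energy ratio of clean speech to noise. It is written in plain Python for clarity and is not the training code of this repository; the function names are hypothetical. The loss lies in [-1, 1], and -1 corresponds to a perfect estimate.

```python
import math


def _dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def _neg_cosine(target, estimate, eps=1e-8):
    """Negative cosine similarity: -<t, e> / (||t|| * ||e||)."""
    norm = math.sqrt(_dot(target, target)) * math.sqrt(_dot(estimate, estimate))
    return -_dot(target, estimate) / (norm + eps)


def wsdr_loss(mixture, clean, estimate, eps=1e-8):
    """Weighted SDR loss for one waveform (hypothetical helper name)."""
    # Noise is whatever remains after removing speech from the mixture.
    noise = [x - y for x, y in zip(mixture, clean)]
    noise_est = [x - y for x, y in zip(mixture, estimate)]
    # alpha weights the speech term by its share of the total energy.
    clean_energy = _dot(clean, clean)
    noise_energy = _dot(noise, noise)
    alpha = clean_energy / (clean_energy + noise_energy + eps)
    return (alpha * _neg_cosine(clean, estimate, eps)
            + (1 - alpha) * _neg_cosine(noise, noise_est, eps))


# Toy signals: a perfect estimate versus a sign-flipped one.
clean = [1.0, 0.0, -1.0, 0.0]
noise = [0.5, 0.5, 0.5, 0.5]
mixture = [c + n for c, n in zip(clean, noise)]
perfect = wsdr_loss(mixture, clean, clean)
flipped = wsdr_loss(mixture, clean, [-c for c in clean])
print(perfect, flipped)
```

Values such as -0.9389 or -0.9593 therefore sit close to the lower bound of the loss.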
This repository is developed by ILJI CHOI. It is distributed under Apache License 2.0.