This pipeline follows and extends the work of Albadawy & Lyu 2020. The work building on it shows (amongst other things) that their proposed voice conversion model also applies to musical instruments, reframing the conversion as transfer of a more generalised audio style: timbre.
You can find the pre-print of this work here. Please be sure to reference it if you use this code for your research:
@misc{sammutbonnici2021timbre,
title={Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks},
author={Russell Sammut Bonnici and Charalampos Saitis and Martin Benning},
year={2021},
eprint={2109.02096},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
The implemented pipeline makes use of the following projects (click for originating repos):
Female to Male:
Violin to Trumpet:
Recommended GPU VRAM per model:
0.1. Clone this repo as well as its submodules for voice_conversion and wavenet_vocoder with git:
git clone https://github.com/RussellSB/tt-vae-gan.git
cd tt-vae-gan
git submodule init
git submodule update
0.2. Ensure that your environment has the dependencies of the submodules installed.
Choose a dataset, Flickr or URMP, and run the corresponding python command to extract the timbre files of interest:
cd data_prep
python flickr.py --dataroot [path/to/flickr_audio/flickr_audio/] # For Flickr
python urmp.py --dataroot [path/to/urmp/] # For URMP
The extracted files will be saved to voice_conversion/data/data_[name]/, where [name] is either flickr or urmp.
Alternatively, you can use your own dataset. Just set it up so that voice_conversion/data/data_mydataset has the following structure (a helper sketch for assembling this layout follows the tree below):
voice_conversion/data/data_mydataset
├── spkr_1
│   ├── sample.wav
├── spkr_2
│   ├── sample.wav
│   ...
└── spkr_N
    ├── sample.wav
    ...
# Audio files must sit directly under each speaker directory (no nested subdirectories).
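If your recordings currently live elsewhere, a minimal sketch for copying them into this layout is shown below; the source paths and speaker labels are hypothetical placeholders, not part of this repo.

```python
# organise_mydataset.py -- hypothetical helper, not part of this repo
from pathlib import Path
import shutil

# Map each speaker/instrument label to a folder of its .wav recordings (placeholder paths)
source_dirs = {
    "spkr_1": Path("/path/to/recordings/speaker_a"),
    "spkr_2": Path("/path/to/recordings/speaker_b"),
}

target_root = Path("voice_conversion/data/data_mydataset")

for spkr, src in source_dirs.items():
    dest = target_root / spkr
    dest.mkdir(parents=True, exist_ok=True)
    # Copy wavs directly under the speaker folder -- no nested subdirectories
    for wav in src.glob("*.wav"):
        shutil.copy(wav, dest / wav.name)
```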
Next, preprocess the dataset:
cd ../voice_conversion/src
python preprocess.py --dataset ../data/data_[name]
You can also set --n_spkrs [int]; by default n_spkrs=2.
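The conversion model operates on spectrogram features rather than raw waveforms (the vocoder stage later consumes log-mel spectrograms, as the dump paths further down suggest). The exact settings live in preprocess.py, but conceptually the feature extraction is along the lines of the sketch below; the sample rate and mel parameters are illustrative assumptions, not the repo's actual values.

```python
# Illustrative log-mel extraction; parameter values are assumptions, not the repo's defaults
import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=22050)  # resample to an assumed rate
mel = librosa.feature.melspectrogram(y=y, sr=sr,
                                     n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log-compress for the model
```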
Train the model:
python train.py --model_name [expname] --dataset ../data/data_[name] --n_spkrs 2
You can also set --n_epochs [int] (100 by default) and --checkpoint_interval [int] (1 epoch by default).
Run inference:
python inference.py --model_name [expname] --epoch [int] --trg_id 2 --src_id 1 --wavdir [path/to/testset_1]
Instead of --wavdir you can pass --wav for a single file input. For --wavdir you could use, for example, ../../wavenet_vocoder/egs/gaussian/data/flickr_2/eval.
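If you want to run conversion over several test directories in one go, a small wrapper around the same flags could look like the sketch below; the directory names, experiment name and epoch are hypothetical placeholders.

```python
# Hypothetical batch wrapper around inference.py, using only the flags shown above
import subprocess

test_dirs = [
    "../../wavenet_vocoder/egs/gaussian/data/flickr_2/eval",
    "/path/to/another/testset",  # placeholder input
]

for wavdir in test_dirs:
    subprocess.run([
        "python", "inference.py",
        "--model_name", "myexp",   # hypothetical experiment name
        "--epoch", "99",           # hypothetical checkpoint epoch
        "--trg_id", "2", "--src_id", "1",
        "--wavdir", wavdir,
    ], check=True)
```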
Next, prepare the reference data for the WaveNet vocoder:
cd ../../data_prep
python wavenet.py --dataset ../voice_conversion/data/data_[name] --outdir ../wavenet_vocoder/egs/gaussian/data --tag [name]
cd ../wavenet_vocoder/egs/gaussian
spk="[name]_[id]" ./run.sh --stage 1 --stop-stage 1
Here [id] is the speaker id, 1 or 2. If you want to train on all speakers, set [id] to "_all" or similar.
Train the vocoder:
spk="[name]_[id]" hparams=conf/[name].json ./run.sh --stage 2 --stop-stage 2
You may need to adjust line 78 in run.sh for your setup. Prepend CUDA_VISIBLE_DEVICES="0,1" before ./run.sh if you have two GPUs (training takes quite long).
Run vocoder inference on the VAE-GAN output:
spk="[name]_[id_2]" inferdir="[expname]_[epoch]_G[id_2]_S[id_1]" hparams=conf/flickr.json ./infer.sh
inferdir="initial_99_G2_S1"
.CUDA_VISIBLE_DEVICES="0,1"
before ./infer.sh
(inferring takes quite long).cd ../../../fad
mkdir -p data
curl -o data/vggish_model.ckpt https://storage.googleapis.com/audioset/vggish_model.ckpt
Create csv files listing the ground-truth and generated audio to evaluate:
ls --color=never ../wavenet_vocoder/egs/gaussian/data/[name]_[id_2]/train_no_dev/*.wav > test_audio/[name]_[id_2].csv
ls --color=never ../wavenet_vocoder/egs/gaussian/out/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1]/*_gen.wav > test_audio/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1].csv
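These csv files are plain lists of wav paths, one per line. If ls --color=never is awkward on your platform, a Python sketch that should produce equivalent lists is shown below; fill in the bracketed placeholders as in the commands above.

```python
# Write plain-text lists of wav paths, mirroring the two ls commands above
from pathlib import Path

def write_list(dir_path, pattern, out_csv):
    wavs = sorted(str(p) for p in Path(dir_path).glob(pattern))
    Path(out_csv).write_text("\n".join(wavs) + "\n")

# Fill in [name], [id_2], [expname], [epoch], [id_1] for your run
write_list("../wavenet_vocoder/egs/gaussian/data/[name]_[id_2]/train_no_dev", "*.wav",
           "test_audio/[name]_[id_2].csv")
write_list("../wavenet_vocoder/egs/gaussian/out/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1]",
           "*_gen.wav",
           "test_audio/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1].csv")
```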
mkdir -p stats
python -m frechet_audio_distance.create_embeddings_main --input_files test_audio/[name]_[id_2].csv \
--stats stats/[name]_[id_2]_stats
python -m frechet_audio_distance.create_embeddings_main --input_files test_audio/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1].csv \
--stats stats/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1]_stats
Prepend CUDA_VISIBLE_DEVICES="0,1" before python if possible (embedding takes a while).
Finally, compute the FAD score:
python -m frechet_audio_distance.compute_fad --background_stats stats/[name]_[id_2]_stats \
    --test_stats stats/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1]_stats
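For reference, the reported score is the Fréchet distance between two multivariate Gaussians fitted to VGGish embeddings of the background (ground-truth) and test (generated) audio. A minimal numpy/scipy sketch of that formula, with toy data standing in for the real embeddings, is:

```python
# Frechet distance between two Gaussians N(mu_b, sigma_b) and N(mu_t, sigma_t)
import numpy as np
from scipy import linalg

def frechet_distance(mu_b, sigma_b, mu_t, sigma_t):
    diff = mu_b - mu_t
    # Matrix square root of the product of covariances
    covmean, _ = linalg.sqrtm(sigma_b @ sigma_t, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma_b + sigma_t - 2.0 * covmean))

# Toy example with random vectors standing in for VGGish features
rng = np.random.default_rng(0)
bg, test = rng.normal(size=(200, 128)), rng.normal(loc=0.1, size=(200, 128))
fad = frechet_distance(bg.mean(0), np.cov(bg, rowvar=False),
                       test.mean(0), np.cov(test, rowvar=False))
print(f"toy FAD: {fad:.4f}")
```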
With respect to the current data preparation setup, the following one-to-one VAE-GANs and matching vocoders were trained:
Model | Flickr | URMP |
---|---|---|
VAE-GAN | link | link |
WaveNet | link | link |
Extract the pretrained VAE-GAN to voice_conversion/src/saved_models/initial. Use --model_name initial --epoch 99 for inference (with epoch 490 for URMP).
Extract the pretrained WaveNet to wavenet_vocoder/egs/gaussian/exp/, placing, for example, flickr_1_train_no_dev_flickr into that directory. You will also need the dump directory wavenet_vocoder/egs/gaussian/dump/[spk]/logmelspectrogram/org, where [spk] corresponds to flickr_1 for example. Then infer with spk="flickr_1" inferdir="initial_99_G1_S2".
Notes: vocoder inference reads generated output from voice_conversion/src/out_infer/. You can point it to any local dir within that path for input.