RussellSB / tt-vae-gan

Timbre transfer with variational autoencoding and cycle-consistent adversarial networks. Able to transfer the timbre of an audio source to that of another.

Timbre Transfer with VAE-GAN & WaveNet

This pipeline follows and extends the work of Albadawy & Lyu 2020. The accompanying work shows, among other things, that their proposed voice conversion model is also applicable to musical instruments, which reframes the conversion in terms of a more general audio style: timbre.

You can find the pre-print of this work here. Please be sure to reference it if you use this code for your research:

      title={Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks}, 
      author={Russell Sammut Bonnici and Charalampos Saitis and Martin Benning},


The implemented pipeline makes use of the following projects (click for originating repos):

  1. voice_conversion - Performs VAE-GAN style transfer in the time-frequency melspectrogram domain.
  2. wavenet_vocoder - Vocodes melspectrogram output from style transfer model to realistic audio.
  3. fad - Computes Fréchet Audio Distance (using VGGish) to evaluate the quality of wavenet vocoder output.



Female to Male:


Violin to Trumpet:



Recommended GPU VRAM per model:



0. Setup

0.1. Clone this repo as well as its submodules for voice_conversion and wavenet_vocoder with git:

git clone
cd tt-vae-gan 
git submodule init 
git submodule update
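If the submodules were not initialised, their directories exist but are empty, and later steps fail with confusing path errors. The following is a minimal sketch for checking this from the repo root. The helper name `missing_submodules` is hypothetical, and the directory names are assumed to match the submodule names given in step 0.1.

```python
from pathlib import Path

# Directory names assumed from step 0.1; adjust if your checkout differs.
SUBMODULES = ("voice_conversion", "wavenet_vocoder")


def missing_submodules(repo_root, names=SUBMODULES):
    """Return the submodule directories that are absent or empty."""
    root = Path(repo_root)
    missing = []
    for name in names:
        path = root / name
        # An uninitialised submodule typically shows up as an empty directory.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing
```

Calling `missing_submodules(".")` from inside `tt-vae-gan` should return an empty list once both submodules have been fetched.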

0.2. Ensure that your environment has installed the dependencies of the submodules.


1. VAE-GAN

1.0. Download the dataset.


1.1. Prepare your data.

Run one of the python commands for extracting timbre files of interest:

cd data_prep
python flickr --dataroot [path/to/flickr_audio/flickr_audio/]  # For Flickr
python urmp --dataroot [path/to/urmp/]  # For URMP

Alternatively, you can use your own dataset. Just set it up so that in voice_conversion/data/data_mydataset you have the following structure:

├── spkr_1
│   ├── sample.wav
├── spkr_2
│   ├── sample.wav
│   ...
└── spkr_N
    ├── sample.wav
# The directory under each speaker cannot be nested.
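Before preprocessing a custom dataset, it can help to confirm that it matches the layout above. This is a rough sketch only: the function name `check_dataset_layout` is hypothetical and is not part of the repo, and it checks just the two properties stated above (one folder per speaker, `.wav` files directly inside it with no nesting).

```python
from pathlib import Path


def check_dataset_layout(dataset_dir):
    """Return a list of problems found in a speaker-per-folder dataset."""
    problems = []
    root = Path(dataset_dir)
    speakers = sorted(p for p in root.iterdir() if p.is_dir())
    if not speakers:
        problems.append(f"no speaker folders in {root}")
    for spkr in speakers:
        entries = list(spkr.iterdir())
        # Speaker folders must be flat: audio files only, no subfolders.
        if any(e.is_dir() for e in entries):
            problems.append(f"{spkr.name}: nested directory found")
        if not any(e.is_file() and e.suffix.lower() == ".wav" for e in entries):
            problems.append(f"{spkr.name}: no .wav files")
    return problems
```

An empty return value means the folder passed these checks. It does not validate sample rate or audio content.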

1.2. Preprocess your data

cd ../voice_conversion/src
python --dataset ../data/data_[name]

1.3. Train on your data.

python --model_name [expname] --dataset ../data/data_[name] --n_spkrs 2

1.4. Infer with VAE-GAN and reconstruct raw audio with Griffin-Lim.

python --model_name [expname] --epoch [int] --trg_id 2 --src_id 1 --wavdir [path/to/testset_1]
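For context, Griffin-Lim estimates the phase that a magnitude spectrogram lacks by alternating between the time and time-frequency domains. The update below is the textbook form of the iteration, not a description of this repo's exact implementation. The symbols are assumptions introduced here: $A$ is the target linear-frequency magnitude spectrogram (which has to be approximated from the mel spectrogram first), $x^{(k)}$ is the waveform estimate at iteration $k$, and $\odot$ is the element-wise product.

```latex
X^{(k)} = \mathrm{STFT}\big(x^{(k)}\big), \qquad
x^{(k+1)} = \mathrm{ISTFT}\!\left( A \odot \frac{X^{(k)}}{\lvert X^{(k)} \rvert} \right)
```

Each step keeps the known magnitude $A$ and reuses only the phase of the current estimate. The result is usable but often sounds artificial, which is why section 2 replaces it with a WaveNet vocoder.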

2. WaveNet

2.1. Prepare your data again (based on data extracted for VAE-GAN).

cd ../../data_prep
python --dataset ../voice_conversion/data/data_[name] --outdir ../wavenet_vocoder/egs/gaussian/data --tag [name]

2.2. Preprocess your data again (based on WaveNet specs this time).

cd ../wavenet_vocoder/egs/gaussian
spk="[name]_[id]" ./ --stage 1 --stop-stage 1

2.3. Train a wavenet vocoder.

spk="[name]_[id]" hparams=conf/[name].json ./ --stage 2 --stop-stage 2 

2.4. Infer style transferred reconstructions to improve their perceptual quality.

spk="[name]_[id_2]" inferdir="[expname]_[epoch]_G[id_2]_S[id_1]" hparams=conf/flickr.json ./
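The placeholders in these commands follow a naming pattern that recurs in section 3, so it may help to see it spelled out. The sketch below is inferred from the commands in this README and is not code from the repo. The function name `transfer_names` is hypothetical, and it assumes that `G` carries the target id and `S` the source id, as the `--trg_id`/`--src_id` example in step 1.4 suggests.

```python
def transfer_names(name, expname, epoch, src_id, trg_id):
    """Build the spk, inferdir, and output folder names used in steps 2.4 and 3.1."""
    # Speaker tag of the target timbre, e.g. "flickr_1".
    spk = f"{name}_{trg_id}"
    # Folder of style-transferred output: experiment, epoch, G<target>, S<source>.
    inferdir = f"{expname}_{epoch}_G{trg_id}_S{src_id}"
    # Name of the folder referenced under out/ in step 3.1.
    outdir = f"{spk}_{inferdir}"
    return spk, inferdir, outdir


print(transfer_names("flickr", "initial", 99, src_id=2, trg_id=1))
```

For the values shown, this reproduces the `spk="flickr_1"` and `inferdir="initial_99_G1_S2"` pair used in the pretrained WaveNet instructions.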

3. FAD

3.0. Download the VGGish model pretrained on AudioSet.

cd ../../../fad
mkdir -p data
curl -o data/vggish_model.ckpt

3.1. Create CSVs listing the files of each timbre set (the real train set, then the fake test set, both of the same target timbre).

ls --color=never ../wavenet_vocoder/egs/gaussian/data/[name]_[id_2]/train_no_dev/*.wav  > test_audio/[name]_[id_2].csv
ls --color=never ../wavenet_vocoder/egs/gaussian/out/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1]/*_gen.wav > test_audio/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1].csv
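If `ls` is not available, or you want to catch an empty file list early, the same one-path-per-line files can be written from Python. This is a sketch of an equivalent, not part of the repo, and `write_file_list` is a hypothetical helper name.

```python
import glob


def write_file_list(pattern, out_csv):
    """Write every path matching `pattern` to `out_csv`, one per line."""
    paths = sorted(glob.glob(pattern))
    with open(out_csv, "w") as f:
        for p in paths:
            f.write(p + "\n")
    # Returning the count makes it easy to notice a pattern that matched nothing.
    return len(paths)
```

Pass it the same glob patterns and output paths as the two `ls` commands above. A return value of 0 usually means a placeholder in the path was filled in incorrectly.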

3.2. Embed each of the timbre sets with VGGish.

mkdir -p stats
python -m frechet_audio_distance.create_embeddings_main  --input_files test_audio/[name]_[id_2].csv \
                                                        --stats stats/[name]_[id_2]_stats

python -m frechet_audio_distance.create_embeddings_main  --input_files test_audio/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1].csv \
                                                        --stats stats/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1]_stats

3.3. Compute the Fréchet distance between the stats of the real and generated sets.

python -m frechet_audio_distance.compute_fad --background_stats stats/[name]_[id_2]_stats \
                                             --test_stats stats/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1]_stats
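For reference, the Fréchet distance between two multivariate Gaussians has the standard closed form below. The notation is introduced here for illustration: $\mu_b, \Sigma_b$ are the mean and covariance of the VGGish embeddings of the background (real) set, and $\mu_t, \Sigma_t$ are those of the test (generated) set.

```latex
\mathrm{FAD} = \lVert \mu_b - \mu_t \rVert^2
  + \operatorname{tr}\!\left( \Sigma_b + \Sigma_t - 2\left( \Sigma_b \Sigma_t \right)^{1/2} \right)
```

A lower value means the generated embeddings are statistically closer to the real ones. The score is zero only when both means and covariances match.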

Pretrained Models

With respect to the current data preparation setup, the following one-to-one VAE-GANs and specific vocoders were trained:

| Model   | Flickr | URMP |
|---------|--------|------|
| VAE-GAN | link   | link |
| WaveNet | link   | link |

Pretrained VAE-GAN

  1. Create directory voice_conversion/src/saved_models/initial.
  2. Move the .pth files into that directory.
  3. Run inference with --model_name initial --epoch 99 for Flickr (use --epoch 490 for URMP).


Pretrained WaveNet

  1. Create directory wavenet_vocoder/egs/gaussian/exp/
  2. Drag the folder such as flickr_1_train_no_dev_flickr into that directory.
  3. Drag the meanvar.joblib file within the folder to a new directory following wavenet_vocoder/egs/gaussian/dump/[spk]/logmelspectrogram/org - where [spk] corresponds to flickr_1 for example.
  4. Call ./ with appropriate arguments such as spk="flickr_1" inferdir="initial_99_G1_S2".
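Steps 1 to 3 above can also be scripted. The sketch below only mirrors the directory names given in this list. The function `install_pretrained_wavenet` is hypothetical, it copies rather than moves the files, and it requires Python 3.8 or later for `dirs_exist_ok`.

```python
import shutil
from pathlib import Path


def install_pretrained_wavenet(model_dir, egs_dir, spk):
    """Copy a downloaded model folder and its meanvar.joblib into place."""
    model_dir = Path(model_dir)
    egs_dir = Path(egs_dir)  # e.g. wavenet_vocoder/egs/gaussian

    # Steps 1-2: place the model folder under exp/.
    exp_dir = egs_dir / "exp"
    exp_dir.mkdir(parents=True, exist_ok=True)
    target = exp_dir / model_dir.name
    shutil.copytree(model_dir, target, dirs_exist_ok=True)

    # Step 3: place the normalisation stats under dump/[spk]/logmelspectrogram/org.
    stats_dir = egs_dir / "dump" / spk / "logmelspectrogram" / "org"
    stats_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(target / "meanvar.joblib", stats_dir / "meanvar.joblib")
    return target, stats_dir
```

For the Flickr example, `model_dir` would be the downloaded `flickr_1_train_no_dev_flickr` folder and `spk` would be `"flickr_1"`.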
