
Timbre transfer with variational autoencoding and cycle-consistent adversarial networks. Transfers the timbre of one audio source to that of another.

Timbre Transfer with VAE-GAN & WaveNet

This pipeline follows and extends the work of Albadawy & Lyu 2020. The accompanying work shows (among other things) that their proposed voice conversion model also applies to musical instruments, generalising the conversion from voice to the broader notion of audio style: timbre.

You can find the pre-print of this work on arXiv (2109.02096). Please be sure to reference it if you use this code for your research:

@misc{sammutbonnici2021timbre,
      title={Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks}, 
      author={Russell Sammut Bonnici and Charalampos Saitis and Martin Benning},
      year={2021},
      eprint={2109.02096},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

Summary

The implemented pipeline makes use of the following projects (click for originating repos):

  1. voice_conversion - Performs VAE-GAN style transfer in the time-frequency melspectrogram domain.
  2. wavenet_vocoder - Vocodes melspectrogram output from style transfer model to realistic audio.
  3. fad - Computes Fréchet Audio Distance (using VGGish) to evaluate the quality of wavenet vocoder output.


Demo

Female to Male: demo clip G2_970641406_9a20ee636a_4

Violin to Trumpet: demo clip G1_AuSep_2_vn_32_Fugue

Hardware

Recommended GPU VRAM per model:


Tutorial

0. Setup

0.1. Clone this repo as well as its submodules for voice_conversion and wavenet_vocoder with git:

git clone https://github.com/RussellSB/tt-vae-gan.git
cd tt-vae-gan 
git submodule init 
git submodule update

0.2. Ensure that your environment has the dependencies of the submodules installed.
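
For example, the dependencies can be installed with pip inside a fresh Python 3 environment. A minimal sketch, assuming a CUDA-capable PyTorch build; the package names below are assumptions, and the authoritative lists live in the submodules' own READMEs:

# Illustrative only -- consult each submodule's README for exact packages and versions.
pip install torch numpy scipy librosa tqdm   # typical dependencies of the style transfer model (assumed)
cd wavenet_vocoder
pip install -e .                             # install the vocoder package from source (assumed layout)
cd ..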


1. VAE-GAN

1.0. Download the dataset.

Choose one of: the Flickr 8k Audio Caption Corpus (speech) or the URMP dataset (musical instruments).

1.1. Prepare your data.

Run one of the following Python commands to extract the timbre files of interest:

cd data_prep
python flickr.py --dataroot [path/to/flickr_audio/flickr_audio/]  # For Flickr
python urmp.py --dataroot [path/to/urmp/]  # For URMP

Alternatively, you can use your own dataset. Just set it up so that voice_conversion/data/data_mydataset has the following structure (a helper sketch for arranging files follows the layout below):

voice_conversion/data/data_mydataset
├── spkr_1
│   ├── sample.wav
├── spkr_2
│   ├── sample.wav
│   ...
└── spkr_N
    ├── sample.wav
    ...
# Audio files must sit directly under each speaker directory (no nested subdirectories).
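
If your recordings start out in a single flat folder, a short script can arrange them into the layout above. A minimal sketch; the source folder, filename convention, and speaker mapping below are hypothetical and not part of this repo:

import shutil
from pathlib import Path

src = Path("my_recordings")                          # hypothetical flat folder of .wav files
dst = Path("voice_conversion/data/data_mydataset")

# Hypothetical mapping from filename prefix to timbre/speaker directory.
speakers = {"alice": "spkr_1", "bob": "spkr_2"}

for wav in sorted(src.glob("*.wav")):
    prefix = wav.stem.split("_")[0]                  # e.g. "alice_001.wav" -> "alice"
    if prefix not in speakers:
        continue
    out_dir = dst / speakers[prefix]
    out_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(wav, out_dir / wav.name)             # keep files directly under each spkr_* dir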

1.2. Preprocess your data

cd ../voice_conversion/src
python preprocess.py --dataset ../data/data_[name]

1.3. Train on your data.

python train.py --model_name [expname] --dataset ../data/data_[name] --n_spkrs 2
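
For instance, training a two-speaker model on the Flickr data prepared above might look like this (the experiment name and dataset directory are illustrative and depend on your data_prep step):

python train.py --model_name flickr_run --dataset ../data/data_flickr --n_spkrs 2   # example values, assumed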

1.4. Infer with the VAE-GAN and reconstruct raw audio with Griffin-Lim.

python inference.py --model_name [expname] --epoch [int] --trg_id 2 --src_id 1 --wavdir [path/to/testset_1]
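
A concrete instantiation (values assumed, matching the hypothetical run above) that converts speaker 1 recordings to the timbre of speaker 2:

python inference.py --model_name flickr_run --epoch 99 --trg_id 2 --src_id 1 --wavdir ../data/testset_flickr_1   # epoch and test directory are illustrative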

2. WaveNet

2.1. Prepare your data again (based on data extracted for VAE-GAN).

cd ../../data_prep
python wavenet.py --dataset ../voice_conversion/data/data_[name] --outdir ../wavenet_vocoder/egs/gaussian/data --tag [name]

2.2. Preprocess your data again (based on WaveNet specs this time).

cd ../wavenet_vocoder/egs/gaussian
spk="[name]_[id]" ./run.sh --stage 1 --stop-stage 1

2.3. Train a wavenet vocoder.

spk="[name]_[id]" hparams=conf/[name].json ./run.sh --stage 2 --stop-stage 2 

2.4. Run vocoder inference on the style-transferred reconstructions to improve their perceptual quality.

spk="[name]_[id_2]" inferdir="[expname]_[epoch]_G[id_2]_S[id_1]" hparams=conf/flickr.json ./infer.sh
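
Putting steps 2.2-2.4 together for a Flickr target speaker with id 2 (names assumed from the conventions above: tag flickr, source id 1, and the hypothetical VAE-GAN experiment flickr_run at epoch 99):

spk="flickr_2" ./run.sh --stage 1 --stop-stage 1
spk="flickr_2" hparams=conf/flickr.json ./run.sh --stage 2 --stop-stage 2
spk="flickr_2" inferdir="flickr_run_99_G2_S1" hparams=conf/flickr.json ./infer.sh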

3. FAD

3.0. Download the VGGish model pretrained on AudioSet.

cd ../../../fad
mkdir -p data
curl -o data/vggish_model.ckpt https://storage.googleapis.com/audioset/vggish_model.ckpt

3.1. Create CSVs listing the files of each timbre set (the real train set, then the fake test set, both of the same target timbre).

ls --color=never ../wavenet_vocoder/egs/gaussian/data/[name]_[id_2]/train_no_dev/*.wav  > test_audio/[name]_[id_2].csv
ls --color=never ../wavenet_vocoder/egs/gaussian/out/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1]/*_gen.wav > test_audio/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1].csv
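
With the same assumed Flickr names (target speaker 2, experiment flickr_run at epoch 99), the two listings would be, for example:

ls --color=never ../wavenet_vocoder/egs/gaussian/data/flickr_2/train_no_dev/*.wav > test_audio/flickr_2.csv
ls --color=never ../wavenet_vocoder/egs/gaussian/out/flickr_2_flickr_run_99_G2_S1/*_gen.wav > test_audio/flickr_2_flickr_run_99_G2_S1.csv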

3.2. Embed each of the timbre sets with VGGish

mkdir -p stats
python -m frechet_audio_distance.create_embeddings_main --input_files test_audio/[name]_[id_2].csv \
                                                         --stats stats/[name]_[id_2]_stats

python -m frechet_audio_distance.create_embeddings_main --input_files test_audio/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1].csv \
                                                         --stats stats/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1]_stats

3.3. Compute the Fréchet distance between the stats of the real and the generated sets.

python -m frechet_audio_distance.compute_fad --background_stats stats/[name]_[id_2]_stats \
                                             --test_stats stats/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1]_stats
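
For reference, compute_fad evaluates the standard Fréchet distance between two Gaussians fitted to the VGGish embeddings of the real and generated sets. A minimal numpy sketch of that formula (illustrative only, not the tool's actual implementation):

import numpy as np
from scipy import linalg

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Fréchet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)  # matrix square root of the covariance product
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean.real))

# mu_* are mean vectors and sigma_* covariance matrices of the 128-dim VGGish embeddings of each set.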

Pretrained Models

With respect to the current data preparation setup, the following one-to-one VAE-GANs and corresponding WaveNet vocoders were trained:

Model     Flickr    URMP
VAE-GAN   link      link
WaveNet   link      link

Pretrained VAE-GAN

  1. Create directory voice_conversion/src/saved_models/initial.
  2. Drag .pth files to that directory.
  3. Call inference with --model_name initial --epoch 99 (use --epoch 490 for URMP); see the example command below.
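
For example, converting Flickr speaker 1 to speaker 2 with the pretrained model (the test directory is a placeholder you should point at held-out source audio):

python inference.py --model_name initial --epoch 99 --trg_id 2 --src_id 1 --wavdir [path/to/testset_1]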


Pretrained WaveNet

  1. Create directory wavenet_vocoder/egs/gaussian/exp/
  2. Drag a folder such as flickr_1_train_no_dev_flickr into that directory.
  3. Drag the meanvar.joblib file from within that folder into a new directory wavenet_vocoder/egs/gaussian/dump/[spk]/logmelspectrogram/org, where [spk] corresponds to flickr_1, for example.
  4. Call ./infer.sh with appropriate arguments such as spk="flickr_1" inferdir="initial_99_G1_S2".
