In our recent paper we propose Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data.
By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles ranging from read speech to expressive speech, from slow drawls to rap and from monotonous voice to singing voice.
Visit our website for audio samples.
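For intuition about the conditioning signals, the sketch below extracts a frame-level, continuous F0 contour from a reference recording. It uses librosa's pyin purely as an illustration; Mellotron's own pipeline relies on the YIN code and data loaders bundled in this repo, and the file name and analysis parameters here are arbitrary assumptions.

# Illustration only: a continuous pitch (F0) contour of the kind Mellotron
# conditions on. librosa and these parameters are assumptions for this sketch,
# not the repo's own feature extraction.
import librosa
import numpy as np

audio, sr = librosa.load("reference.wav", sr=22050)
f0, voiced_flag, voiced_prob = librosa.pyin(
    audio,
    fmin=librosa.note_to_hz("C2"),  # ~65 Hz lower bound
    fmax=librosa.note_to_hz("C6"),  # ~1 kHz upper bound
    sr=sr,
    frame_length=1024,
    hop_length=256,
)
f0 = np.nan_to_num(f0)  # unvoiced frames come back as NaN; zero them out
print(f0.shape)  # one pitch value per analysis frame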
To set up, clone the repo, initialize its submodules, and install the Python requirements:
git clone https://github.com/NVIDIA/mellotron.git
cd mellotron
git submodule init; git submodule update
pip install -r requirements.txt
To train from scratch and monitor progress with TensorBoard:
python train.py --output_directory=outdir --log_directory=logdir
tensorboard --logdir=outdir/logdir
Training from a pre-trained model can lead to faster convergence. By default, the speaker embedding layer is ignored when warm-starting:
python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start
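The sketch below shows one way such a warm start can be implemented: load the checkpoint, drop the layers that must be re-initialized for the new speaker set, and load the rest non-strictly. The key names ("state_dict", "speaker_embedding.weight") are assumptions about the checkpoint layout, not verified against train.py.

# Hypothetical warm-start helper; checkpoint and layer key names are assumed.
import torch

def warm_start(model, checkpoint_path, ignore_keys=("speaker_embedding.weight",)):
    # Load pre-trained weights but skip layers whose shapes depend on the
    # new dataset (e.g. the speaker embedding table).
    state_dict = torch.load(checkpoint_path, map_location="cpu")["state_dict"]
    filtered = {k: v for k, v in state_dict.items() if k not in ignore_keys}
    # strict=False leaves the skipped layers at their fresh initialization.
    model.load_state_dict(filtered, strict=False)
    return model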
For distributed (multi-GPU) and mixed-precision (FP16) training:
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
For the inference demo, start a Jupyter server and open the inference notebook:
jupyter notebook --ip=127.0.0.1 --port=31337
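A minimal sketch of what the notebook does when loading the model is below. The imports and the "state_dict" checkpoint key are assumptions carried over from NVIDIA's Tacotron 2 codebase; the actual notebook additionally extracts pitch and rhythm from reference audio or a music score before synthesis.

# Sketch of model loading for inference; create_hparams, load_model, and the
# 'state_dict' key are assumed names, not verified against this repo.
import torch
from hparams import create_hparams
from train import load_model

hparams = create_hparams()
model = load_model(hparams)
checkpoint = torch.load("models/mellotron_libritts.pt", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
model.eval()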
Related repo: WaveGlow, a faster-than-real-time flow-based generative network for speech synthesis.
This implementation uses code from the following repos: Keith Ito, Prem Seetharaman, Chengqi Deng, and Patrice Guyot, as described in our code.