This is an implementation of the paper "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention" (https://arxiv.org/abs/1710.08969).
The code is based on the following implementations:
The model trains "text2mel" and "SSRN" separately, through trainmel.py and trainmag.py respectively. You need to download the LJSpeech dataset, available at https://keithito.com/LJ-Speech-Dataset/
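The "guided attention" in the paper's title is an auxiliary loss applied during text2mel training that nudges the attention matrix toward a near-diagonal (monotonic) alignment between characters and mel frames. A minimal NumPy sketch of that loss, following the paper's formula with width g = 0.2 (this is an illustration, not the repo's actual code):

```python
import numpy as np

def guided_attention_weights(N, T, g=0.2):
    """Penalty matrix W[n, t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)):
    near zero along the diagonal, growing as attention strays off it."""
    n = np.arange(N).reshape(-1, 1) / N   # normalized character positions
    t = np.arange(T).reshape(1, -1) / T   # normalized mel-frame positions
    return 1.0 - np.exp(-((n - t) ** 2) / (2 * g ** 2))

def guided_attention_loss(A, g=0.2):
    """A: attention matrix of shape (N, T); the loss is the mean of A * W,
    so attention mass far from the diagonal is penalized."""
    N, T = A.shape
    return np.mean(A * guided_attention_weights(N, T, g))
```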
You can listen to Audio Samples.
Pre-trained models can be downloaded here.
First, you have to prepare the dataset. If you want to use the LJSpeech dataset, you can use the following commands:
$ wget http://data.keithito.com/data/speech/LJSpeech-1.0.tar.bz2
$ tar xvf LJSpeech-1.0.tar.bz2
$ python prepro.py
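prepro.py prepares the training features, presumably the mel and linear magnitude spectrograms the two networks are trained on. A rough sketch of that kind of extraction, assuming librosa and typical DC-TTS hyperparameters (the repo's actual values and output layout may differ):

```python
import numpy as np
import librosa

# Assumed hyperparameters; check hyperparams in the repo for the real ones.
sr, n_fft, hop, n_mels = 22050, 2048, 256, 80

def extract_features(wav_path):
    """Return (mel, mag) spectrograms for one utterance."""
    y, _ = librosa.load(wav_path, sr=sr)
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))          # linear magnitude
    mel = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels) @ mag  # mel projection
    return mel.astype(np.float32), mag.astype(np.float32)
```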
$ python trainmel.py
During training you can review the output (by default every 200 minibatches): the first two examples in the batch are dumped to mel0.png and mel1.png, and you can view the learned attention through a0.png and a1.png.
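For reference, a hypothetical version of such a periodic image dump, assuming matplotlib and that `step`, the predicted mel `mel_pred` (n_mels × T), and the attention matrix `attn` (N × T) come from the training loop:

```python
import matplotlib
matplotlib.use("Agg")            # render straight to files, no display needed
import matplotlib.pyplot as plt

def dump_images(step, mel_pred, attn, every=200):
    """Save the first example's predicted mel and attention as PNGs."""
    if step % every == 0:
        plt.imsave("mel0.png", mel_pred, origin="lower")
        plt.imsave("a0.png", attn, origin="lower")
```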
$ python trainmag.py
During training you can view the output through mag0.png and mag1.png, which compare the learned spectrograms with the ground truth.
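The side-by-side comparison could look like this sketch, assuming predicted and ground-truth magnitude spectrograms `mag_pred` and `mag_true` as 2-D NumPy arrays:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

def dump_comparison(mag_pred, mag_true, path="mag0.png"):
    """Stack predicted and ground-truth spectrograms in one figure."""
    fig, (top, bottom) = plt.subplots(2, 1, sharex=True)
    top.imshow(mag_pred, origin="lower", aspect="auto")
    top.set_title("predicted")
    bottom.imshow(mag_true, origin="lower", aspect="auto")
    bottom.set_title("ground truth")
    fig.savefig(path)
    plt.close(fig)
```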
To synthesize a new sentence, use:
$ python synth.py --text "sentence to synthesize" --file output.wav
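Synthesis ends by inverting the magnitude spectrogram predicted by SSRN back to a waveform; a common choice for that step is Griffin-Lim. A sketch using librosa and soundfile, with the same assumed STFT settings as in preprocessing (the repo's actual inversion may differ):

```python
import librosa
import soundfile as sf

def mag_to_wav(mag, out_path, sr=22050, n_fft=2048, hop=256):
    """mag: (1 + n_fft // 2, T) linear magnitude spectrogram from SSRN."""
    # Implementations often sharpen mag with a small power first (e.g. mag ** 1.3).
    wav = librosa.griffinlim(mag, hop_length=hop, win_length=n_fft)
    sf.write(out_path, wav, sr)
```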
You can run a demo web server for TTS by running:
$ python server.py
This uses the Flask framework to run the demo.
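server.py's interface isn't shown here, but a minimal Flask endpoint of this shape illustrates the idea; the `/tts` route is an assumption, and the audio is generated by shelling out to the synth.py CLI documented above:

```python
import subprocess
from flask import Flask, request, send_file

app = Flask(__name__)

@app.route("/tts")
def tts():
    text = request.args.get("text", "")
    # Reuse the synth.py command line shown above to generate the audio.
    subprocess.run(["python", "synth.py", "--text", text, "--file", "output.wav"],
                   check=True)
    return send_file("output.wav", mimetype="audio/wav")

if __name__ == "__main__":
    app.run(port=5000)
```

With a server of this shape running, a request such as `curl "http://localhost:5000/tts?text=hello" -o hello.wav` would return the synthesized audio.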