TensorFlow implementation of DeepMind's Tacotron-2, a deep neural network architecture described in the paper: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Note:
Tacotron-2
├── datasets
├── en_UK (0)
│   └── by_book
│       └── female
├── en_US (0)
│   └── by_book
│       ├── female
│       └── male
├── LJSpeech-1.1 (0)
│   └── wavs
├── logs-Tacotron (2)
│   ├── eval-dir
│   │   ├── plots
│   │   └── wavs
│   ├── mel-spectrograms
│   ├── plots
│   ├── pretrained
│   └── wavs
├── logs-Wavenet (4)
│   ├── eval-dir
│   │   ├── plots
│   │   └── wavs
│   ├── plots
│   ├── pretrained
│   └── wavs
├── papers
├── tacotron
│   ├── models
│   └── utils
├── tacotron_output (3)
│   ├── eval
│   ├── gta
│   ├── logs-eval
│   │   ├── plots
│   │   └── wavs
│   └── natural
├── wavenet_output (5)
│   ├── plots
│   └── wavs
├── training_data (1)
│   ├── audio
│   ├── linear
│   └── mels
└── wavenet_vocoder
    └── models
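The tree above shows the current state of the repository (separate training, one step at a time). The numbers in parentheses mark the step each folder belongs to: (0) the downloaded dataset, (1) preprocessing, (2) Tacotron training, (3) Tacotron synthesis, (4) WaveNet training and (5) WaveNet synthesis.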
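As a rough illustration of how the numbered folders can be read, the sketch below reports which steps already have an output folder. The mapping from step number to folder is our reading of the tree, and `STEP_FOLDERS` and `completed_steps` are hypothetical names that do not exist in the repository.

```python
from pathlib import Path

# Folder names are copied from the tree; the step numbers follow the markers
# in parentheses. This mapping is an assumption, not repository code.
STEP_FOLDERS = [
    (1, "training_data"),
    (2, "logs-Tacotron"),
    (3, "tacotron_output"),
    (4, "logs-Wavenet"),
    (5, "wavenet_output"),
]


def completed_steps(repo_root):
    """Return the step numbers whose output folder exists under repo_root."""
    root = Path(repo_root)
    return [step for step, folder in STEP_FOLDERS if (root / folder).is_dir()]


if __name__ == "__main__":
    # Run from inside the Tacotron-2 folder.
    print("Steps with an output folder:", completed_steps("."))
```

A folder that exists only shows that a step was started, not that it finished, so treat the result as a hint rather than a status report.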
Note:
The model described by the authors can be divided into two parts: a spectrogram (feature) prediction network and a WaveNet vocoder.
For an in-depth exploration of the model architecture, training procedure and preprocessing logic, refer to our wiki.
For an overview of our progress on this project, please refer to this discussion.
Since the two parts of the global model are trained separately, we can start by training the feature prediction model and use its predictions later during the WaveNet training.
First, you need Python 3 installed along with TensorFlow.
Next, install the requirements. If you are an Anaconda user, run the command below (otherwise replace pip with pip3 and python with python3):
pip install -r requirements.txt
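Before moving on, it can help to confirm that the interpreter and TensorFlow are visible from the environment you installed into. The following is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper, not part of this repository.

```python
import importlib.util
import sys


def check_environment(required=("tensorflow",)):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if sys.version_info.major < 3:
        problems.append("Python 3 is required")
    for name in required:
        # find_spec returns None when a top-level package cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append("missing package: " + name)
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("Environment looks fine" if not issues else "\n".join(issues))
```

This only checks that the package can be found, not that its version matches what requirements.txt expects.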
We tested the code above on the LJSpeech dataset, which contains almost 24 hours of labeled voice recordings from a single female speaker. (Further information on the dataset is available in the README file included with the download.)
We are also currently running tests on the new M-AILABS speech dataset, which contains more than 700 hours of speech (more than 80 GB of data) in more than 10 languages.
After downloading the dataset, extract the compressed file, and place the folder inside the cloned repository.
Before running the following steps, please make sure you are inside the Tacotron-2 folder:
cd Tacotron-2
Preprocessing can then be started using:
python preprocess.py
The dataset can be chosen using the --dataset argument. If you use the M-AILABS dataset, you also need to provide the language, voice, reader, merge_books and book arguments to suit your needs. The default dataset is LJSpeech.
Example M-AILABS:
python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=False --book='northandsouth'
or if you want to use all books for a single speaker:
python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=True
This should take no longer than a few minutes.
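If you preprocess several readers or books, assembling the command programmatically avoids quoting mistakes. The sketch below builds the argument list using only the flags shown in the examples above; `build_preprocess_command` is a hypothetical helper, and we have not verified how preprocess.py parses each value.

```python
import sys


def build_preprocess_command(dataset=None, language=None, voice=None,
                             reader=None, merge_books=None, book=None):
    """Build the argument list for preprocess.py, skipping unset options."""
    options = [
        ("dataset", dataset),
        ("language", language),
        ("voice", voice),
        ("reader", reader),
        ("merge_books", merge_books),
        ("book", book),
    ]
    # sys.executable is the interpreter running this script, which avoids the
    # python/python3 naming difference mentioned earlier.
    command = [sys.executable, "preprocess.py"]
    for name, value in options:
        if value is not None:
            command.append("--{}={}".format(name, value))
    return command


if __name__ == "__main__":
    print(build_preprocess_command(dataset="M-AILABS", language="en_US",
                                   voice="female", reader="mary_ann",
                                   merge_books=True))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from inside the Tacotron-2 folder.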
To train both models sequentially (one after the other):
python train.py --model='Tacotron-2'
or:
python train.py --model='Both'
The feature prediction model can be trained separately using:
python train.py --model='Tacotron'
Checkpoints are saved every 250 steps and stored under the logs-Tacotron folder.
Similarly, the WaveNet can be trained separately with:
python train.py --model='WaveNet'
Logs will be stored inside logs-Wavenet.
Note:
To synthesize audio in an End-to-End (text to audio) manner (both models at work):
python synthesize.py --model='Tacotron-2'
For the spectrogram prediction network alone, there are three types of mel spectrogram synthesis: evaluation, natural synthesis (GTA=False) and ground-truth-aligned synthesis (GTA=True):
python synthesize.py --model='Tacotron' --mode='eval'
python synthesize.py --model='Tacotron' --GTA=False
python synthesize.py --model='Tacotron' --GTA=True
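The three commands above can be summarized as a small lookup from synthesis type to flag. In this sketch the output locations are an assumption based on the eval, natural and gta folders listed under tacotron_output in the repository tree, and `tacotron_synthesis` is a hypothetical helper.

```python
# Flag per synthesis type, copied from the commands above. The subfolder names
# are assumed from the repository tree, not read from the code.
TACOTRON_SYNTHESIS = {
    "eval": ("--mode=eval", "eval"),
    "natural": ("--GTA=False", "natural"),
    "gta": ("--GTA=True", "gta"),
}


def tacotron_synthesis(kind):
    """Return (command, assumed output folder) for one synthesis type."""
    if kind not in TACOTRON_SYNTHESIS:
        raise ValueError("unknown synthesis type: {!r}".format(kind))
    flag, subfolder = TACOTRON_SYNTHESIS[kind]
    command = ["python", "synthesize.py", "--model=Tacotron", flag]
    return command, "tacotron_output/" + subfolder


if __name__ == "__main__":
    for kind in TACOTRON_SYNTHESIS:
        command, folder = tacotron_synthesis(kind)
        print(kind, "->", " ".join(command), "(expected in", folder + ")")
```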
Waveforms conditioned on previously synthesized mel spectrograms can be synthesized separately with:
python synthesize.py --model='WaveNet'
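Putting the separate-training commands together, one plausible order is sketched below. The ordering, and in particular the use of ground-truth-aligned synthesis before WaveNet training, is our assumption based on the step numbers in the repository tree; `SEPARATE_TRAINING_STEPS` and `run_steps` are hypothetical names.

```python
import subprocess

# One command per step, in the assumed order (1) to (5). Every command is
# taken from this README; only the ordering is an assumption.
SEPARATE_TRAINING_STEPS = [
    (1, ["python", "preprocess.py"]),
    (2, ["python", "train.py", "--model=Tacotron"]),
    (3, ["python", "synthesize.py", "--model=Tacotron", "--GTA=True"]),
    (4, ["python", "train.py", "--model=WaveNet"]),
    (5, ["python", "synthesize.py", "--model=WaveNet"]),
]


def run_steps(first_step=1, dry_run=True):
    """Run every step from first_step onward, or only print them if dry_run."""
    selected = [cmd for step, cmd in SEPARATE_TRAINING_STEPS if step >= first_step]
    for command in selected:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the sequence as soon as one command fails.
            subprocess.run(command, check=True)
    return selected


if __name__ == "__main__":
    run_steps(first_step=1, dry_run=True)
```

Starting from a later step, for example `run_steps(first_step=4)`, resumes at WaveNet training once the earlier outputs exist.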
Note:
Pre-trained models and audio samples will be added at a later date. In the meantime, you can check some preliminary insights into the model's performance (at early stages of training) here.