A Text-to-Speech Transformer in TensorFlow 2
Implementation of a non-autoregressive Transformer-based neural network for Text-to-Speech (TTS).
This repo is based, among others, on several papers on Transformer-based TTS.
Our pre-trained LJSpeech model is compatible with the pre-trained MelGAN and HiFiGAN vocoders (older versions are also available for WaveRNN).
For quick inference with these vocoders, check out the Vocoding branch.
Being non-autoregressive, this Transformer model is robust (no repeats or failed attention modes on challenging sentences) and fast (the full spectrogram is predicted in a single forward pass).
These samples' spectrograms are converted using the pre-trained MelGAN vocoder.
Try it out on Colab:
Make sure you have Python 3.6 or newer.
Install espeak as the phonemizer backend (on macOS, use brew):
sudo apt-get install espeak
Then install the rest with pip:
pip install -r requirements.txt
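To verify that the phonemizer backend can reach espeak (an optional sanity check; the phonemizer package is installed via requirements.txt):

```python
# Optional sanity check: phonemizer should find the espeak backend.
from phonemizer import phonemize

print(phonemize('Please, say something.', language='en-us', backend='espeak'))
```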
Read the individual scripts for more command line arguments.
Use our pre-trained model (with Griffin-Lim) from the command line with
python predict_tts.py -t "Please, say something."
Or in a Python script:
```python
from data.audio import Audio
from model.factory import tts_ljspeech

model = tts_ljspeech()
audio = Audio.from_config(model.config)
out = model.predict('Please, say something.')

# Convert spectrogram to wav (with Griffin-Lim)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
```
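To listen to the result, write the waveform to disk. A minimal sketch using the soundfile package (an extra dependency, not in requirements.txt) and assuming the LJSpeech sampling rate of 22050 Hz:

```python
import soundfile as sf

# Write the reconstructed waveform to disk.
# 22050 Hz is the LJSpeech sampling rate; adjust if your audio config differs.
sf.write('sample.wav', wav, 22050)
```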
You can specify the model step with the --step flag (command line) or the step parameter (Python script). Steps from 60000 to 100000 are available at intervals of 5000 steps (60000, 65000, ..., 95000, 100000).
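For example, to load the 95000-step checkpoint from a script (a sketch assuming the step parameter mirrors the --step flag):

```python
# Load a specific pre-trained checkpoint step.
# Assumption: the factory accepts a `step` argument matching the --step CLI flag.
from model.factory import tts_ljspeech

model = tts_ljspeech(step='95000')
```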
IMPORTANT: make sure to check out the correct repository version to use the API; currently that is commit 493be6345341af0df3ae829de79c2793c9afd0ec.
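For example:

git checkout 493be6345341af0df3ae829de79c2793c9afd0ec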
You can directly use LJSpeech to create the training dataset.
Use config/training_config.yaml to create MelGAN or HiFiGAN compatible models, or swap in the content of data_config_wavernn.yaml in config/training_config.yaml to create models compatible with WaveRNN. In config/training_config.yaml, also edit the paths to point at your dataset and log folders.

Prepare a folder containing your metadata and wav files, for instance:
```
|- dataset_folder/
|   |- metadata.csv
|   |- wavs/
|       |- file1.wav
|       |- ...
```
If metadata.csv has the following format:

wav_file_name|transcription

you can use the ljspeech preprocessor in data/metadata_readers.py; otherwise, add your own reader to the same file.
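A custom reader could look like the sketch below; the function name and exact signature are assumptions here, so check the ljspeech reader in data/metadata_readers.py for the interface the repo actually expects:

```python
# Hypothetical custom metadata reader for data/metadata_readers.py.
# Assumes one "wav_file_name|transcription" pair per line, as described above.
def my_dataset(metadata_path: str, column_sep: str = '|') -> dict:
    text_dict = {}
    with open(metadata_path, 'r', encoding='utf-8') as f:
        for line in f:
            file_name, transcription = line.strip().split(column_sep, 1)
            text_dict[file_name] = transcription
    return text_dict
```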
Make sure that:
- the name of your metadata reader function matches the data_name field in training_config.yaml;
- the metadata_path in training_config.yaml points at your metadata file.
Change the --config argument based on the configuration of your choice.
python create_training_data.py --config config/training_config.yaml
This will populate the training data directory (default: transformer_tts_data.ljspeech).
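As an optional check, you can list what was generated (a hypothetical snippet; the directory name is the default above):

```python
# Sanity check: list the contents of the generated training data directory.
from pathlib import Path

data_dir = Path('transformer_tts_data.ljspeech')  # default output directory
print(sorted(p.name for p in data_dir.iterdir()))
```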
Train the aligner model with

python train_aligner.py --config config/training_config.yaml
Then use the trained aligner to create the durations dataset:

python extract_durations.py --config config/training_config.yaml

This will add the durations.<session name> folder, as well as the char-wise pitch folders, to the training data directory. Finally, train the TTS model:

python train_tts.py --config config/training_config.yaml
Training and model settings can be configured in training_config.yaml. To restart training, delete the weights and/or the logs from the logs folder using the training flags --reset_dir (deletes both), --reset_logs, or --reset_weights.
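For example, to wipe both weights and logs before restarting (assuming the flag is passed to the training script):

python train_tts.py --config config/training_config.yaml --reset_dir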
Monitor training progress with TensorBoard:

tensorboard --logdir /logs/directory/
From the command line with
python predict_tts.py -t "Please, say something." -p /path/to/weights/
Or in a Python script:
```python
from model.models import ForwardTransformer
from data.audio import Audio

model = ForwardTransformer.load_model('/path/to/weights/')
audio = Audio.from_config(model.config)
out = model.predict('Please, say something.')

# Convert spectrogram to wav (with Griffin-Lim)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
```
Access the pre-trained models with the API call. Older weights, along with the repository and vocoder commits they require, are listed below.
| Old weights | Commit | Vocoder Commit |
|---|---|---|
| ljspeech_tts_model | 0cd7d33 | aca5990 |
| ljspeech_melgan_forward_model | 1c1cb03 | aca5990 |
| ljspeech_melgan_autoregressive_model_v2 | 1c1cb03 | aca5990 |
| ljspeech_wavernn_forward_model | 1c1cb03 | 3595219 |
| ljspeech_wavernn_autoregressive_model_v2 | 1c1cb03 | 3595219 |
| ljspeech_wavernn_forward_model | d9ccee6 | 3595219 |
| ljspeech_wavernn_autoregressive_model_v2 | d9ccee6 | 3595219 |
| ljspeech_wavernn_autoregressive_model_v1 | 2f3a1b5 | 3595219 |
MelGAN and WaveRNN: the data normalization and the vocoders used for the samples come from these repos.
Thanks to Erogol and the Mozilla TTS team for the lively exchange on the topic.
See LICENSE for details.