https://as-ideas.github.io/TransformerTTS/



A Text-to-Speech Transformer in TensorFlow 2

Implementation of a non-autoregressive Transformer based neural network for Text-to-Speech (TTS).
This repo is based, among others, on the following papers:

Our pre-trained LJSpeech model is compatible with the pre-trained vocoders:

(older versions are also available for WaveRNN)

For quick inference with these vocoders, check out the Vocoding branch

Non-Autoregressive

Being non-autoregressive, this Transformer model is:

πŸ”ˆ Samples

Can be found here.

The samples' spectrograms are converted to waveforms with the pre-trained MelGAN vocoder.

Try it out on Colab:

Open In Colab

Updates

πŸ“– Contents

Installation

Make sure you have:

Install espeak as the phonemizer backend (on macOS, use brew):

sudo apt-get install espeak

Then install the rest with pip:

pip install -r requirements.txt

Read the individual scripts for more command line arguments.

Pre-Trained LJSpeech API

Use our pre-trained model (with Griffin-Lim vocoding) from the command line with

python predict_tts.py -t "Please, say something."

Or in a Python script

from data.audio import Audio
from model.factory import tts_ljspeech

model = tts_ljspeech()
audio = Audio.from_config(model.config)
out = model.predict('Please, say something.')

# Convert the mel spectrogram to a waveform with Griffin-Lim
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
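The returned waveform can then be written to disk. A minimal sketch using only the standard library, assuming `wav` is a sequence of floats in [-1, 1] (22050 Hz is LJSpeech's sample rate; the `save_wav` helper is illustrative, not part of the repo's API):

```python
import struct
import wave

def save_wav(samples, path, sample_rate=22050):
    """Write floats in [-1, 1] as a 16-bit PCM mono WAV file."""
    with wave.open(path, 'wb') as f:
        f.setnchannels(1)           # mono
        f.setsampwidth(2)           # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(b''.join(
            struct.pack('<h', int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples))

# e.g. save_wav(wav, 'output.wav') after the prediction above
```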

You can specify the model step with the --step flag (command line) or the step parameter (Python script).
Steps from 60000 to 100000 are available at a frequency of 5K steps (60000, 65000, ..., 95000, 100000).
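The valid values for the step argument can be enumerated directly; a quick sketch:

```python
# Published LJSpeech checkpoints span steps 60,000 to 100,000
# at 5,000-step intervals.
available_steps = list(range(60_000, 100_001, 5_000))
print(available_steps)  # 9 checkpoints: 60000, 65000, ..., 100000
```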

IMPORTANT: make sure to check out the correct repository version to use the API.
The current version is 493be6345341af0df3ae829de79c2793c9afd0ec

Dataset

You can directly use LJSpeech to create the training dataset.

Configuration

Custom dataset

Prepare a folder containing your metadata and wav files, for instance

|- dataset_folder/
|   |- metadata.csv
|   |- wavs/
|       |- file1.wav
|       |- ...

If metadata.csv has the format wav_file_name|transcription, you can use the ljspeech preprocessor in data/metadata_readers.py; otherwise, add your own reader to the same file.
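Such a reader can be sketched in a few lines (illustrative only; the actual preprocessor lives in data/metadata_readers.py):

```python
def read_metadata(path):
    """Parse metadata.csv lines of the form wav_file_name|transcription."""
    entries = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split only on the first pipe so the transcription
            # may itself contain '|' characters.
            wav_name, transcription = line.split('|', 1)
            entries[wav_name] = transcription
    return entries
```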

Make sure that:

Training

Change the --config argument based on the configuration of your choice.

Train Aligner Model

Create training dataset

python create_training_data.py --config config/training_config.yaml

This will populate the training data directory (default transformer_tts_data.ljspeech).

Training

python train_aligner.py --config config/training_config.yaml

Train TTS Model

Compute alignment dataset

First use the aligner model to create the durations dataset

python extract_durations.py --config config/training_config.yaml

This will add the durations.&lt;session name&gt; folder as well as the char-wise pitch folders to the training data directory.
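At TTS training time, per-character durations like these are used to expand the encoder outputs to frame level. A minimal sketch of such a length regulator (names are illustrative, not the repo's API):

```python
def length_regulate(char_features, durations):
    """Repeat each character's feature vector by its duration in mel
    frames, as non-autoregressive TTS models do before decoding."""
    frames = []
    for feature, duration in zip(char_features, durations):
        frames.extend([feature] * duration)
    return frames
```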

Training

python train_tts.py --config config/training_config.yaml

Training & Model configuration

Resume or restart training

Monitor training

tensorboard --logdir /logs/directory/

Tensorboard Demo

Prediction

With model weights

From command line with

python predict_tts.py -t "Please, say something." -p /path/to/weights/

Or in a Python script

from model.models import ForwardTransformer
from data.audio import Audio
model = ForwardTransformer.load_model('/path/to/weights/')
audio = Audio.from_config(model.config)
out = model.predict('Please, say something.')

# Convert the mel spectrogram to a waveform with Griffin-Lim
wav = audio.reconstruct_waveform(out['mel'].numpy().T)

Model Weights

Access the latest pre-trained models through the API call described above.

Old weights:

| Model | Commit | Vocoder Commit |
| --- | --- | --- |
| ljspeech_tts_model | 0cd7d33 | aca5990 |
| ljspeech_melgan_forward_model | 1c1cb03 | aca5990 |
| ljspeech_melgan_autoregressive_model_v2 | 1c1cb03 | aca5990 |
| ljspeech_wavernn_forward_model | 1c1cb03 | 3595219 |
| ljspeech_wavernn_autoregressive_model_v2 | 1c1cb03 | 3595219 |
| ljspeech_wavernn_forward_model | d9ccee6 | 3595219 |
| ljspeech_wavernn_autoregressive_model_v2 | d9ccee6 | 3595219 |
| ljspeech_wavernn_autoregressive_model_v1 | 2f3a1b5 | 3595219 |

Maintainers

Special thanks

MelGAN and WaveRNN: the data normalization scheme and the samples' vocoders come from these repos.

Erogol and the Mozilla TTS team for the lively exchange on the topic.

Copyright

See LICENSE for details.