
A TensorFlow Implementation of Expressive Tacotron

This project implements the paper Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron in order to verify its concept. Most of the baseline code is based on my previous Tacotron implementation.

Requirements

Data

Because the paper used internal data, I trained the model on the LJ Speech Dataset instead.

The LJ Speech Dataset has recently become a widely used benchmark for TTS because it is publicly available. It contains roughly 24 hours of reasonable-quality samples.
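Like the baseline Tacotron, the model consumes mel spectrograms rather than raw waveforms. Below is a minimal numpy sketch of that preprocessing step; the parameters (22050 Hz sample rate, 2048-point FFT, hop of 256, 80 mel bands) are typical Tacotron values and assumptions here, not necessarily this repo's hyperparameters, which likely uses librosa instead.

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy mel formula
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=22050, n_fft=2048, n_mels=80):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def melspectrogram(wav, sr=22050, n_fft=2048, hop=256, n_mels=80):
    # Frame the signal, apply a Hann window, take the magnitude STFT,
    # then project onto the mel filterbank and compress with log.
    wav = np.pad(wav, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(wav) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([wav[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft))        # (frames, n_fft//2 + 1)
    mel = mag @ mel_filterbank(sr, n_fft, n_mels).T   # (frames, n_mels)
    return np.log(mel + 1e-5)

# One second of noise standing in for an LJ Speech clip.
wav = np.random.randn(22050).astype(np.float32)
mel = melspectrogram(wav)
print(mel.shape)  # (87, 80): 87 frames, 80 mel bands
```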

Training

Sample Synthesis

I generate speech samples from the same script as the one used for the original web demo. You can find it in test_sents.txt.

Samples

16 sample sentences from the first chapter of the original web demo were collected for sample synthesis. Two audio clips per sentence are used for prosody embedding: a reference voice and a base voice. In most cases, the two differ in gender or region. The samples are organized as follows:
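To clarify how a single reference clip conditions synthesis: the paper's reference encoder compresses the reference audio's mel spectrogram into one fixed-length prosody embedding, which is then broadcast across time and concatenated to every text encoder state. A shape-level numpy sketch is below; the dimensions (256-d text states, 128-d prosody embedding) are assumptions for illustration, not the repo's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes; the real values live in the repo's hyperparams.
T, d_text, d_prosody = 50, 256, 128

# Text encoder outputs: one vector per input character timestep.
text_enc = rng.standard_normal((T, d_text))

# Fixed-length prosody embedding, normally produced by the reference
# encoder from the reference clip's mel spectrogram (stand-in here).
prosody = rng.standard_normal((d_prosody,))

# Broadcast the single prosody vector across time and concatenate it
# to every encoder state, so the decoder sees it at each step.
conditioned = np.concatenate([text_enc, np.tile(prosody, (T, 1))], axis=-1)

print(conditioned.shape)  # (50, 384)
```

Because the prosody vector is identical at every timestep, swapping in a different reference clip changes the style of the whole utterance without touching the text encoding.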

Check out the samples at each training step.

Analysis

Notes