Kyubyong / tacotron

A TensorFlow Implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model
Apache License 2.0

Generating natural speech, reducing noise #44

Open jaron opened 7 years ago

jaron commented 7 years ago

The published samples seem to have very low background noise - is this a result of the 2 million training steps mentioned in the paper progressively reducing the non-signal parts of the output to silence?

Or is the silence achieved by some other post-processing, like a denoising autoencoder or a low-pass filter?

What would still need to be implemented to enable this code to generate natural-sounding, non-robotic speech? I'd be interested to hear your thoughts, and to help out if I can.
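For concreteness, a post-processing low-pass filter of the kind mentioned above might look like the sketch below. This is only an illustration of the idea being asked about, not something the published samples are known to use; the file names and cutoff frequency are assumptions.

```python
# Minimal sketch of a low-pass post-processing step (illustrative only).
# "sample.wav" and the 4 kHz cutoff are assumed values, not taken from the repo.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, filtfilt

sr, wav = wavfile.read("sample.wav")            # a generated sample (hypothetical path)
wav = wav.astype(np.float32)

cutoff_hz = 4000                                # assumed cutoff; must stay below sr / 2
b, a = butter(N=5, Wn=cutoff_hz / (sr / 2), btype="low")
filtered = filtfilt(b, a, wav)                  # zero-phase low-pass filtering

wavfile.write("sample_lowpass.wav", sr, filtered.astype(np.int16))
```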

sonach commented 7 years ago

(1) "is this a result of the 2 million training steps ": Maybe. I am checking this. but, does one step correspond to one batch(batch_size=32) with 32 utterance? If so, 2M steps will need a huge amout of time. (2) "Or is the silence achieved by some other post-processing": According to the paper, no other post-processing is used. CBHG is the post-processing network.