jaron opened this issue 7 years ago

The published samples seem to have very low background noise - is this a result of the 2 million training steps mentioned in the paper progressively reducing the non-signal parts of the output to silence?

Or is the silence achieved by some other post-processing, like a denoising autoencoder or a low-pass filter?

What would still need to be implemented to enable this code to generate natural-sounding, non-robotic speech? I'd be interested to hear your thoughts, and to help out if I can.

Reply:

(1) "is this a result of the 2 million training steps": Maybe; I am checking this. But does one step correspond to one batch (batch_size=32) of 32 utterances? If so, 2M steps will take a huge amount of time.

(2) "Or is the silence achieved by some other post-processing": According to the paper, no other post-processing is used; the CBHG module is the post-processing network.
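To make the scale of "one step = one batch of 32 utterances" concrete, here is a back-of-envelope sketch. The dataset size is a hypothetical value for illustration (roughly the size of the LJ Speech corpus); only the 2M steps and batch size of 32 come from the discussion above.

```python
# Back-of-envelope: if one training step consumes one batch of 32 utterances,
# how many utterance presentations (and rough epochs) do 2M steps imply?
steps = 2_000_000
batch_size = 32  # from the discussion above
utterances_seen = steps * batch_size

# Hypothetical dataset of ~13,100 clips (LJ Speech-sized), purely for scale:
dataset_size = 13_100
epochs = utterances_seen / dataset_size

print(f"{utterances_seen:,} utterance presentations")
print(f"~{epochs:,.0f} passes over a {dataset_size:,}-clip dataset")
```

This is why 2M steps translates to days or weeks of wall-clock time on a single GPU, depending on hardware and sequence lengths.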
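For anyone who wants to experiment with the low-pass filtering the question mentions (even though, per the reply, the paper relies on the CBHG network rather than extra post-processing), a minimal sketch using SciPy. The cutoff frequency and filter order here are arbitrary choices, not values from the paper:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass(audio, sr, cutoff_hz=7600, order=5):
    """Zero-phase Butterworth low-pass filter.

    cutoff_hz and order are illustrative defaults, not tuned values.
    """
    nyquist = sr / 2
    b, a = butter(order, cutoff_hz / nyquist, btype="low")
    # filtfilt runs the filter forward and backward, avoiding phase shift
    return filtfilt(b, a, audio)

# Toy usage: filter one second of white noise sampled at 22.05 kHz
sr = 22050
noise = np.random.randn(sr)
filtered = lowpass(noise, sr)
```

Note that a plain low-pass filter only attenuates high-frequency hiss; it cannot remove broadband background noise in the speech band, which is why trained post-nets (or a denoising model) are the more common approach.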