Open msobhan69 opened 7 years ago
I don't know, honestly. Does the original paper mention anything about it?
Dear @Kyubyong , the paper just said, "since Tacotron generates speech at the frame level, it’s substantially faster than sample-level autoregressive methods", and nothing more.
@msobhan69 Thanks. I believe what the paper said is true, but I don't know if it means Tacotron can generate samples real-time.
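The paper's frame-level claim can be made concrete with a rough step count: an autoregressive sample-level model (e.g. WaveNet) runs one sequential step per audio sample, while Tacotron's decoder runs one step per spectrogram frame (or per r frames). A minimal sketch, assuming a 22050 Hz sample rate and the 12.5 ms frame shift from the Tacotron paper (neither is stated in this thread):

```python
# Back-of-the-envelope comparison of sequential step counts.
# Assumptions (not from this thread): 22050 Hz sample rate,
# 12.5 ms frame shift, reduction factor r = 1.
SAMPLE_RATE = 22050
FRAME_SHIFT_S = 0.0125

def step_counts(duration_s, r=1):
    """Return (sample_level_steps, frame_level_steps) for a clip."""
    samples = int(duration_s * SAMPLE_RATE)          # one step per sample
    frames = int(duration_s / FRAME_SHIFT_S) // r    # one step per r frames
    return samples, frames

samples, frames = step_counts(4.0)
print(samples, frames)  # 88200 vs 320 sequential steps for a 4 s clip
```

Fewer sequential steps is why frame-level generation is "substantially faster", but it says nothing by itself about whether a given implementation reaches real-time.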
I trained the model on a single sample. The outputs from eval.py are completely noisy, but they are still recognizable as human speech. Generating one sample (input: 80 time steps, ~4 s of audio) takes the following times:
encoding_decoding: 23.3s, spectrogram2wav(): 0.96s
That is much slower than real-time. Isn't Tacotron inference real-time?
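For reproducing these numbers, a small timing harness around the two stages helps isolate where the time goes. This is only a sketch: the two stage functions below are placeholders standing in for the decoder pass and the Griffin-Lim `spectrogram2wav()` call in eval.py, not the actual repo code:

```python
import time

def encode_decode(text):
    # Placeholder for the encoder/decoder forward pass.
    return [0.0] * 80  # pretend spectrogram: 80 frames

def spectrogram2wav(spec):
    # Placeholder for Griffin-Lim waveform reconstruction.
    return [0.0] * (len(spec) * 275)  # pretend waveform samples

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

spec, t_dec = timed(encode_decode, "some input text")
wav, t_gl = timed(spectrogram2wav, spec)
print(f"encoding_decoding: {t_dec:.2f}s  spectrogram2wav(): {t_gl:.2f}s")
```

In the numbers above, nearly all the time is in the decoder loop, so that is the stage worth profiling first.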
@msobhan69 Is this the time it takes to transform text to speech once the model is trained? In my case it takes 2 minutes to generate the audio. How can I reduce this time? Thanks.