Kyubyong / tacotron

A TensorFlow Implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model
Apache License 2.0

Is tacotron inference real-time? #9

Open msobhan69 opened 7 years ago

msobhan69 commented 7 years ago

I trained the model on a single sample. The results from eval.py are completely noisy, but still recognizable as human speech. Generating one sample takes the following times (input: 80 time steps, about 4 s):

encoding_decoding: 23.3 s
spectrogram2wav(): 0.96 s

This is far slower than real-time. Isn't Tacotron inference real-time?
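
For reference, a minimal sketch of how such per-stage timings can be collected; the session, tensor, and feed-dict names below are placeholders, not this repo's exact API:

```python
import time

def timed_synthesis(sess, mag_tensor, feed_dict, spectrogram2wav):
    # Time the two stages reported above: the encoder/decoder forward pass
    # and the Griffin-Lim-based spectrogram2wav() conversion.
    t0 = time.perf_counter()
    mag = sess.run(mag_tensor, feed_dict=feed_dict)  # encoding + decoding
    t1 = time.perf_counter()
    wav = spectrogram2wav(mag)                       # spectrogram -> waveform
    t2 = time.perf_counter()
    print("encoding_decoding: {:.2f}s".format(t1 - t0))
    print("spectrogram2wav(): {:.2f}s".format(t2 - t1))
    return wav
```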

Kyubyong commented 7 years ago

I don't know, honestly. Does the original paper mention anything about it?

msobhan69 commented 7 years ago

Dear @Kyubyong, the paper only says, "since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods", and nothing more.
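
To make that claim concrete, here is a rough back-of-the-envelope comparison of autoregressive step counts for a 4-second utterance. The hyperparameters are assumptions (12.5 ms frame shift, reduction factor r = 5, 22,050 Hz sample rate), not numbers quoted in the paper or this thread:

```python
# Frame-level vs. sample-level generation for 4 seconds of audio.
duration_s  = 4.0
frame_shift = 0.0125   # seconds per spectrogram frame (assumed)
r           = 5        # frames emitted per decoder step (assumed)
sample_rate = 22050    # audio samples per second (assumed)

frames        = duration_s / frame_shift   # 320 spectrogram frames
decoder_steps = frames / r                 # 64 autoregressive decoder steps
samples       = duration_s * sample_rate   # 88,200 steps for a sample-level
                                           # autoregressive model like WaveNet
print(decoder_steps, samples)              # 64.0 vs 88200.0
```

So "faster than sample-level methods" refers to far fewer sequential steps, which does not by itself guarantee real-time synthesis on a given machine.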

Kyubyong commented 7 years ago

@msobhan69 Thanks. I believe what the paper says is true, but I don't know whether it means Tacotron can generate samples in real time.

edwargl7 commented 5 years ago


@msobhan69 Is this the time it takes to transform text into speech once the model is trained? In my case it takes about 2 minutes to generate the audio. How can I reduce this time? Thanks.
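
One common culprit for multi-minute synthesis is that the eval graph is running on CPU only. A quick, generic diagnostic for TF 1.x (the version this repo targets), not code from this repo:

```python
import tensorflow as tf

# Check whether a GPU is visible to TensorFlow at all.
print("GPU available:", tf.test.is_gpu_available())

# Log which device each op is placed on when the eval session is created,
# so CPU-only placement of the decoder loop shows up in the console.
config = tf.ConfigProto(log_device_placement=True)
sess = tf.Session(config=config)
```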