fatchord / WaveRNN

WaveRNN Vocoder + TTS
https://fatchord.github.io/model_outputs/
MIT License

Computation complexity issue between MOL and RAW #140

Open reasonance1216 opened 4 years ago

reasonance1216 commented 4 years ago

I trained two different versions of the WaveRNN vocoder, each with a different output layer. The first uses RAW (512-way softmax over 9-bit mu-law encoded waveform samples) and the second uses MOL (mixture of logistics, 30-dim output, 16-bit waveform sample target).

After training both networks, I compared the generation times of the two vocoders using xRT (real-time factor). I expected MOL to be much faster than RAW because of its smaller output dimension, but the result was the opposite. On my GPU, the xRTs of MOL and RAW were 2.85 and 1.49, respectively, when vocoding the same spectrogram (672 frames) with batched generation. That means the computational cost of MOL generation is higher than RAW's! Can anybody explain this?
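(For reference: xRT here means wall-clock generation time divided by the duration of the generated audio, so higher is slower. A minimal sketch of the calculation, assuming the repo's default hop_length = 275 and sample_rate = 22050, which are not stated in this thread:)

```python
def real_time_factor(gen_seconds: float, n_frames: int,
                     hop_length: int = 275, sample_rate: int = 22050) -> float:
    """xRT = generation time / duration of the generated audio.

    hop_length and sample_rate are assumed defaults, not values confirmed
    in this thread; xRT > 1 means slower than real time.
    """
    audio_seconds = n_frames * hop_length / sample_rate
    return gen_seconds / audio_seconds

# The 672-frame spectrogram above corresponds to roughly
# 672 * 275 / 22050 ≈ 8.4 seconds of audio.
```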

oytunturk commented 4 years ago

If you check the generation code for MOL, you'll see that it's much more complicated than RAW. It has to sample from a MOL distribution and apply costly functions such as exp, log, etc.

reasonance1216 commented 4 years ago

Of course I checked the generation code for both RAW and MOL; that is exactly what confused me.

I know that sampling from a MOL distribution is more complicated than it looks. But as I understand it, RAW should be the heavier one, because MOL's output dimension (30) is far smaller than RAW's (512). Looking at the MOL generation code in detail, the overall process can be divided into two phases. The first phase picks one logistic distribution out of M (10 in my implementation), which takes about 2 × M logarithm operations. Next, a scalar sample is drawn from the selected logistic's parameters, which takes 2 more logarithm operations and 1 exponential (see the sketch below).
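A simplified sketch of that two-phase sampling step (PyTorch; names and clamping values are illustrative, loosely following the common sample_from_discretized_mix_logistic pattern rather than quoting the repo's code):

```python
import torch
import torch.nn.functional as F

def sample_mol(params: torch.Tensor, log_scale_min: float = -32.23) -> torch.Tensor:
    """Draw one sample per batch element from a mixture of logistics.

    `params` is [B, 3 * n_mix]: mixture logits, means and log-scales.
    Illustrative sketch, not the repo's exact implementation.
    """
    n_mix = params.size(1) // 3
    logit_probs, means, log_scales = torch.split(params, n_mix, dim=1)

    # Phase 1: pick one of the M logistics with the Gumbel-max trick
    # (2 * M log() calls plus an argmax).
    u = torch.empty_like(logit_probs).uniform_(1e-5, 1.0 - 1e-5)
    component = torch.argmax(logit_probs - torch.log(-torch.log(u)), dim=1)
    onehot = F.one_hot(component, n_mix).float()

    # Phase 2: inverse-CDF sample from the selected logistic
    # (2 more log() calls and 1 exp() for the scale).
    mean = (means * onehot).sum(dim=1)
    log_scale = torch.clamp((log_scales * onehot).sum(dim=1), min=log_scale_min)
    u = torch.empty_like(mean).uniform_(1e-5, 1.0 - 1e-5)
    x = mean + torch.exp(log_scale) * (torch.log(u) - torch.log(1.0 - u))
    return torch.clamp(x, -1.0, 1.0)
```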

Meanwhile, the RAW generation code requires 512 exponential operations per generated sample, due to the softmax (a comparable sketch of the RAW step is below). Adding up all the operations, RAW's computational cost looks much heavier to me.
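For comparison, a simplified sketch of the RAW sampling step (again illustrative, close to but not necessarily identical to the repo's generate loop):

```python
import torch
import torch.nn.functional as F

def sample_raw(logits: torch.Tensor, n_classes: int = 512) -> torch.Tensor:
    """Draw one 9-bit mu-law class per batch element from 512-way logits.

    Illustrative sketch; the returned value is still mu-law companded
    and would be expanded back to linear amplitude after the loop.
    """
    posterior = F.softmax(logits, dim=1)   # the 512 exp() ops, one fused softmax call
    distrib = torch.distributions.Categorical(posterior)
    return 2.0 * distrib.sample().float() / (n_classes - 1.0) - 1.0
```

Counting scalar operations this way does make RAW look heavier; note, though, that the 512 exponentials run as one fused softmax call, while the MOL step above is built from many separate small log / exp / argmax / one-hot operations per sample.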

Did I miss something? I would be very thankful if somebody could correct my misunderstanding.