bshall opened this issue 5 years ago
@bshall Well, the main reason for the simplified upsampling was to improve data flow. The upsampling part contains a 5-tap convolution, which requires padding the input mels with at least 2 empty frames on each side. That adds a significant amount of work when doing parallel synthesis (splitting the input mels in time and synthesizing the pieces in parallel, since each piece has to be padded), and one has to be very careful when stitching the padded waveform pieces back together.
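To give a rough idea of the bookkeeping involved, here is a minimal numpy sketch (not the actual library code; `synthesize`, `hop_length=275`, `n_chunks`, and the chunking scheme are just assumptions for illustration):

```python
import numpy as np

def parallel_synthesize(mels, synthesize, hop_length=275, n_chunks=4, pad=2):
    """Split the mels in time, pad each piece with 2 frames of context for the
    5-tap convolution, synthesize the pieces, then trim the extra samples
    before stitching the waveform back together."""
    n_frames = mels.shape[-1]
    pieces = []
    for idx in np.array_split(np.arange(n_frames), n_chunks):
        start, end = idx[0], idx[-1] + 1
        # take real neighbouring frames as context where possible ...
        lo, hi = max(0, start - pad), min(n_frames, end + pad)
        chunk = mels[:, lo:hi]
        # ... and zero-pad at the very beginning and end of the utterance
        left, right = start - lo, hi - end
        chunk = np.pad(chunk, ((0, 0), (pad - left, pad - right)))
        wav = synthesize(chunk)  # assumed: hop_length samples per input frame
        # drop the samples generated from the context frames
        pieces.append(wav[pad * hop_length : len(wav) - pad * hop_length])
    return np.concatenate(pieces)
```

That final trim is where the careful stitching comes in: get the offsets wrong by even a frame and you hear clicks at the chunk boundaries.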
It turned out that the network-based upsampling actually shifts the mels slightly in time, which simple interpolation doesn't do, so the interpolated version produced slightly lower quality speech.
Keep in mind that upsampling is a tiny part of the overall timing. Most of the work is done in the RNN and the post-net FC layers.
I'm starting to think about implementing streaming synthesis for the C++ library (i.e. instead of waiting for all the mel frames to be ready, generate audio as mel frames are added), so I may take another look at upsampling to avoid doing the convolutions.
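Conceptually, the streaming loop would look something like this (a Python sketch of the idea only, not the C++ implementation; `synthesize_frames`, the block size, and the 2-frame context are assumptions tied to the 5-tap conv above):

```python
import numpy as np

class StreamingSynth:
    """Emit audio incrementally as mel frames arrive, keeping just enough
    context around each block to satisfy the 5-tap upsampling convolution."""

    def __init__(self, synthesize_frames, context=2, block=16):
        self.synthesize_frames = synthesize_frames  # mels -> audio for the central frames
        self.context = context  # frames of context needed on each side
        self.block = block      # frames synthesized per call
        self.buffer = []        # pending mel frames

    def push(self, frame):
        """Add one mel frame; return a chunk of audio once a full block
        (plus right-hand context) is buffered, otherwise None."""
        if not self.buffer:
            # no real left context at the very start, so prepend empty frames
            self.buffer = [np.zeros_like(frame)] * self.context
        self.buffer.append(frame)
        needed = self.context + self.block + self.context
        if len(self.buffer) < needed:
            return None
        mels = np.stack(self.buffer[:needed], axis=-1)
        audio = self.synthesize_frames(mels)  # audio for the central `block` frames
        # keep the tail so the next block still has its left-hand context
        self.buffer = self.buffer[self.block:]
        return audio
```

At the end of the stream you would flush whatever is left in the buffer, using empty frames as the missing right-hand context.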
Thanks for the response, @geneing. Yeah, streaming synthesis would be really cool. I was wondering whether simple "nearest" upsampling would be good enough to replace the upsampling network.
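By "nearest" I just mean repeating each mel frame along the time axis instead of running the upsampling network (a sketch, assuming mels of shape `(n_mels, T)` and a hop length of 275):

```python
import numpy as np

def nearest_upsample(mels, hop_length=275):
    """Nearest-neighbour upsampling: repeat every mel frame hop_length times
    so the conditioning features line up with the audio samples."""
    return np.repeat(mels, hop_length, axis=-1)  # (n_mels, T) -> (n_mels, T * hop_length)
```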
Hi @geneing, I was wondering if you made any progress with the streaming synthesis. I'm trying to do something similar to better estimate the inference time / time to first response, and I've already achieved improvements using the very helpful techniques you suggested.
Hi @geneing, thanks for all your hard work! I was wondering why you decided to abandon the simplified upsampling in your model_simplification branch. Was the audio quality significantly worse?