azraelkuan / FFTNet

FFTNet: a Real-Time Speaker-Dependent Neural Vocoder

About the inference speed #2

Closed Maxxiey closed 6 years ago

Maxxiey commented 6 years ago

Hi, thanks for your work, I've got a problem during training when I set the batch_size bigger than 1:

Cannot batch tensors with different shapes in component 0. First element had shape [52480] and element 1 had shape [47872].

It seems that the differing lengths of the wav files are the cause, so I set batch_size to 1 and the problem went away. But that only works around the issue; I would never train with batch_size = 1 for real. Do you have any idea how to fix this? Thank you~
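For reference, variable-length clips are usually batched by padding each one to the longest clip in the batch (in a `tf.data` pipeline this is what `padded_batch` does). A framework-free NumPy sketch of the idea:

```python
import numpy as np

def pad_batch(wavs, pad_value=0.0):
    """Pad variable-length 1-D waveforms to the longest one so they stack into a batch."""
    max_len = max(len(w) for w in wavs)
    batch = np.full((len(wavs), max_len), pad_value, dtype=np.float32)
    for i, w in enumerate(wavs):
        batch[i, :len(w)] = w
    return batch

# The two shapes from the error message above:
batch = pad_batch([np.zeros(52480), np.zeros(47872)])
print(batch.shape)  # (2, 52480)
```

In training you would also keep the original lengths around so the loss can mask out the padded samples.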

PS: in modules.py line 159, did you mean to use tf.nn.leaky_relu()? There is no alpha argument in tf.nn.relu().
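For context, the distinction the PS is pointing at, sketched in NumPy (`tf.nn.leaky_relu` accepts an `alpha` slope for negative inputs; `tf.nn.relu` simply zeroes them):

```python
import numpy as np

def relu(x):
    # Like tf.nn.relu: no alpha argument, negatives become zero.
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.2):
    # Like tf.nn.leaky_relu(features, alpha=0.2): a small slope survives for negatives.
    return np.where(x >= 0, x, alpha * x)

x = np.array([-1.0, 0.0, 2.0])
print(relu(x))        # [0. 0. 2.]
print(leaky_relu(x))  # [-0.2  0.   2. ]
```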

Maxxiey commented 6 years ago

I went through the code again and I figured out the reason. Problem solved, closing this issue now.

Maxxiey commented 6 years ago

I have trained this model for over 100k iterations; training is surprisingly fast. But when I try to synthesize a wav file, inference is not as fast as I expected: a 17 s wav takes ~20 minutes to synthesize. Has anyone gotten better performance?

azraelkuan commented 6 years ago

Sorry for the late reply. I also found that the buffer cannot accelerate generation; maybe we need to write a CUDA op? I tested other repos as well, and their speed is also very slow.

Maxxiey commented 6 years ago

@azraelkuan Thank you very much.

I tried some other repos too, with the same low speed, so I guess we are all missing the trick for fast generation. By the way, could you please tell me why you use lws to preprocess the waveform in your repo? What is the difference between lws.stft and librosa.stft? I tried to train your model on mels extracted with librosa, but I got nothing but noise, and I suspect it has something to do with the preprocessing.

Thanks~ max

azraelkuan commented 6 years ago

There is not much difference between lws and librosa's stft; lws is just a faster STFT implementation. Maybe you should check the frame length and hop length?
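One common source of a frame-count or alignment mismatch between two STFT front ends is center padding: librosa.stft pads the signal by half a frame on each side by default, while an lws-style analysis typically does not. A small sketch of how that changes the number of frames (the parameter values here are illustrative, not the repo's hparams):

```python
def num_frames(n_samples, frame_length, hop_length, center=True):
    """Frame count for an STFT over n_samples.

    center=True mimics librosa's default, which pads frame_length // 2
    samples on each side; center=False mimics an un-padded analysis.
    """
    if center:
        n_samples += frame_length  # frame_length // 2 of padding on both sides
    return 1 + (n_samples - frame_length) // hop_length

# e.g. 1 s of 16 kHz audio, 25 ms frames, 10 ms hop:
print(num_frames(16000, 400, 160, center=True))   # 101 frames (centered)
print(num_frames(16000, 400, 160, center=False))  # 98 frames (no centering)
```

If the mels and the conditioning network disagree on frame/hop length or centering, every conditioning frame lands on the wrong audio samples, which can easily produce pure noise.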

Maxxiey commented 6 years ago

Okay, I will check the hyperparams. Closing this issue now; thanks for the quick reply.

Maxxiey commented 6 years ago

@azraelkuan Hello again, I am a little confused about the following code in cmu_arctic.py:

```python
if hparams.use_injected_noise:
    noise = np.random.normal(0.0, 1.0 / hparams.quantize_channels, wav.shape)
    wav += noise
...
if hparams.rescaling:
    wav = wav / np.abs(wav).max() * hparams.rescaling_max
```

According to my humble understanding, the first part injects noise into the raw wav, and the second part performs a normalization that scales the wav's values into [-1, 1].

However, if I am getting it right, np.abs(wav).max() varies, since two different wav clips will very likely have different max values. So if we add noise first and then normalize, the noise distribution may change from N(0, 1/256) to something else.

I think the right order is to normalize the wav first and then inject the noise, so that the rescaling does not change the noise distribution.
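To make the effect concrete, here is a standalone NumPy sketch (the parameter values quantize_channels=256 and rescaling_max=0.999 are illustrative, not read from the repo's hparams):

```python
import numpy as np

quantize_channels = 256
rescaling_max = 0.999
target_std = 1.0 / quantize_channels  # intended noise std: 1/256

rng = np.random.default_rng(0)
wav = 0.25 * np.sin(np.linspace(0.0, 100.0, 48000))  # a quiet clip, peak ~0.25

# Order in the script: inject noise, then rescale -> the noise gets scaled too.
noisy = wav + rng.normal(0.0, target_std, wav.shape)
scale = rescaling_max / np.abs(noisy).max()
print(target_std * scale)  # effective noise std after rescaling, no longer 1/256
```

Rescaling after injection multiplies the noise std by rescaling_max / max(|wav + noise|), so quieter clips end up with proportionally louder noise; normalizing first keeps the std at exactly 1/quantize_channels for every clip.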

What is your opinion?

Thanks in advance~ max

azraelkuan commented 6 years ago

Yes, I think you are right. Thanks.

Maxxiey commented 6 years ago

Hey, quick update here: I tried changing the order of the two steps, but embarrassingly the loss went to NaN immediately. I am focusing on something else right now and do not have time to figure out why; if you have any ideas, please let me know~ thanks