The synthesis rate is slow

fatchord / WaveRNN

WaveRNN Vocoder + TTS

https://fatchord.github.io/model_outputs/

MIT License


The synthesis rate is slow #89

Closed: primejava closed this issue 5 years ago

primejava commented 5 years ago

I set sample_rate to 48000 and my GPU has 16 GB of memory, but it takes about a minute to synthesize five seconds of audio. Also, when I train, the loss drops to 1.3 and then stops decreasing. Here are the hyperparameters I set. Can you tell me if anything is wrong? Thank you very much.

sample_rate = 48000
n_fft = 4096
fft_bins = n_fft // 2 + 1
num_mels = 160
hop_length = int(sample_rate * 0.0125)  # 12.5ms
win_length = int(sample_rate * 0.05)  # 50ms
fmin = 125
min_level_db = -120
ref_level_db = 20
seq_len = hop_length * 5

model = Model(rnn_dims=512, fc_dims=512, bits=bits, pad=2, upsample_factors=(5, 5, 24), feat_dims=160, compute_dims=128, res_out_dims=128, res_blocks=10).to(device)
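As a quick sanity check on the settings above, here is a minimal sketch that evaluates the derived values and compares the product of `upsample_factors` with `hop_length`. It assumes the expressions were meant as `sample_rate * 0.0125` and `sample_rate * 0.05`, and that the upsampling network should expand one mel frame into exactly `hop_length` samples; `total_upsampling` is a name made up for this illustration.

```python
from math import prod

# Values copied from the hyperparameters in the post above.
sample_rate = 48000
n_fft = 4096
upsample_factors = (5, 5, 24)

# Derived values (assuming the missing operators are multiplications).
fft_bins = n_fft // 2 + 1               # number of linear-frequency bins
hop_length = int(sample_rate * 0.0125)  # samples per 12.5 ms frame shift
win_length = int(sample_rate * 0.05)    # samples per 50 ms analysis window
seq_len = hop_length * 5                # training sequence length in samples

# Assumption: each mel frame must be upsampled to hop_length audio samples,
# so the factors should multiply to hop_length.
total_upsampling = prod(upsample_factors)

print("fft_bins:", fft_bins)
print("hop_length:", hop_length)
print("win_length:", win_length)
print("seq_len:", seq_len)
print("upsampling matches hop_length:", total_upsampling == hop_length)
```

With these numbers the factors multiply to the same value as `hop_length`, so on this check the configuration looks internally consistent.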

oytunturk commented 5 years ago

I'm not sure about the slowness, but it could simply be due to the higher sampling rate, 2x mels, and 2x FFT size. I'd first try num_mels=80 and n_fft=2048 to see if that helps with speed. Also, why do you use fmin=125 Hz? It's probably too high and removes valuable f0 and first formant information from the spectrum. That might explain why the loss doesn't drop.
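To put the reported speed in numbers, here is a rough back-of-the-envelope sketch. It assumes the vocoder generates one audio sample per sequential step (no batched generation) and takes "about a minute" as exactly 60 seconds; the helper name `samples_to_generate` is hypothetical.

```python
def samples_to_generate(duration_s: float, sample_rate: int) -> int:
    """Number of audio samples in duration_s seconds at sample_rate."""
    return int(round(duration_s * sample_rate))


audio_s = 5.0       # length of the synthesized audio, from the report
synthesis_s = 60.0  # assumed wall-clock time ("about a minute")

n_samples = samples_to_generate(audio_s, 48000)
samples_per_second = n_samples / synthesis_s
real_time_factor = synthesis_s / audio_s

print("samples to generate:", n_samples)
print("generation rate (samples/s):", samples_per_second)
print("real-time factor:", real_time_factor)

# For comparison, the same five seconds at half the sampling rate
# (an illustrative value, not a setting taken from the thread).
print("samples at 24 kHz:", samples_to_generate(audio_s, 24000))
```

Under that assumption the number of sequential steps grows linearly with the sampling rate, so halving the rate would roughly halve the work for the same duration.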

primejava commented 5 years ago

@oytunturk Because the sample rate of my voice dataset is 48000, I set the above parameters (such as fmin = 125) in Tacotron2 (https://github.com/Rayhane-mamah/Tacotron-2). Then I set the same parameters in this repo. I will try the default parameters. Thank you for your help.

oytunturk commented 5 years ago

I don't think removing all frequencies under 125Hz is a wise choice for speech no matter what the sampling rate is.

bayesrule commented 5 years ago

@oytunturk But some papers did this (e.g. the original Tacotron2 paper cut frequencies below 125Hz and above 7.6kHz). It is assumed that for a high-pitched female voice, 125Hz might be a safe bottom.

oytunturk commented 5 years ago

Correct, it might be OK for high-pitched female voices and probably when the recordings have low frequency noise as in some audiobook recordings. It doesn't seem to work well for male or lower pitched female voices. Overall, I think it's worth considering as a hyperparameter to tune if you are facing quality and accuracy issues.
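The trade-off discussed here can be sketched with a little arithmetic. The snippet below counts how many FFT bins sit below `fmin` for the 48 kHz settings from the original post, and estimates what share of a speaker's f0 range would fall under the cutoff. The f0 ranges are rough, commonly quoted figures used purely as assumptions for illustration, and `share_below` is a hypothetical helper.

```python
sample_rate = 48000
n_fft = 4096
fmin = 125.0

# Frequency spacing between neighbouring FFT bins.
bin_width_hz = sample_rate / n_fft

# Bins whose centre frequency lies below fmin.
bins_below_fmin = [k for k in range(n_fft // 2 + 1) if k * bin_width_hz < fmin]


def share_below(f0_low: float, f0_high: float, cutoff: float) -> float:
    """Fraction of the interval [f0_low, f0_high] that lies below cutoff."""
    covered = min(max(cutoff - f0_low, 0.0), f0_high - f0_low)
    return covered / (f0_high - f0_low)


# Assumed, approximate f0 ranges in Hz (illustrative only).
f0_ranges = {"lower-pitched voice": (85.0, 155.0),
             "higher-pitched voice": (165.0, 255.0)}

print("bin width (Hz):", bin_width_hz)
print("bins below fmin:", len(bins_below_fmin))
for name, (low, high) in f0_ranges.items():
    print(name, "share of f0 range below fmin:",
          round(share_below(low, high, fmin), 2))
```

Under these assumed ranges the higher-pitched voice keeps its whole f0 range above the cutoff, while a sizeable part of the lower-pitched range falls below it, which fits the suggestion to treat fmin as a tunable hyperparameter.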

primejava commented 5 years ago

I tried again using the default hyperparameters. Now at step 195k the loss is 1.416, which is worse. (╥╯^╰╥)