chomeyama / SiFiGAN

Official implementation of the source-filter HiFiGAN vocoder

What about the RTF on CPU? #1

Closed · Liujingxiu23 closed this issue 2 years ago

Liujingxiu23 commented 2 years ago

I see you claimed "achieving better voice quality and faster synthesis speed on a single CPU". Can you share a value, or the speed compared to the original HiFi-GAN?

chomeyama commented 2 years ago

Sorry, I forgot to mention that. The RTFs of the original HiFi-GAN were 0.84 on the CPU and 0.003 on the GPU.
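For reference, RTF is simply wall-clock synthesis time divided by the duration of the generated audio. A minimal sketch of how one might measure it in PyTorch; `measure_rtf`, `model`, and `features` are placeholder names for illustration, not names from this repo:

```python
import time

import torch

def measure_rtf(model, features, sample_rate, n_runs=10):
    """Real-time factor = synthesis wall-clock time / generated audio duration."""
    model.eval()
    with torch.no_grad():
        audio = model(features)  # warm-up run so lazy initialization does not skew timing
        start = time.perf_counter()
        for _ in range(n_runs):
            audio = model(features)
            # On GPU, call torch.cuda.synchronize() here before reading the timer.
        elapsed = (time.perf_counter() - start) / n_runs
    duration = audio.shape[-1] / sample_rate  # seconds of generated audio
    return elapsed / duration  # RTF < 1.0 means faster than real time
```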

Liujingxiu23 commented 2 years ago

Thank you for your reply! But I'm sorry, I phrased my question poorly: what are the RTF values of SiFiGAN on CPU and GPU?

chomeyama commented 2 years ago

Sorry, but I reported all the RTFs on the demo site and in the paper! Please check them! https://chomeyama.github.io/SiFiGAN-Demo/

Liujingxiu23 commented 2 years ago

Sorry for my carelessness. What if I use mels as the input features instead of mgc and bap? Do you have any suggestions? My focus is speech and singing synthesis, where mel is used more often.

chomeyama commented 2 years ago

In my experience, source-filter-based neural vocoders generally achieve higher performance with vocoder features (disentangled ones) than with mel. I believe this is because mel contains F0 information, which tends to prevent the separation of the source and filter networks.
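For context, the vocoder features in question are WORLD-style disentangled parameters (F0, spectral envelope, aperiodicity). A minimal extraction sketch using pyworld and pysptk; the file name, cepstral order, and all-pass constant `alpha` are illustrative assumptions, not necessarily this repo's settings:

```python
import numpy as np
import pysptk
import pyworld
import soundfile as sf

x, fs = sf.read("sample.wav")  # hypothetical input file
x = x.astype(np.float64)       # pyworld expects float64

f0, sp, ap = pyworld.wav2world(x, fs)          # F0, spectral envelope, aperiodicity
mgc = pysptk.sp2mc(sp, order=59, alpha=0.466)  # mel-generalized cepstrum (alpha ~ 24 kHz)
bap = pyworld.code_aperiodicity(ap, fs)        # band aperiodicity
```

Because F0 is carried separately, the filter-side features (mgc, bap) stay largely pitch-independent, which is the disentanglement referred to above.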

Liujingxiu23 commented 2 years ago

@chomeyama Thank you for your reply! I merged your code into my VISinger code, which uses z, a latent feature, as the acoustic parameter to train the vocoder model. Training is in progress; I will compare the generated samples with the original HiFi-GAN as well as the HiFi-GAN version used in DiffSinger. However, I changed the hop_size to 256. Should I change the parameters that generate the dense factors (dfs)? Another question: have you used WORLD parameters for singing voice conversion? I wonder whether WORLD parameters are more stable than mels for VC. Is there any code or paper to refer to? Thanks again.

chomeyama commented 2 years ago

> I merged your code into my VISinger code

That's nice!

> Should I change the parameters that generate the dense factors (dfs)?

I suppose the sampling rate is 24 kHz or higher and the upsampling factors are [8, 8, 2, 2], [8, 4, 4, 2], or something similar. If so, I think the default dense factors would be fine, because the representable maximum frequencies of the PDCNNs will be able to cover the F0 range of singing voices.
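To make the representable-frequency argument concrete: following the pitch-dependent dilation idea of QPPWG/SiFiGAN, the dilation at a layer is roughly its effective sampling rate divided by (F0 × dense factor), so each layer can track F0 up to rate / dense_factor. A rough sanity check; the scales, dense factors, and F0 ceiling below are illustrative assumptions, not this repo's exact defaults:

```python
import numpy as np

sample_rate = 24000
upsample_scales = [8, 4, 4, 2]  # illustrative upsampling factors (product = hop size 256)
dense_factors = [0.5, 1, 4, 8]  # illustrative per-layer dense factors
max_expected_f0 = 1000.0        # rough upper end of singing F0, in Hz

rate = sample_rate / np.prod(upsample_scales)  # rate at the network input
for scale, dense in zip(upsample_scales, dense_factors):
    rate *= scale              # effective rate after this upsampling block
    max_f0 = rate / dense      # highest F0 this layer's PDCNN can represent
    status = "ok" if max_f0 >= max_expected_f0 else "too low"
    print(f"rate {rate:8.0f} Hz -> max representable F0 {max_f0:7.1f} Hz ({status})")
```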

> Have you used WORLD parameters for singing voice conversion?

Sorry, but I have never tried singing voice conversion and don't have any useful knowledge about it...

Liujingxiu23 commented 2 years ago

I use 16 kHz and a config similar to HiFi-GAN's so that it runs faster than real time on CPU.
It may take a relatively long time to compare the results. Thanks again!

Liujingxiu23 commented 2 years ago

My experiments are all done. The original version of HiFi-GAN is the worst. For the waves generated by SiFiGAN and the vocoder version used in DiffSinger, I can easily hear the differences between them, but as an AI engineer without musical knowledge, I cannot tell which one is better. However, one of my colleagues, whose major is music, says SiFiGAN is much better on intonation/pitch.

Then I tried to use SiFiGAN for streaming inference with a sliding window with padding on CPU. I found that SiFiGAN has a much larger receptive field than the original HiFi-GAN (perhaps the receptive field of the source network is larger), so SiFiGAN is not suitable for streaming inference. But with normal inference on GPU, everything is fine. To illustrate the setup, see the sketch below.
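The streaming setup slides a window over feature frames with context padding on both sides, where the padding must cover the model's receptive field. A minimal sketch; `model`, `window`, and `context` are placeholder names, and the model is assumed to emit hop_size audio samples per input frame:

```python
import torch

def stream_synthesize(model, features, hop_size, window=32, context=64):
    """Sliding-window inference; `context` frames must cover the receptive field."""
    chunks = []
    n_frames = features.shape[-1]
    with torch.no_grad():
        for start in range(0, n_frames, window):
            lo = max(0, start - context)
            hi = min(n_frames, start + window + context)
            audio = model(features[..., lo:hi])
            # Keep only the samples belonging to the current window.
            offset = (start - lo) * hop_size
            length = (min(start + window, n_frames) - start) * hop_size
            chunks.append(audio[..., offset:offset + length])
    return torch.cat(chunks, dim=-1)

# The larger the receptive field, the larger `context` must be, so each chunk
# recomputes more overlap; this is why a big receptive field hurts CPU streaming.
```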

Thank you for your great work and sharing again!

chomeyama commented 2 years ago

Great! Thank you for testing SiFiGAN and sharing your informative results!