16khz for inference vs 22khz?

RVC-Project / Retrieval-based-Voice-Conversion-WebUI

Easily train a good VC model with voice data <= 10 mins!

MIT License

16khz for inference vs 22khz? #519

Closed kalomaze closed 7 months ago

kalomaze commented 1 year ago

I notice 16 kHz seems to be hard-coded for the downsampling phase. I think supporting configurable sample rates for inference audio would make sense. librosa's default sample rate is 22,050 Hz rather than 16 kHz. Migrating to this standard may increase inference time, but there are potential gains, especially for words pronounced quickly. 16 kHz, to me, seems slightly too low to cover the entire range of human speech.
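For a rough sense of the trade-off raised above, the sketch below compares the two rates. It only does arithmetic: the highest representable frequency is half the sample rate (the Nyquist limit), and the number of samples per second grows in proportion to the rate. The function names are hypothetical, and treating processing cost as proportional to sample count is a simplifying assumption, not a measurement of RVC.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def relative_sample_count(new_rate_hz, base_rate_hz=16000):
    """How many times more samples per second new_rate_hz produces.

    Assumption: used here as a crude proxy for extra processing work.
    """
    return new_rate_hz / base_rate_hz


for rate in (16000, 22050):
    print(
        f"{rate} Hz: content up to {nyquist_hz(rate):.0f} Hz, "
        f"{relative_sample_count(rate):.3f}x the samples of 16 kHz"
    )
```

Under these assumptions, 22,050 Hz extends the representable band from 8 kHz to about 11 kHz while producing roughly 38% more samples per second of audio.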

https://analyticsindiamag.com/hands-on-guide-to-librosa-for-handling-audio-files/#:~:text=The%20sampling%20rate%20is%20nothing,by%20your%20desired%20sampling%20rate

RVC-Boss commented 1 year ago

It's because the pretrained feature encoder we use only supports 16 kHz. In Automatic Speech Recognition (ASR) tasks, there is no pretrained feature encoder that takes audio above 16 kHz as input. Increasing the input audio sample rate may not improve ASR accuracy.

tea6329714 commented 1 year ago

Hi RVC-Boss, if the pretrained feature encoder only supports 16 kHz, why didn't you use the vocoder (HiFi-GAN) to generate 16 kHz audio? On the main branch, I only found vocoder models that generate audio at 40 kHz and 48 kHz. As I understand it, since the encoder only supports 16 kHz, this method blocks or filters out audio components above 8 kHz, and the vocoder consumes encoder features that only contain information up to 8 kHz. So my question is why you use the vocoder (HiFi-GAN) to generate audio at 40 kHz/48 kHz instead of 16 kHz. BTW, I notice that the output audio from the 40 kHz/48 kHz vocoders only contains audio information below 8 kHz.
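The 8 kHz ceiling mentioned in this comment follows from the Nyquist limit, and a minimal pure-Python sketch can illustrate it. Once sampled at 16 kHz, a 10 kHz tone produces exactly the same samples as a 6 kHz tone, so nothing above 8 kHz can be distinguished in 16 kHz audio. The helper names are hypothetical. Note that a real resampler normally low-pass filters before downsampling, so content above 8 kHz is removed rather than folded back; this sketch only shows why that content cannot survive.

```python
import math


def folded_frequency_hz(tone_hz, sample_rate_hz):
    """Frequency that a sampled tone is indistinguishable from (in 0..sample_rate/2)."""
    nearest_multiple = round(tone_hz / sample_rate_hz) * sample_rate_hz
    return abs(tone_hz - nearest_multiple)


def samples_match(tone_a_hz, tone_b_hz, sample_rate_hz, count=64):
    """Check whether two cosine tones yield the same samples at this rate."""
    for n in range(count):
        t = n / sample_rate_hz
        a = math.cos(2 * math.pi * tone_a_hz * t)
        b = math.cos(2 * math.pi * tone_b_hz * t)
        if not math.isclose(a, b, abs_tol=1e-9):
            return False
    return True


sample_rate = 16000
for tone in (5000, 10000):
    print(tone, "Hz appears as", folded_frequency_hz(tone, sample_rate), "Hz")

# A 10 kHz tone and a 6 kHz tone are identical once sampled at 16 kHz.
print(samples_match(10000, 6000, sample_rate))
```

A 5 kHz tone is below the limit and stays at 5 kHz, while the 10 kHz tone is not representable at this rate.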

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 15 days since being marked as stale.