Closed kalomaze closed 7 months ago
It's because the pretrained feature encoder we use only supports 16 kHz. In automatic speech recognition (ASR), no pretrained feature encoder takes audio above 16 kHz as input. Increasing the input sample rate probably doesn't improve ASR accuracy.
Hi RVC-Boss, if the pretrained feature encoder only supports 16 kHz, why not have the vocoder (HiFi-GAN) generate 16 kHz audio? On the main branch, I only found vocoder models that generate audio at 40 kHz and 48 kHz. As I understand it, because the encoder only supports 16 kHz, this approach filters out any audio components above 8 kHz, and the vocoder is driven by encoder features that only contain information up to 8 kHz. So my question is why you use the vocoder (HiFi-GAN) to generate audio at 40 kHz/48 kHz instead of 16 kHz. BTW, I noticed that the output audio from the 40 kHz/48 kHz vocoders only contains audio information below 8 kHz.
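The 8 kHz ceiling mentioned above follows from the Nyquist limit: audio sampled at 16 kHz can only represent frequencies up to half that rate. The snippet below is a minimal sketch, not code from the repository. The helper names `nyquist_hz` and `sample_tone` are made up for illustration. It shows that a 10 kHz tone sampled at 16 kHz yields exactly the same samples as a 6 kHz tone, so anything above 8 kHz cannot be distinguished once the audio is at 16 kHz.

```python
import math

def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2

def sample_tone(freq_hz, sample_rate_hz, n_samples):
    """Sample a cosine of freq_hz at sample_rate_hz (hypothetical helper)."""
    return [
        math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]

SR = 16000  # the encoder's input rate, as stated in the thread

# 10 kHz is above the 8 kHz Nyquist limit, so it aliases to 16 - 10 = 6 kHz.
tone_10k = sample_tone(10000, SR, 64)
tone_6k = sample_tone(6000, SR, 64)

# Largest sample-by-sample difference between the two tones.
max_diff = max(abs(a - b) for a, b in zip(tone_10k, tone_6k))

print("Nyquist limit at 16 kHz:", nyquist_hz(SR))
print("10 kHz and 6 kHz tones identical at 16 kHz:", max_diff < 1e-9)
```

A real resampler applies a low-pass filter before downsampling, so content above 8 kHz is removed rather than aliased. Either way, the encoder never sees it.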
This issue was closed because it has been inactive for 15 days since being marked as stale.
I noticed that 16 kHz seems to be hard-coded for the downsampling phase. I think it would make sense to support configurable sample rates for inference audio. With librosa, the default for loading audio is closer to 22 kHz (22,050 Hz) than 16 kHz. Migrating to that rate may slow down inference, but there are potential gains, especially for words that are pronounced quickly. To me, 16 kHz seems slightly too low to cover the entire range of human speech.
https://analyticsindiamag.com/hands-on-guide-to-librosa-for-handling-audio-files/#:~:text=The%20sampling%20rate%20is%20nothing,by%20your%20desired%20sampling%20rate
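As a rough illustration of the request above, the sketch below takes the target sample rate as a parameter instead of fixing it at 16000. It is not the repository's code. `resample_linear` is a hypothetical helper, and naive linear interpolation (with no anti-aliasing filter) stands in for a proper resampler such as `librosa.resample`. Per the maintainer's reply, the pretrained encoder only accepts 16 kHz, so a different rate would presumably only help with an encoder trained for it.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Resample a list of floats by linear interpolation.

    Illustrative only: there is no low-pass filtering, so downsampling
    real audio this way would alias.
    """
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("sample rates must be positive")
    if not samples:
        return []
    n_out = max(1, round(len(samples) * target_sr / orig_sr))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr   # position in the input signal
        lo = min(int(pos), last)        # input sample at or before pos
        hi = min(lo + 1, last)          # next input sample, clamped at the end
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# One second of silence at 48 kHz, resampled to two candidate rates.
one_second = [0.0] * 48000
at_16k = resample_linear(one_second, 48000, 16000)
at_22k = resample_linear(one_second, 48000, 22050)

# The 22.05 kHz version carries about 1.38x as many samples per second,
# which is where the extra inference cost would come from.
print(len(at_16k), len(at_22k), len(at_22k) / len(at_16k))
```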