Hugging Face Space: Aliasing in audio

brentspell / hifi-gan-bwe

Unofficial implementation of HiFi-GAN+ from the paper "Bandwidth Extension is All You Need" by Su, et al.

MIT License


Hugging Face Space: Aliasing in audio #1

Closed. torridgristle closed this issue 2 years ago.

torridgristle commented 2 years ago

tl;dr: I suspect that in the Hugging Face Space demo the input audio is being resampled without a lowpass filter (or otherwise improperly) before it goes to hifi-gan-bwe. I'm not sure whether the issue also exists in the repo.

I don't know whether this issue is present in this code or only in the Hugging Face interface, but the Hugging Face Space demo appears to have a problem with aliasing. If I feed it the same sound at two sample rates, one at the original sample rate of 20825 Hz (E-mu Drumulator sample data, if you're curious about the odd value) and one resampled to 48 kHz, the outputs are very different. The 20825 Hz version clearly shows the vertically mirrored spectrogram that is characteristic of aliasing, while the 48 kHz version doesn't. I also tried 24 kHz, since it seemed to be a sample rate used in training, but that output also looked like it might have aliasing.
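To make the "mirrored spectrogram" concrete, here is a minimal sketch (not taken from the repo) of where the copies of a tone would be expected to land if the input is upsampled without an adequate lowpass filter. It assumes the standard result that unfiltered upsampling leaves images at k·fs_in ± f, and it lists only the images that fall below the output Nyquist frequency; images above that limit would fold back and are ignored here. The function name and the 5000 Hz example tone are hypothetical.

```python
def image_frequencies(f_hz, fs_in, fs_out):
    """Return the spectral images of a tone at f_hz (sampled at fs_in)
    that land below the Nyquist frequency of the output rate fs_out.

    Assumes unfiltered upsampling, where images sit at k*fs_in +/- f_hz.
    """
    nyquist_out = fs_out / 2
    images = []
    k = 1
    # The lower image of each pair (k*fs_in - f_hz) is checked first;
    # once it passes the output Nyquist, no later image can fit.
    while k * fs_in - f_hz < nyquist_out:
        for candidate in (k * fs_in - f_hz, k * fs_in + f_hz):
            if candidate < nyquist_out:
                images.append(candidate)
        k += 1
    return images


# A 5000 Hz tone in a 20825 Hz recording, upsampled to 48 kHz:
print(image_frequencies(5000, 20825, 48000))  # [15825]

# A 7000 Hz tone in a 16 kHz recording, upsampled to 48 kHz:
print(image_frequencies(7000, 16000, 48000))  # [9000, 23000]
```

The 15825 Hz image is the 5000 Hz tone reflected about the original Nyquist frequency (10412.5 Hz), which matches the vertical mirroring described above.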

sounds_input_output.zip

torridgristle commented 2 years ago

16vs48 7000 to 20 hz test.zip

I ran a test with a sine sweep from 7000 Hz down to 20 Hz at a 16 kHz sample rate and at a 48 kHz sample rate (resampled from the 16 kHz version; no aliasing is present in the original) to get a clearer visualization of any aliasing.
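For anyone who wants to reproduce a similar test signal, a sweep like this can be sketched with the Python standard library alone. The thread does not say whether the sweep was linear or logarithmic, how long it was, or how loud it was, so the linear shape, the 2 second duration, the 0.5 amplitude, and the function names below are all assumptions.

```python
import math
import struct
import wave


def linear_sweep(f_start, f_end, duration_s, fs, amplitude=0.5):
    """Generate a linear sine sweep from f_start to f_end (Hz) as a list of floats."""
    n = int(duration_s * fs)
    rate = (f_end - f_start) / duration_s  # sweep rate in Hz per second
    samples = []
    for i in range(n):
        t = i / fs
        # The phase is the integral of the instantaneous frequency f_start + rate*t.
        phase = 2 * math.pi * (f_start * t + 0.5 * rate * t * t)
        samples.append(amplitude * math.sin(phase))
    return samples


def write_wav_16bit(path, samples, fs):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(fs)
        wav.writeframes(struct.pack("<%dh" % len(ints), *ints))


# 7000 Hz down to 20 Hz at a 16 kHz sample rate (duration is an assumption).
sweep_16k = linear_sweep(7000, 20, duration_s=2.0, fs=16000)
print(len(sweep_16k))  # 32000
```

Calling `write_wav_16bit("sweep_16k.wav", sweep_16k, 16000)` (the file name is hypothetical) would produce a file to feed to the demo. Note that the 48 kHz version in the test above was resampled from the 16 kHz file rather than generated directly, so it would need a separate, properly filtered resampling step.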

brentspell commented 2 years ago

Thanks for the samples and the report! I ran a quick test in colab and I was able to see exactly what you're talking about, so I don't think the problem is restricted to the huggingface demo. I want to dig into this a bit further, then I'll share my colab notebook.

One thing that may be going on here is that the original paper uses simple linear interpolation (instead of band-limited interpolation with a low-pass filter) to upsample the signal before running it through the WaveNet layers. This is a common approach for many super-resolution models, and ideally the model would learn to clean up any aliasing or other artifacts introduced during interpolation. However, it may be better to use a real audio resampler on the frontend before passing the signal to the network. If I get some time, I'll try this out and see if it produces better results for these examples.
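The difference between the two upsampling approaches can be illustrated with a small, self-contained sketch. This is not the repo's code: it upsamples a 7 kHz tone from 16 kHz to 48 kHz, once with linear interpolation and once with a Hann-windowed sinc kernel (one common way to do band-limited interpolation), and then measures how much energy ends up at the 9 kHz image. The kernel width, the test tone, and all function names are assumptions chosen for illustration.

```python
import math


def tone(freq, fs, n):
    """Unit-amplitude sine of n samples."""
    return [math.sin(2 * math.pi * freq * i / fs) for i in range(n)]


def upsample_linear(x, factor):
    """Upsample by an integer factor using linear interpolation."""
    out = []
    for i in range(len(x) - 1):
        for j in range(factor):
            frac = j / factor
            out.append((1 - frac) * x[i] + frac * x[i + 1])
    out.append(x[-1])
    return out


def upsample_sinc(x, factor, half_width=32):
    """Upsample by an integer factor using a Hann-windowed sinc kernel."""
    out = []
    for m in range((len(x) - 1) * factor + 1):
        t = m / factor  # output position measured in input samples
        center = math.floor(t)
        acc = 0.0
        for k in range(center - half_width + 1, center + half_width + 1):
            if 0 <= k < len(x):
                d = t - k
                sinc = 1.0 if d == 0 else math.sin(math.pi * d) / (math.pi * d)
                hann = 0.5 * (1 + math.cos(math.pi * d / half_width))
                acc += x[k] * sinc * hann
        out.append(acc)
    return out


def magnitude_at(x, freq, fs):
    """Amplitude of the component at freq, via a single-bin DFT."""
    re = sum(v * math.cos(2 * math.pi * freq * i / fs) for i, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq * i / fs) for i, v in enumerate(x))
    return 2 * math.hypot(re, im) / len(x)


fs_in, factor = 16000, 3
fs_out = fs_in * factor
x = tone(7000, fs_in, 640)

# Analyze only the middle 960 output samples to avoid edge effects;
# 7 kHz and 9 kHz both complete a whole number of cycles in that span.
mid = slice(480, 1440)
lin = upsample_linear(x, factor)[mid]
snc = upsample_sinc(x, factor)[mid]

image_linear = magnitude_at(lin, 9000, fs_out)
image_sinc = magnitude_at(snc, 9000, fs_out)
tone_sinc = magnitude_at(snc, 7000, fs_out)

print("9 kHz image, linear interpolation:", round(image_linear, 4))
print("9 kHz image, windowed sinc:       ", round(image_sinc, 4))
print("7 kHz tone,  windowed sinc:       ", round(tone_sinc, 4))
```

In this toy setup the linear interpolator leaves a strong 9 kHz image, while the windowed sinc largely suppresses it and keeps the 7 kHz tone close to full amplitude. In practice, an existing resampler such as `scipy.signal.resample_poly` would be a more practical choice than a hand-written kernel.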

Thanks again for the feedback!

brentspell commented 2 years ago

I can't say it's a complete solution, but substituting a band-limited interpolator for the old upsampler appears to reduce the reflections in the sweep (especially below the Nyquist frequency of the original sample). This change should be in the latest version of the code, and I published a bwe-10 model that uses it. The Hugging Face demo has also been upgraded to this version. I think further improvements would have to come from modeling changes or from training on more diverse audio datasets (not just speech).