Will this work for singing voice conversion (svc)?

billnye2 commented 9 months ago

Great repo! Ran some tests with it and it sounds good for speech, but the limited testing I did for singing didn't sound too great. Is this expected / is there a way to adapt it to work well with singing? Perhaps switch it to use NSF-HiFiGAN as so-vits-svc does?

P.S. I especially like the zero-shot any-to-any nature of this model; I'm not sure whether there are other projects out there now for zero-shot SVC.

RF5 commented 9 months ago

Hi @billnye2, thanks for your comments :). Some thoughts:

Yep, we also found that kNN-VC doesn't do well with singing, especially for more expressive / melodic songs.
This is largely expected, since the two trained parts of kNN-VC (the WavLM encoder and the HiFiGAN vocoder) are both trained only on English LibriSpeech, which is fairly monotone. Both hurt quality when given singing inputs. The kNN part is quite agnostic to singing vs. non-singing vs. non-human sounds, so the main limitation likely comes from the feature encoding and vocoding side.
To fix the WavLM side, I think one would need to retrain WavLM with some singing data added, so that the features it produces better represent singing audio.
To fix the HiFiGAN vocoding side, using NSF-HiFiGAN would likely improve things. I imagine this would require training an NSF-HiFiGAN model to vocode WavLM features (instead of spectrograms), after which it could be used directly with kNN-VC.
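To illustrate why the kNN step itself is agnostic to the kind of audio, here is a minimal sketch of frame-wise kNN matching. It is not the repository's implementation: the names `cosine_similarity` and `knn_match`, the use of cosine similarity, and the default `k=4` are assumptions chosen for illustration, and plain Python lists stand in for the encoder features.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def knn_match(query_frames, matching_set, k=4):
    """Replace each query frame with the mean of its k most similar
    frames from the matching set (the target speaker's features).

    Hypothetical helper for illustration; k=4 is an assumed default.
    """
    if not matching_set:
        raise ValueError("matching_set must contain at least one frame")
    k = min(k, len(matching_set))
    converted = []
    for q in query_frames:
        # Rank target frames from most to least similar to this query frame.
        ranked = sorted(
            matching_set,
            key=lambda m: cosine_similarity(q, m),
            reverse=True,
        )
        nearest = ranked[:k]
        dim = len(nearest[0])
        # Average the k nearest target frames dimension by dimension.
        converted.append([sum(f[d] for f in nearest) / k for d in range(dim)])
    return converted


if __name__ == "__main__":
    query = [[1.0, 0.0]]
    target = [[2.0, 0.0], [0.0, 3.0], [4.0, 0.0]]
    print(knn_match(query, target, k=2))
```

Nothing in this step is trained or knows about speech, so any limitation on singing has to come from the features going in (the encoder) or the waveform synthesis coming out (the vocoder), which matches the points above.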

Hopefully, in the not-too-distant future, we will be able to generalize the performance of kNN-VC and other models. Thanks again for your interest in our work!

billnye2 commented 9 months ago

Great support, thank you!

bshall / knn-vc

Will this work for singing voice conversion (svc)? #28