bshall / knn-vc

Voice Conversion With Just Nearest Neighbors
https://bshall.github.io/knn-vc/

Choice for k #17

Closed tebin closed 1 year ago

tebin commented 1 year ago

Hi, the paper goes over the choice for k very briefly, so I was wondering if you could share some results of the preliminary experiments. It says "when more reference audio is available (e.g. ≥10 mins), the conversion quality may even be improved by using larger values of k (in the order of k = 20)"; does the quality keep getting better past k=20, or does it start degrading after a certain point? Also, did you try k=1, which happens to be the approach this project uses? If so, what were the results?

RF5 commented 1 year ago

Hi @tebin , thanks for your interest!

> Hi, the paper goes over the choice for k very briefly, so I was wondering if you could share some results of the preliminary experiments.

For the preliminary experiments we didn't perform thorough evaluations for each value of k, but we did experiment with several choices (from k=1 up to k=100 in the extreme case). While the quality is similar across many of these settings, we found that k between roughly 4 and 20 yielded the best results.

> It says "when more reference audio is available (e.g. ≥10 mins), the conversion quality may even be improved by using larger values of k (in the order of k = 20)"; does the quality keep getting better past k=20, or does it start degrading after a certain point?

This largely depends on the size of your matching set. If you have several hours of data from your reference speaker, using values of k upwards of 100 will still produce good results (but may change the pitch contour or prosody slightly, since you are averaging over more features). But if you only have a handful of minutes of reference data, smaller values of k will yield better results: there are not enough features from the same phone/biphone to average a large number k of them together and still produce the correct phonetic content.
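
To make the averaging argument concrete, here is a minimal toy sketch in plain Python. It is not the knn-vc implementation: `knn_match`, the 2-D "features", and the two-phone matching set are all hypothetical, and cosine similarity is assumed as the similarity measure for this illustration. Each query frame is replaced by the mean of its k most similar matching-set frames.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_match(query, matching_set, k):
    """Replace each query frame with the mean of its k nearest matching-set frames.

    Hypothetical helper for illustration only; k is clamped to the
    size of the matching set.
    """
    k = min(k, len(matching_set))
    converted = []
    for frame in query:
        # Rank reference frames from most to least similar to this query frame.
        ranked = sorted(
            matching_set,
            key=lambda ref: cosine_similarity(frame, ref),
            reverse=True,
        )
        neighbors = ranked[:k]
        # Uniform average over the k selected neighbors, per dimension.
        converted.append(
            [sum(n[d] for n in neighbors) / k for d in range(len(frame))]
        )
    return converted


# Toy matching set: only three frames per "phone".
matching_set = [
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2],  # phone A, clustered near [1, 0]
    [0.0, 1.0], [0.1, 0.9], [0.2, 0.8],  # phone B, clustered near [0, 1]
]
query = [[0.95, 0.05]]  # a single source frame that belongs to phone A

for k in (1, 3, 6):
    print(k, knn_match(query, matching_set, k)[0])
```

With k=1 or k=3 the output stays inside the phone A cluster. With k=6 the matching set has run out of phone A frames, so phone B frames are pulled into the average and the output lands halfway between the two phones, which is the loss of phonetic content described above. A larger matching set would contain more frames per phone and so tolerate a larger k.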

> Also, did you try k=1, which happens to be the approach this project uses? If so, what were the results?

Yes, we did try this, and it works fairly well (it might even be optimal for very restricted amounts of reference data). But when more reference data is available, using values greater than 1 yields better results. The current code makes it fairly easy to experiment with different values of k, so if you notice any useful trends in the setting of k versus the amount of reference data, we'd be interested to hear about them!

I hope that explains things a bit more :)

tebin commented 1 year ago

Thank you for the response! That was very helpful.