bshall / knn-vc

Voice Conversion With Just Nearest Neighbors
https://bshall.github.io/knn-vc/

Question about the Used Hardware #37

Closed CCMaure closed 1 month ago

CCMaure commented 1 month ago

Thanks for the great work. kNN-VC produces great results without even the need for training. One thing I noticed is that I can't use more than 3 minutes of reference audio. If I use around 5 minutes of audio, PyTorch tries to allocate 15 GB of GPU memory. I tried it once with 11 minutes of reference audio, but then over 70 GB of memory was needed. What kind of GPU did you use for inference? Is it normal that so much memory is requested, or is there an error in how I use the toolbox? Do you have any useful tips on how the audio should be provided to the framework? E.g. file format, sampling frequency, one long file or several short snippets...

Would be great to hear from you.

Kind regards, Mr Maure

RF5 commented 1 month ago

Hi @CCMaure

The current code performs default inference using WavLM to obtain the matching and query set features. Being a transformer model, WavLM's memory requirements scale with the square of the audio length, which can easily exceed the memory limits of your system. What you can do to solve this is chunk your audio into smaller pieces, e.g. 1-minute chunks, compute the WavLM features for each chunk, and then concatenate the results back together. The output should be nearly identical, but the memory requirement will be much lower since you aren't running WavLM inference on long pieces of audio.
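
For example, a rough sketch of the chunking step using torchaudio (the file names and the assumption of one long reference file are placeholders):

```python
import torchaudio

# Split a long reference recording into ~1-minute files so that WavLM
# never processes more than 60 s of audio at a time.
wav, sr = torchaudio.load('long_reference.wav')  # placeholder path
chunk_samples = 60 * sr  # number of samples in one minute

chunk_paths = []
for i, start in enumerate(range(0, wav.shape[-1], chunk_samples)):
    path = f'reference_chunk_{i:03d}.wav'
    torchaudio.save(path, wav[:, start:start + chunk_samples], sr)
    chunk_paths.append(path)
```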

> What kind of GPU did you use for inference?

All our testing was done on a single RTX 2070 GPU with 8 GB of VRAM. Since each utterance in LibriSpeech (our evaluation dataset) is less than 30 s long, the current code was used without modification to compute the WavLM features for all the audio, after which features from the same speaker were concatenated to construct matching sets of the desired size.

> Do you have any useful tips on how the audio should be provided to the framework?

My best advice would be to simply split the audio into 1-minute chunks and compute the matching set on all of them with `matching_set = knn_vc.get_matching_set([path_of_1st_min, path_of_2nd_min, ...])`, then perform kNN matching and inference as usual, following the README.
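
Putting it together, end-to-end usage could look something like the sketch below (the model loading and `match` call follow the README; `chunk_paths` and the source path are placeholders):

```python
import torch
import torchaudio

# Load the pretrained kNN-VC model from torch hub, as in the README.
knn_vc = torch.hub.load('bshall/knn-vc', 'knn_vc',
                        prematched=True, trust_repo=True, pretrained=True)

# Build the matching set from the 1-minute reference chunks; the WavLM
# features from each chunk are concatenated internally.
chunk_paths = ['reference_chunk_000.wav', 'reference_chunk_001.wav']  # placeholders
matching_set = knn_vc.get_matching_set(chunk_paths)

# Extract query features from the (short) source utterance and convert.
query_seq = knn_vc.get_features('source.wav')  # placeholder path
out_wav = knn_vc.match(query_seq, matching_set, topk=4)

# out_wav is a 1D waveform tensor at 16 kHz.
torchaudio.save('converted.wav', out_wav[None], 16000)
```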

Hope that helps!