bshall / knn-vc

Voice Conversion With Just Nearest Neighbors
https://bshall.github.io/knn-vc/

Some questions about KNN features #35

Closed huutuongtu closed 4 months ago

huutuongtu commented 4 months ago

Have you tried using a WavLM model that has been fine-tuned on an ASR dataset to extract semantic features for the KNN query, instead of using the SSL features directly? The idea is to use KNN only to obtain the matched timestamps (frame indices), and then take the reference WavLM SSL features at those timestamps to generate the output.

RF5 commented 4 months ago

Hi @huutuongtu, yep, that might work quite well! It should also be fairly easy to experiment with without any retraining, since all you need to do is pass different tensors to the y_seq: Tensor, matching_set: Tensor, synth_set: Tensor arguments of the match(...) function. Namely, you can keep synth_set as the default WavLM SSL features, but for y_seq (the query sequence) and matching_set (the matching set) you can use features from the ASR fine-tuned WavLM.
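A minimal sketch of the idea above, for illustration only: it is not the repository's match(...) implementation. The helper names cosine_dist and knn_convert are hypothetical, and random arrays stand in for real features. It assumes cosine distance with top-k averaging, and it assumes both feature extractors emit frames at the same rate, so that row i of matching_set and row i of synth_set describe the same instant of the reference audio.

```python
import numpy as np


def cosine_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance between rows of a (T1, D) and b (T2, D)."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_n @ b_n.T


def knn_convert(query_seq: np.ndarray, matching_set: np.ndarray,
                synth_set: np.ndarray, topk: int = 4) -> np.ndarray:
    """Search in one feature space, read out from another (hypothetical helper).

    query_seq:    (T_q, D_match) source features, e.g. from an ASR fine-tuned model
    matching_set: (T_m, D_match) reference features from the same model
    synth_set:    (T_m, D_synth) reference SSL features, frame-aligned with matching_set
    """
    if matching_set.shape[0] != synth_set.shape[0]:
        raise ValueError("matching_set and synth_set must have the same number of frames")
    dists = cosine_dist(query_seq, matching_set)   # (T_q, T_m)
    idx = np.argsort(dists, axis=1)[:, :topk]      # (T_q, topk) nearest frame indices
    # Only the indices come from the matching space; the output frames are
    # averaged from the synthesis space, which is what the vocoder expects.
    return synth_set[idx].mean(axis=1)             # (T_q, D_synth)


# Toy data: 50 reference frames with two different feature dimensions.
rng = np.random.default_rng(0)
matching_set = rng.normal(size=(50, 8))
synth_set = rng.normal(size=(50, 16))
# Queries are slightly perturbed copies of reference frames 3, 10 and 42.
query_seq = matching_set[[3, 10, 42]] + 0.01 * rng.normal(size=(3, 8))
out = knn_convert(query_seq, matching_set, synth_set, topk=1)
```

With topk=1 each output frame is simply the SSL frame at the matched index. With larger topk the neighbours are averaged in the SSL space, so the vocoder still receives the kind of features it was trained on.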

If you try it out, I'd be interested to hear how it goes!

huutuongtu commented 4 months ago

Hmm, I've tried two pretrained ASR models, Facebook's wav2vec2-base-100h and the WavLM model patrickvonplaten/wavlm-libri-clean-100h-base-plus, but the results aren't satisfactory (too noisy) :D. Perhaps retraining the vocoder with semantic features could improve the results. Here is a link to some samples: https://drive.google.com/drive/folders/19lVYEi20iOzhxDXv1fiXA0PXtlvV7aHx?usp=sharing