bshall / knn-vc

Voice Conversion With Just Nearest Neighbors
https://bshall.github.io/knn-vc/

Some questions about implementation #31

Closed: TOGA101 closed this issue 7 months ago

TOGA101 commented 8 months ago
  1. Since the LibriSpeech dataset is laid out as root/subset/speaker/chapter/file, should https://github.com/bshall/knn-vc/blob/7b59579524f9fdbfe5e92952b7bec70b6253e084/prematch_dataset.py#L57 be changed to uttrs_from_same_spk = sorted(list(path.parent.parent.rglob('**/*.flac'))) so that it returns the other utterances of the same speaker, rather than only those in the same chapter?
  2. Since matching and synthesis do not necessarily have to use the same features, can I use ASR features (for example, from Whisper) for matching and the corresponding original mel spectrogram frames for synthesis?
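To make the first question concrete, here is a minimal sketch of the difference between globbing from the chapter directory and from the speaker directory. It assumes the root/subset/speaker/chapter/file layout described above; the helper names (`build_fake_librispeech`, `same_chapter`, `same_speaker`) and the file names are hypothetical and are not taken from the repository. Note that `rglob` already searches recursively, so the pattern `'*.flac'` is enough here.

```python
import tempfile
from pathlib import Path


def build_fake_librispeech(root: Path) -> Path:
    """Create an empty LibriSpeech-style tree (root/subset/speaker/chapter/file).

    Hypothetical helper for illustration only. Returns the path of one utterance.
    """
    layout = {
        "train-clean-100/19/198": ["19-198-0000.flac", "19-198-0001.flac"],
        "train-clean-100/19/227": ["19-227-0000.flac"],
        "train-clean-100/26/495": ["26-495-0000.flac"],
    }
    for chapter_dir, names in layout.items():
        directory = root / chapter_dir
        directory.mkdir(parents=True)
        for name in names:
            (directory / name).touch()
    return root / "train-clean-100/19/198/19-198-0000.flac"


def same_chapter(path: Path) -> list:
    # path.parent is the chapter directory, so only that chapter is searched.
    return sorted(path.parent.rglob("*.flac"))


def same_speaker(path: Path) -> list:
    # path.parent.parent is the speaker directory, so all chapters are searched.
    return sorted(path.parent.parent.rglob("*.flac"))


with tempfile.TemporaryDirectory() as tmp:
    path = build_fake_librispeech(Path(tmp))
    chapter_names = [p.name for p in same_chapter(path)]
    speaker_names = [p.name for p in same_speaker(path)]

print(chapter_names)
print(speaker_names)
```

In this toy tree, globbing from `path.parent` misses the utterance in the speaker's second chapter, while globbing from `path.parent.parent` picks it up and still excludes the other speaker.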
bshall commented 8 months ago

Hi @TOGA101, thanks for the questions.

  1. Yes, you are correct, that looks like a bug. That said, it might actually help a bit, since prematching is then done with less reference data and the style might be more consistent within a chapter. I don't think it will make much difference in practice, though. If you want to submit a PR with the fix, I'd be happy to merge it; otherwise I can do it at some point.
  2. You can match and synthesize on different features. I have tried matching on WavLM features and synthesizing from spectrogram frames. It works, but it seems to cause more boundary artefacts than synthesizing from the WavLM features. I haven't experimented with Whisper features, but I imagine they would work as well. If you try any of these ideas, it'd be great to hear your results.
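The idea in point 2 can be sketched as follows: find nearest neighbours in one feature space, then average the time-aligned frames from a second feature space. This is a toy illustration in plain Python, not the repository's implementation. The function names, the cosine-similarity choice, the uniform averaging, and the assumption that both reference feature sequences have the same number of frames at the same frame rate are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def knn_convert(query_match, ref_match, ref_synth, k=2):
    """Match in one feature space, synthesize from another.

    query_match: source frames in the matching space (e.g. ASR-style features).
    ref_match:   reference frames in the same matching space.
    ref_synth:   reference frames in the synthesis space (e.g. spectrogram
                 frames), assumed to be aligned one-to-one with ref_match.
    Returns one synthesis-space frame per query frame.
    """
    if len(ref_match) != len(ref_synth):
        raise ValueError("ref_match and ref_synth must be frame-aligned")
    k = min(k, len(ref_match))
    dim = len(ref_synth[0])
    converted = []
    for q in query_match:
        sims = [cosine_similarity(q, r) for r in ref_match]
        # Indices of the k most similar reference frames in the matching space.
        top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        # Uniform average of the corresponding frames in the synthesis space.
        converted.append([sum(ref_synth[i][d] for i in top) / k for d in range(dim)])
    return converted


# Toy data: 2-dimensional matching features, 3-dimensional synthesis features.
ref_match = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ref_synth = [[10.0, 0.0, 0.0], [20.0, 0.0, 0.0], [0.0, 5.0, 5.0]]
query_match = [[1.0, 0.05], [0.0, 2.0]]

converted = knn_convert(query_match, ref_match, ref_synth, k=2)
print(converted)
```

Because each output frame is assembled independently from whichever reference frames happen to match, neighbouring output frames can come from unrelated parts of the reference, which is one plausible way to picture the boundary artefacts mentioned above. A practical version would likely use batched tensor operations rather than Python loops.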
TOGA101 commented 8 months ago

OK, thanks.