Some questions about implementation

TOGA101 commented 8 months ago

Since the structure of the LibriSpeech dataset is root/subset/speaker/chapter/file, should https://github.com/bshall/knn-vc/blob/7b59579524f9fdbfe5e92952b7bec70b6253e084/prematch_dataset.py#L57 be modified to uttrs_from_same_spk = sorted(list(path.parent.parent.rglob('**/*.flac'))) to return other utterances of the same speaker?
Since matching and synthesis do not necessarily use the same features, can I use ASR features (like from Whisper) for matching and its corresponding original Mel spectrogram for synthesis?
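
To make the first question concrete, here is a minimal sketch of the difference between globbing from the chapter directory and from the speaker directory. It assumes the linked line currently searches from path.parent; the helper names and the toy directory tree are hypothetical and are not taken from the repository. Note that rglob already searches recursively, so '*.flac' is enough as a pattern.

```python
import tempfile
from pathlib import Path


def same_chapter_utterances(path: Path) -> list:
    # path.parent is the chapter directory, so only utterances
    # from the same chapter are returned.
    return sorted(path.parent.rglob("*.flac"))


def same_speaker_utterances(path: Path) -> list:
    # path.parent.parent is the speaker directory, so utterances
    # from every chapter of that speaker are returned.
    return sorted(path.parent.parent.rglob("*.flac"))


def build_toy_librispeech(root: Path) -> Path:
    # Hypothetical miniature tree: root/subset/speaker/chapter/file.
    files = [
        "train-clean-100/19/198/19-198-0000.flac",
        "train-clean-100/19/198/19-198-0001.flac",
        "train-clean-100/19/227/19-227-0000.flac",
        "train-clean-100/26/495/26-495-0000.flac",
    ]
    for rel in files:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    return root / files[0]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        utterance = build_toy_librispeech(Path(tmp))
        print(len(same_chapter_utterances(utterance)))  # 2: chapter 198 only
        print(len(same_speaker_utterances(utterance)))  # 3: chapters 198 and 227
```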

bshall commented 8 months ago

Hi @TOGA101, thanks for the questions.

Yes, you are correct, that looks like a bug. That said, it might actually help a bit, since prematching is then done with less reference data and the style might be more consistent within a chapter. I don't think it'll make much difference in practice, though. If you want to submit a PR with the fix, I'd be happy to merge it, or I can do it at some point.
You can match and synthesize on different features. I have tried matching on WavLM features and synthesizing on spectrogram frames. It works, but it seems to cause more boundary artefacts than synthesizing from the WavLM features. I haven't experimented with Whisper features but I imagine it'll work as well. If you try any of these ideas it'd be great to hear your results.
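
As a rough illustration of matching in one feature space and synthesizing from another, the sketch below finds the k nearest reference frames in the matching space and averages the corresponding frames of a second, frame-aligned feature stream. It is not the repository's implementation: the function name, the use of cosine similarity, and the default k are assumptions made for this example, and it requires both reference streams to have exactly one frame per time step at the same frame rate.

```python
import numpy as np


def knn_match_cross_features(query_match, ref_match, ref_synth, k=4):
    """Match on one feature type, return averaged frames of another.

    query_match: (T_q, D_m) query frames in the matching space.
    ref_match:   (T_r, D_m) reference frames in the matching space.
    ref_synth:   (T_r, D_s) reference frames in the synthesis space,
                 aligned one-to-one with ref_match.
    Returns an array of shape (T_q, D_s).
    """
    if ref_match.shape[0] != ref_synth.shape[0]:
        raise ValueError("ref_match and ref_synth must be frame-aligned")

    eps = 1e-8  # guards against division by zero for all-zero frames
    q = query_match / (np.linalg.norm(query_match, axis=1, keepdims=True) + eps)
    r = ref_match / (np.linalg.norm(ref_match, axis=1, keepdims=True) + eps)

    # Cosine similarity between every query and reference frame: (T_q, T_r).
    sim = q @ r.T

    # Indices of the k most similar reference frames for each query frame.
    k = min(k, ref_match.shape[0])
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]

    # Look up the synthesis-space frames at those indices and average them.
    return ref_synth[idx].mean(axis=1)
```

The result would then be passed to a vocoder trained on the synthesis features. If the two feature types have different frame rates (for example a spectrogram with a shorter hop than the matching features), one of them has to be resampled first so that the indices line up.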

TOGA101 commented 8 months ago

OK, thanks.
