Since matching and synthesis do not necessarily use the same features, can I use ASR features (like from Whisper) for matching and its corresponding original Mel spectrogram for synthesis?
Yes, you are correct, that looks like a bug. It might actually help a bit, though, since prematching is then done with less reference data and the style may be more consistent within a chapter. I don't think it'll make much difference in practice. If you want to submit a PR with the fix, I'd be happy to merge it; otherwise I can do it at some point.
You can match and synthesize on different features. I have tried matching on WavLM features and synthesizing on spectrogram frames. It works, but it seems to cause more boundary artefacts than synthesizing from the WavLM features. I haven't experimented with Whisper features, but I imagine they'll work as well. If you try any of these ideas, it'd be great to hear your results.
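To make the idea concrete, here is a rough sketch of matching in one feature space and pulling frames from another. It is not the repo's implementation: the function name `match_and_swap`, the cosine-similarity choice, and `k=4` are all assumptions for illustration. It also assumes the matching features and the synthesis frames of the reference are time-aligned, i.e. same frame rate and same number of frames, so you would need to pick the mel hop size to match the encoder's frame rate.

```python
import numpy as np

def match_and_swap(query_match, ref_match, ref_synth, k=4):
    """Hypothetical helper: kNN in the matching space, output in the synthesis space.

    query_match: (n_query, d_match) matching features of the source utterance
    ref_match:   (n_ref, d_match)   matching features of the reference speaker
    ref_synth:   (n_ref, d_synth)   synthesis frames (e.g. mel), assumed to be
                                    time-aligned with ref_match row by row
    """
    # L2-normalise so the dot product equals cosine similarity.
    q = query_match / np.linalg.norm(query_match, axis=1, keepdims=True)
    r = ref_match / np.linalg.norm(ref_match, axis=1, keepdims=True)
    sim = q @ r.T  # (n_query, n_ref)

    # Indices of the k most similar reference frames for each query frame.
    topk = np.argpartition(-sim, k - 1, axis=1)[:, :k]

    # Gather the corresponding frames from the synthesis space and average them.
    return ref_synth[topk].mean(axis=1)  # (n_query, d_synth)
```

The output would then go to whatever vocoder was trained on that synthesis representation. Averaging neighbours that are close in the matching space but not in the spectrogram space may be one source of the boundary artefacts mentioned above, though that is only a guess.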
uttrs_from_same_spk = sorted(list(path.parent.parent.rglob('**/*.flac')))
to return other utterances of the same speaker?
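For reference, a small standalone sketch of what that lookup would do. It assumes a LibriSpeech-style layout of `speaker/chapter/utterance.flac`, so `path.parent` is the chapter and `path.parent.parent` is the speaker. The function name is made up for illustration, and dropping the query utterance itself is also an assumption.

```python
from pathlib import Path

def other_utterances_of_speaker(path, ext=".flac"):
    """Hypothetical helper: all utterances by the same speaker, across chapters."""
    path = Path(path)
    # Two levels up from the file: chapter directory, then speaker directory.
    speaker_dir = path.parent.parent
    # rglob is already recursive, so a plain "*.flac" pattern covers every chapter.
    return sorted(p for p in speaker_dir.rglob(f"*{ext}") if p != path)
```

Using `path.parent` instead would restrict the pool to a single chapter, which is the behaviour being discussed here.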