bshall / knn-vc

Voice Conversion With Just Nearest Neighbors
https://bshall.github.io/knn-vc/

Question about WavLM layer choice #13

Closed · space-pope closed this issue 1 year ago

space-pope commented 1 year ago

In your paper, you say:

Recent work confirms that later layers give poorer predictions of pitch, prosody, and speaker identity. Based on these observations, we found that using a layer with high correlation with speaker identification – layer 6 in WavLM-Large – was necessary for good speaker similarity and retention of the prosody information from the source utterance.

The reference associated with that passage, though, doesn't seem to examine WavLM-Large, only WavLM-Base, and my reading of it is that WavLM-Base's earlier layers (0-2) are more correlated with pitch and energy reconstruction (common speaker-ID features).

I'm wondering how you came to use layer 6 of the Large model and whether you tried other layers. I'm having trouble locating other research that dives into layer-wise feature correlations for these models, so any pointers you can provide would be helpful.

Thanks!

RF5 commented 1 year ago

Hi @space-pope, thanks for your interest!

I think the prior context for that passage might aid understanding:

In preliminary experiments, we used features from later layers (22, 24, and the mean of the last several layers), which perform well on linear phone recognition tasks [6]. The idea was to improve nearest neighbors mapping by including more content information. However, this led to worse pitch and energy reconstruction. Recent work [15] confirms that later layers give poorer predictions of pitch, prosody, and speaker identity. Based on these observations, we found that using a layer with high correlation with speaker identification – layer 6 in WavLM-Large – was necessary for good speaker similarity and retention of the prosody information from the source utterance.

Here, the two references are meant to be understood together: [6] analyses several tasks for both WavLM-Base+ and WavLM-Large, while [15] considers additional tasks but only analyses the layer-wise contributions from WavLM-Base. From [6] we can see that there is an extremely strong correlation between the layer-wise weightings of WavLM-Base+ and WavLM-Large, e.g. if the last few layers of WavLM-Base+ are highly weighted for a task, we can expect the last few layers of WavLM-Large to also be highly weighted for that same task.

So, from [15] we know the later layers of WavLM-Base perform poorly on pitch and energy reconstruction (important aspects of prosody). Taken together with [6] (this is what we referred to as 'these observations'), we can infer that the later layers of WavLM-Large will also struggle with pitch and energy reconstruction.

And, as you mention and as hinted at in the passage, in preliminary experiments we did try several other layers, and found layer 6 to be the best of the settings we tested -- but the other layers also perform reasonably (i.e. none of the layers are completely unusable). While we are not fully certain of the reason for this, we suspect it is because of layer 6's high weight for speaker identification [6].

The earlier layers might have better pitch and energy reconstruction, but they are lower-level features, and so yield slightly more artifacts after the kNN matching operation during vocoding. I.e. we suspect that if pitch is too readily available, then the effective shuffling of WavLM frames after the kNN matching operation (which distorts the pitch information between adjacent frames) causes the output pitch contour to also be more distorted. However, there is much more room for investigation here, as we are not certain of all the effects at play.
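To make that 'shuffling' concrete, here is a stripped-down sketch of the matching step (cosine similarity, mean over the k nearest frames); the random tensors are just stand-ins for real layer-6 WavLM-Large features:

```python
import torch
import torch.nn.functional as F

def knn_match(query, matching_set, k=4):
    # query:        (T, D) WavLM features for the source utterance
    # matching_set: (N, D) WavLM features pooled from the target speaker
    q = F.normalize(query, dim=-1)
    m = F.normalize(matching_set, dim=-1)
    sims = q @ m.T  # (T, N) cosine similarity

    # Indices of the k most similar target frames for each source frame.
    idx = sims.topk(k, dim=-1).indices  # (T, k)

    # Each output frame is the mean of k target frames chosen independently
    # per frame -- this is the "shuffling" that can break pitch continuity
    # between adjacent frames.
    return matching_set[idx].mean(dim=1)  # (T, D)

# Toy usage: random stand-ins for layer-6 WavLM-Large features (D = 1024).
query = torch.randn(200, 1024)
matching_set = torch.randn(5000, 1024)
converted = knn_match(query, matching_set, k=4)
```

The converted frames then go to the vocoder; the more low-level (pitch-carrying) the features, the more audible this per-frame mixing becomes.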

Hope that helps a bit with some of the intuition behind why we use layer 6 :)

space-pope commented 1 year ago

Wow... that's what I get for reading a paper, then several days later getting so laser-focused on a task that I go back to the paper, search for "layer 6", and only read the two sentences and one reference around the search result. Thanks for the kind and detailed response to a half-baked question.

Fig. 2 from the original WavLM paper ([6]) does show that WavLM-Large's layer-wise task contributions are sort of a "stretched" version of WavLM-Base+'s 12 layers, but the correlation isn't perfect (for example, layer 24 in Large seems to do well in the speaker ID task, but none of the later layers in Base does). Perhaps there's some interference from the semantic information captured by the later layers, and that makes the best-performing early layer a better choice for the VC task.

Given the info in the original WavLM paper, I guess my main remaining question is whether WavLM-Base+ layers 4/6 would perform similarly. No way to know except trying, I suppose :). Thanks again -- and thanks for releasing your research and code; it's a creative and elegant use of pretrained models that doesn't add a lot of extra machinery to the process, which is refreshing in the current environment.
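For anyone else curious, "trying" would look roughly like this sketch, using the HuggingFace `microsoft/wavlm-base-plus` checkpoint rather than this repo's WavLM loading code (the audio path and the normalisation step are my own placeholders/assumptions):

```python
import torch
import torchaudio
from transformers import WavLMModel

# Load the HF WavLM-Base+ checkpoint (12 transformer layers, hidden size 768).
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
model.eval()

wav, sr = torchaudio.load("source.wav")  # placeholder path, mono assumed
wav = torchaudio.functional.resample(wav, sr, 16000)
wav = (wav - wav.mean()) / (wav.std() + 1e-7)  # rough zero-mean/unit-variance normalisation

with torch.inference_mode():
    out = model(wav, output_hidden_states=True)

# hidden_states[0] is the CNN feature projection output; hidden_states[i]
# is transformer layer i, so layers 4 and 6 are:
layer4 = out.hidden_states[4]  # (1, T, 768)
layer6 = out.hidden_states[6]  # (1, T, 768)
```

Those features could then be dropped into the same kNN matching step above to compare against Large's layer 6.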