OlaWod / FreeVC

FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion

I have read your paper and have a few questions #7

Closed: splinter21 closed this issue 1 year ago

splinter21 commented 1 year ago

1. About the speaker encoder. The experimental conclusion in the paper is that, in terms of speaker similarity, the pretrained d-vector beats the simply trained speaker encoder (LSTM + linear + mean). The speaker encoder structure used in the Interspeech 2022 best paper on zero-shot TTS, and the mel-input speaker encoder used in TransferTTS (https://arxiv.org/abs/2106.03153), are, like the -s variant in this paper, trained from scratch without pretraining. I think a comparison against them would be valuable; neither of those works ran an A/B experiment against a pretrained d-vector.

2. About timbre leakage.
(1) About spectrogram augmentation:
A. Using a vocoder to augment timbre, pitch and duration does change the timbre, but I worry it may degrade WavLM's semantic recognition performance. Have you considered an A/B experiment against traditional pitch-shift / timbre-shift algorithms? (See the sketch after this list for what I mean by a traditional augmentation.)
B. Is it possible that the model itself learns to undo the pitch/timbre augmentation during reconstruction? That is, it still essentially treats the pitch-shifted training source and the target as the same timbre, and automatically learns to reconstruct the source in its original, "best" form, which would still be timbre leakage.
(2) About bottleneck dimension compression. From the code it seems the features are compressed to 192 (please correct me if I am wrong). 192 is not much smaller than the 256 of HuBERT Base; it is only a large compression relative to 1024. I think this is still not small enough: the dimensionality that self-supervised features need for VC is far less than 1024. It might be worth compressing much more aggressively, e.g. to only 4 dimensions, plus quantization and so on.

3. About the experimental data (the screenshot here is the MOS table from the paper). The MOS for unseen-to-unseen beats seen-to-seen, which I find hard to understand.
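For illustration of what a "traditional" DSP-based augmentation in 2(1)A could look like, here is a minimal sketch using librosa (random pitch shift plus time stretch). The file names, parameter ranges, and the helper `augment_traditional` are assumptions for the example, not code from FreeVC:

```python
# Hypothetical sketch of a traditional DSP augmentation (pitch shift + time
# stretch), as opposed to vocoder-based SR augmentation. Parameter ranges and
# file names are made up for illustration.
import random

import librosa
import soundfile as sf


def augment_traditional(wav_path, out_path, sr=16000):
    # Load at a fixed sampling rate (16 kHz assumed here).
    y, _ = librosa.load(wav_path, sr=sr)

    # Random pitch shift in semitones and random tempo change.
    n_steps = random.uniform(-4.0, 4.0)
    rate = random.uniform(0.85, 1.15)

    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    y = librosa.effects.time_stretch(y, rate=rate)

    sf.write(out_path, y, sr)


augment_traditional("p225_001.wav", "p225_001_aug.wav")
```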

splinter21 commented 1 year ago

An explanation of my reconstruction concern: I used to play with a voice changer for a galgame character. The protagonist is a young-girl character, and while doing data augmentation in Adobe Audition I found that, after lowering the pitch by 2 semitones, the audio sounded surprisingly like the voice actor's real recording, while the in-game character's timbre sounded like the pitch-shifted version. By judging which voice sounded more naturally human, I reverse-engineered the production process: the original recording had been raised by 2 semitones to create the character's voice. That is a reconstruction. A 0-key recording, after being shifted up or down for augmentation, is certainly less natural and less realistic than the original 0-key recording. If my ears can invert the shift, I believe the model can also learn to reconstruct the audio back to 0 key, and then the source timbre we worry about at inference time is restored and leaked again.

OlaWod commented 1 year ago
1. A better speaker encoder structure can bring better results. In our paper we only want to show that, as long as the extracted content representation is clean enough, the speaker encoder will learn to model the missing speaker information, even with such an extremely simple speaker encoder structure (a minimal sketch of this structure is given after this list).
2(1)A. I think as long as the vocoder is good enough, the quality degradation won't be significant. I've never seen anyone do an ablation study on data augmentation methods; they just propose them. So currently I don't have plans to do this ablation, sorry.
2(1)B. That is exactly why we compress the bottleneck. With a naive autoencoder we can do waveform reconstruction; if we compress its latent dimension to a proper size, we can do the VC task instead.
2(2). Yes, it's 192. A too-narrow bottleneck loses some content information, while a too-wide bottleneck retains some speaker information. With a bottleneck dimension of 4, a lot of content information would be lost. Searching for the best bottleneck dimension is troublesome, so we use the SR-based augmentation to help the model learn to discard the residual speaker information in the 192-dim bottleneck (a conceptual sketch of this compression is given after this list). As for quantization: at the very beginning of our experiments we used residual vector quantization after the 192-dim bottleneck and found that it did not bring any significant improvement, so we removed it.
3. I think this may be because of the quality of the source speech. Seen sources, which are from VCTK, generally have less clear pronunciation (like p259_464), while unseen sources, which are from LibriTTS, have more background noise (like 5105_28233_000016_000001). From the demo page we can hear that our model can ignore the noise, but the pronunciation, which is also part of the content, remains the same. Also, some unseen sources are much longer (5105_28233_000016_000001 is 21 seconds long); I don't know whether the wav length affects the quality judgement.
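For reference, a minimal PyTorch sketch of the "LSTM + linear + mean" speaker encoder structure mentioned in question 1 and answer 1. The layer sizes, the mel dimension, and the class name are assumptions for illustration, not the exact FreeVC-s configuration:

```python
# Minimal sketch of the "extremely simple" speaker encoder discussed above
# (LSTM -> linear -> mean over time). Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn


class SimpleSpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256, gin_channels=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, gin_channels)

    def forward(self, mel):        # mel: (batch, frames, n_mels)
        out, _ = self.lstm(mel)    # (batch, frames, hidden)
        out = self.proj(out)       # (batch, frames, gin_channels)
        return out.mean(dim=1)     # average over time -> one embedding per utterance


# Example: a batch of 2 utterances, 120 mel frames each.
enc = SimpleSpeakerEncoder()
spk_emb = enc(torch.randn(2, 120, 80))
print(spk_emb.shape)  # torch.Size([2, 256])
```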
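And a conceptual sketch of the bottleneck compression discussed in 2(2): projecting 1024-dim WavLM features down to a narrow code so that little capacity is left for speaker information. The single linear layer and the names here are illustrative assumptions; the actual FreeVC bottleneck extractor is more involved:

```python
# Conceptual sketch only: compress 1024-dim SSL features to a D-dim bottleneck.
# Swapping bottleneck_dim=192 for e.g. 4 is the "more aggressive" compression
# asked about above; too narrow loses content, too wide leaks speaker info.
import torch
import torch.nn as nn


class LinearBottleneck(nn.Module):
    def __init__(self, ssl_dim=1024, bottleneck_dim=192):
        super().__init__()
        self.down = nn.Linear(ssl_dim, bottleneck_dim)  # 1024 -> 192

    def forward(self, ssl_feats):       # (batch, frames, 1024)
        return self.down(ssl_feats)     # (batch, frames, bottleneck_dim)


feats = torch.randn(1, 200, 1024)
print(LinearBottleneck(bottleneck_dim=192)(feats).shape)  # torch.Size([1, 200, 192])
print(LinearBottleneck(bottleneck_dim=4)(feats).shape)    # torch.Size([1, 200, 4])
```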