Sorry for the late reply...
- According to https://github.com/Wendison/VQMIVC/blob/851b4f5ca5bb60c11fea6a618affeb4979b17cf3/convert.py#L25 , the files are filtered based on a duration criterion. Is this applied to the test set for any of the objective/subjective evaluations reported in the paper?
No, convert.py is only used for demonstration purposes. The wav length is not used to filter the test set used in the paper.
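(For anyone reading along, a minimal sketch of what such a duration filter looks like; the bounds and the directory path below are assumptions for illustration, not the exact values in convert.py.)

```python
import glob
import soundfile as sf

MIN_SEC, MAX_SEC = 2.0, 10.0          # assumed bounds, for illustration only

def keep_by_duration(wav_path):
    # read the waveform and keep it only if its duration falls inside the window
    wav, sr = sf.read(wav_path)
    return MIN_SEC <= len(wav) / sr <= MAX_SEC

wav_paths = glob.glob("wavs/*.wav")   # hypothetical demo directory
demo_files = [p for p in wav_paths if keep_by_duration(p)]
```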
- Table 2 is explained by the speaker representation capturing content information. Because the speaker representation is a single global vector, do you have an intuition for what content information the global vector can capture besides the length of the input sequence (although that information is unlikely to leak, due to the wrap padding)?
I think this 'global vector' may contain other global information related to the content, such as intention or emotion. When the reference utterance carries such information that differs from the source speech, the mismatch appears to harm the pronunciation of the converted speech.
- Table 3: Because the data splits have non-overlapping speakers, how exactly do you train the speaker classifier? Normally, one can train the classifier given features derived from a trained encoder using the training set, then evaluate the classifier with the test set. That is, the train/test split is consistent for both the classifier and the encoder (or VQMIVC), and this can be done because the train and test sets share the same vocabularies or classes. However, we have non-overlapping speaker classes here, so I wonder how the speaker classifier is trained and evaluated.
I only use the speakers in the test set for training and testing the speaker classifier; note that the training and testing sets of the speaker classifier share the same set of speakers but have non-overlapping utterances.
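A minimal sketch of that probing protocol, assuming an 80/20 utterance split per speaker (the ratio and function names are my own placeholders, not taken from the repo):

```python
from collections import defaultdict
import random

def split_probe_data(utts, train_ratio=0.8, seed=0):
    """utts: list of (utterance_id, speaker_id) pairs drawn from the VC test-set speakers.
    Returns (train, eval) lists with the same speaker set but non-overlapping utterances."""
    by_spk = defaultdict(list)
    for utt_id, spk in utts:
        by_spk[spk].append(utt_id)
    rng = random.Random(seed)
    train, evaluation = [], []
    for spk, ids in by_spk.items():
        rng.shuffle(ids)
        cut = int(len(ids) * train_ratio)
        train += [(u, spk) for u in ids[:cut]]       # classifier training utterances
        evaluation += [(u, spk) for u in ids[cut:]]  # held-out utterances, same speakers
    return train, evaluation
```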
- I wonder if you have considered speaker verification (SV) with a pre-trained speaker encoder as an alternative way to evaluate the speaker representation. The speaker encoder of a VC model can learn a discriminative speaker embedding and still perform poorly in terms of speaker conversion, because the decoder can't utilise the information well. As a result, SV might be a good complementary metric.
I have tried using an SV model to extract speaker embeddings and visualize them via t-SNE; distinct clusters are formed, which indicates that the converted speech has different timbres. I didn't include these results in the paper due to page limits. You can use an SV model to compute SV metrics, e.g., EER, which is widely used in other VC papers, to better evaluate VC models.
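A minimal sketch of such an SV-based check, assuming Resemblyzer as the pre-trained speaker encoder and cosine-similarity trial scores (the model choice, scoring scheme, and file paths are my assumptions, not what the paper used):

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav  # assumed SV model; any pre-trained speaker encoder works
from sklearn.metrics import roc_curve

encoder = VoiceEncoder()

def embed(path):
    # d-vector style embedding of one utterance
    return encoder.embed_utterance(preprocess_wav(path))

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def eer(labels, scores):
    """labels: 1 for target (same-speaker) trials, 0 for non-target trials;
    scores: cosine similarities. Returns the equal error rate."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

# Example trial (paths are placeholders): compare a converted utterance against
# an enrollment utterance of the intended target speaker.
# score = cosine(embed("converted.wav"), embed("target_enroll.wav"))
```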
- Based on your experiments, how sensitive is VQMIVC to hyper-parameters such as the loss weights and the CPC parameters? I am trying to implement the model and keep it consistent with other baselines, which means deviating from the parameters you set.
I think the MI loss weights are very important for the final VC performance; they need to be carefully tuned. The CPC parameters may also be sensitive, but I'm not sure, as I haven't done such an ablation study.
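For illustration only, a sketch of how weighted loss terms might be combined when tuning; the term names and values are placeholders, not the actual VQMIVC configuration:

```python
# Placeholder weights; the real VQMIVC config uses its own names, values, and schedules.
loss_weights = {
    "recon": 1.0,   # reconstruction loss
    "cpc":   1.0,   # CPC loss on the content encoder
    "mi":    0.1,   # MI-minimization terms -- the weight the author says needs careful tuning
}

def total_loss(losses, w=loss_weights):
    """losses: dict of scalar loss tensors keyed like loss_weights."""
    return sum(w[k] * losses[k] for k in w)
```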
Thank you for your thorough reply!
Thanks for sharing the code. I have the following questions, which I hope you can clarify for me, if possible.