Wendison / VQMIVC

Official implementation of VQMIVC: One-shot (any-to-any) Voice Conversion @ Interspeech 2021 + Online playing demo!

lf0 question about convert phase #34

Open powei-C opened 2 years ago

powei-C commented 2 years ago

Hi, I'm wondering why you normalize the f0 series before feeding it to the f0 encoder in convert.py, given that this kind of f0 normalization isn't applied in the preprocessing phase.

Wendison commented 2 years ago

Hi, normalizing f0 aims to remove the speaker characteristics. During the preprocessing phase f0 is not normalized, but during training and inference it is, as shown below: https://github.com/Wendison/VQMIVC/blob/851b4f5ca5bb60c11fea6a618affeb4979b17cf3/dataset.py#L53 https://github.com/Wendison/VQMIVC/blob/851b4f5ca5bb60c11fea6a618affeb4979b17cf3/convert_example.py#L57
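For reference, here is a minimal sketch of that kind of normalization (per-utterance zero-mean/unit-variance over the voiced frames, in the log domain); the function name and details are illustrative, not copied from dataset.py:

```python
import numpy as np

def normalize_lf0(f0):
    """Per-utterance normalization of log-f0 over voiced frames.

    Illustrative sketch only; the function name and details are mine,
    not taken from the repo. `f0` is a 1-D array of frame-level f0
    values, with 0 marking unvoiced frames.
    """
    lf0 = np.zeros_like(f0, dtype=np.float64)
    voiced = f0 > 0
    lf0[voiced] = np.log(f0[voiced])  # work in the log domain
    mean, std = lf0[voiced].mean(), lf0[voiced].std()
    # Zero-mean/unit-variance over voiced frames removes the speaker's
    # absolute pitch range, keeping mainly the intonation contour.
    lf0[voiced] = (lf0[voiced] - mean) / (std + 1e-8)
    return lf0
```

Because the statistics are computed per utterance, the same normalization can be applied at inference time to an unseen speaker, which is why it is done on the fly rather than during preprocessing.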

powei-C commented 2 years ago

Hi, thank you for your explanation! I have another question, about perplexity when training the model on another dataset. I found that the perplexity didn't keep increasing (the run in the figure is at around 360 epochs); is that reasonable? And do you have any suggestions for diagnosing this issue?

[attached figure: training curves]

Wendison commented 2 years ago

The perplexity should be increasing during training, as higher perplexity indicates that the vectors in the VQ codebook are distinguishable and can be used to represent different acoustic units. I also see that your recon_loss is high. Based on my experience, recon_loss should fall below 0.5 before you obtain good converted samples.
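For context, perplexity here is the standard VQ-VAE measure of codebook usage; a minimal sketch (the function name is mine, not taken from the repo):

```python
import torch

def codebook_perplexity(encodings):
    """Standard VQ-VAE perplexity of codebook usage (illustrative sketch).

    `encodings`: one-hot code assignments, shape (num_frames, codebook_size).
    """
    avg_probs = encodings.float().mean(dim=0)  # usage frequency per code
    # Exponential of the entropy of the usage distribution: it equals
    # codebook_size when every code is used equally often, and drops
    # toward 1 when the encoder collapses onto a few codes.
    return torch.exp(-torch.sum(avg_probs * torch.log(avg_probs + 1e-10)))
```

So a perplexity curve that plateaus well below the codebook size suggests many codes are unused, which is consistent with the codebook not yet covering distinct acoustic units.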