Labmem-Zhouyx / CDFSE_FastSpeech2

The Official Implementation of “Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis”
MIT License

What should the normal loss values be #1

Closed: josh-zhu closed this issue 2 years ago

josh-zhu commented 2 years ago

Hi, thanks for your work and for sharing the code. I tried to train the CDFSE model on the AISHELL-3 dataset, but in the end I got abnormal synthesis results with no audible speech and a very low classification accuracy, as follows:

Step 474000/500000, Total Loss: 11.3225, Mel Loss: 0.4685, Mel PostNet Loss: 0.4679, Pitch Loss: 0.0482, Energy Loss: 0.0543, Duration Loss: 0.0333, Cls Loss: 5.8576, Cls acc: 0.0400

Could you help me figure out which step might be wrong and lead to this result, and what the normal loss values should look like?

Labmem-Zhouyx commented 2 years ago

Hello, thank you very much for your interest in this work. For the AISHELL-3 dataset, the normal Cls Loss is about 4.7 at 500k steps, and the accuracy is about 0.65.

If the Cls Loss stays around 5.8, it probably means the Phoneme Classifier is not working. You can check the “weight:cls” setting in config/AISHELL3/model.yaml.
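For a quick sanity check, something like the following can print any classifier-related weights found in the config (a minimal sketch; the actual key nesting in model.yaml may differ from what this recursive search assumes):

```python
# Sanity-check sketch: report any "cls"-related numeric weights found in
# config/AISHELL3/model.yaml. The actual key nesting may differ in this repo.
import yaml

with open("config/AISHELL3/model.yaml") as f:
    model_config = yaml.safe_load(f)

def find_cls_weights(node, path=""):
    """Recursively print numeric config entries whose key mentions 'cls'."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            if "cls" in str(key).lower() and isinstance(value, (int, float)):
                print(f"{child} = {value}")
            else:
                find_cls_weights(value, child)

find_cls_weights(model_config)  # should report a classifier weight of 1.0 by default
```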

In addition, the code of this repo has not been fully organized. We will release an official implementation with instructions later.

josh-zhu commented 2 years ago

Much appreciated for the quick reply and the help. The classifier weight is kept at its default value of 1.0. I used a different HiFi-GAN config and model, which may have led to the weird result. I will train a new HiFi-GAN and run a new experiment. I have learned a lot from your repos, thank you again!

Labmem-Zhouyx commented 2 years ago

> Much appreciated for the quick reply and the help. The classifier weight is kept at its default value of 1.0. I used a different HiFi-GAN config and model, which may have led to the weird result. I will train a new HiFi-GAN and run a new experiment. I have learned a lot from your repos, thank you again!

I found that this error might be due to the code in text/symbols.py (we neglected to update it in past commits). When running on AISHELL-3, the symbol dict should exclude irrelevant phonemes (such as English letters and other symbols); otherwise they greatly increase the difficulty of phoneme classification and cause this loss to fail. Hope that solves your problem.
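For anyone hitting the same issue, here is a minimal sketch of a Mandarin-only text/symbols.py; the names (pinyin.valid_symbols, the "@" prefix, _silences) follow the common FastSpeech2 layout and may differ slightly in this repo:

```python
# text/symbols.py -- illustrative sketch only. English letters and ARPAbet
# entries are dropped so the phoneme classifier only sees symbols that can
# actually occur in AISHELL-3 transcripts.
from text import pinyin  # assumed module exposing the Mandarin phoneme list

_pad = "_"
_silences = ["@sp", "@spn", "@sil"]

# Prepend "@" to phoneme symbols to keep them distinct from raw characters.
_pinyin = ["@" + s for s in pinyin.valid_symbols]

symbols = [_pad] + _pinyin + _silences
```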

josh-zhu commented 2 years ago

Thank you @Labmem-Zhouyx, I retrained the model and got losses roughly the same as those you showed above. I added some custom voice data for synthesis, and it seems that speaker similarity for unseen speakers drops a lot, while the content/phoneme features remain as good as for seen speakers. Would you mind sharing a way to contact you, or adding my QQ: 359974136, for further discussion?

Labmem-Zhouyx commented 2 years ago

That is a good question. If the unseen speaker's voice is significantly different from the AISHELL-3 data, the timbre similarity may drop sharply, because we have not made special designs in this regard (such as a GE2E loss, a speaker encoder pre-trained on a very large multi-speaker dataset, contrastive learning, etc.). As described in the paper, our method mainly aims to transfer personal pronunciation characteristics related to phoneme content so as to improve perceived speaker similarity, and it seems to work in most cases. The phenomenon you mention is the major problem faced by current one-shot/zero-shot TTS systems: there is a domain gap between the training set and the unseen speaker, and it also affects GSE, CLS, and other methods.
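As a rough, objective way to track this, one could compare embeddings from a third-party pre-trained speaker encoder such as Resemblyzer (not part of this repo; the file paths below are just placeholders):

```python
# Rough objective check of timbre similarity with a third-party pre-trained
# speaker encoder (Resemblyzer). File paths are placeholders.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

ref_embed = encoder.embed_utterance(preprocess_wav("unseen_speaker_reference.wav"))
syn_embed = encoder.embed_utterance(preprocess_wav("synthesized_sample.wav"))

# Resemblyzer embeddings are L2-normalized, so the dot product is the cosine
# similarity; values closer to 1.0 indicate more similar timbre.
print("speaker cosine similarity:", float(np.dot(ref_embed, syn_embed)))
```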

To further improve the speaker timbre similarity for unseen speakers, I think there are the following possible ways:

  1. Incorporate some careful designs mentioned above to improve the generalization of speaker embedding.
  2. Do model adaptation (fine-tune the model with speech data of the target speaker; a minimal sketch follows this list).
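Here is a minimal sketch of what option 2 could look like with a generic PyTorch model; the real entry point, checkpoint format, and loss computation come from this repo's training code, so the names below are placeholders:

```python
# Minimal adaptation sketch for option 2. "model" is any pre-trained module
# whose forward pass returns its total training loss; the actual training
# utilities in this repo may be organized differently.
import torch

def adapt_to_target_speaker(model: torch.nn.Module, target_loader, steps: int = 2000):
    """Fine-tune a pre-trained multi-speaker model on a small target-speaker set."""
    # A small learning rate helps avoid forgetting multi-speaker knowledge.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    model.train()
    step = 0
    while step < steps:
        for batch in target_loader:
            optimizer.zero_grad()
            loss = model(**batch)  # assumption: the model returns a scalar loss
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            step += 1
            if step >= steps:
                break
    return model
```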

By the way, I was unable to find your QQ account by searching for that number. If you are interested in further discussion, you can send an email to zhouyx20@mails.tsinghua.edu.cn :)