Hi,
First, I want to congratulate you on this amazing work, and I'd like to share my experience.
I decided to give it a go and tested a dataset on an RTX 3090 for 1500 epochs (~217k steps). It took almost 4 days.
I can say that the audio quality is perfect: I can't notice any difference between the source audio and the generated audio. This is something I just couldn't get with VITS1, where I could always notice quality differences between source clips and generated clips. Again, this is strictly about audio quality.
However, when it comes to pronunciation, some words are always mispronounced. I trained the same dataset with a few different configs.
I trained with the following combinations (a sketch of how I toggle these flags follows the list):
use_sdp = true and use_duration_discriminator = true:
The generated clips sound like the speaker is drunk; it sometimes pronounces words wrong and does not produce good output at all.
use_sdp = false and use_duration_discriminator = true:
This one does not seem to have pronunciation problems, but the output is very robotic. It does not sound natural, though it might be good for some use cases.
use_sdp = true and use_duration_discriminator = false:
This is the one I trained for 1500 epochs. The output was much more natural than the other two, but it always had pronunciation problems. It seems like the more I train, the less this problem appears, but I stopped after 1500 epochs as I couldn't see any improvement beyond that. Regardless, this is the one I would pick if I had to choose.
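For reference, here is roughly how I switch these flags between runs. The config path and the exact JSON layout are from my setup (a VITS2-style JSON config where both flags live under the "model" section), so yours may differ:

```python
import json

# Assumption: a VITS2-style JSON config with both flags under "model".
# The path is a placeholder from my experiments.
CONFIG = "configs/my_dataset.json"

with open(CONFIG) as f:
    cfg = json.load(f)

cfg["model"]["use_sdp"] = True                      # stochastic duration predictor on/off
cfg["model"]["use_duration_discriminator"] = False  # the combination I settled on

with open(CONFIG, "w") as f:
    json.dump(cfg, f, indent=2)
```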
Does anyone have any tips on how to improve the pronunciation issues? I have the same model trained with VITS1 (with sdp = true, also for around 1500 epochs), and it produces really good results in terms of sounding natural, but I can never get the audio quality to match the original clips. I'm hopeful that I'm close to getting perfect pronunciation together with the perfect audio quality I'm already getting, but I don't know what else I could try.
To be clear, it's not always the same words that are mispronounced: every time I run inference with the same input, which words come out wrong varies. Sometimes it outputs a perfect clip, but not always.
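One detail worth noting for anyone debugging this: with use_sdp = true, the duration predictor samples durations from noise at inference time, which is exactly why repeated runs over the same input differ. In VITS-style inference that noise is scaled by noise_scale_w (0.8 in the original VITS notebook); lowering it makes durations more stable at the cost of some prosodic variety. A rough sketch of how I call inference; the module names (utils, models, text, commons) follow the original VITS repo layout, and the config/checkpoint paths are placeholders from my setup:

```python
import torch
import commons, utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/my_dataset.json")  # placeholder path
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model).cuda().eval()
utils.load_checkpoint("logs/my_dataset/G_217000.pth", net_g, None)  # placeholder

seq = text_to_sequence("An example sentence to synthesize.", hps.data.text_cleaners)
if hps.data.add_blank:
    seq = commons.intersperse(seq, 0)
x = torch.LongTensor(seq).unsqueeze(0).cuda()
x_lengths = torch.LongTensor([x.size(1)]).cuda()

with torch.no_grad():
    # noise_scale_w scales the SDP's input noise; 0.8 is the usual default,
    # lower values trade expressive variety for more consistent durations.
    audio = net_g.infer(x, x_lengths, noise_scale=0.667,
                        noise_scale_w=0.3, length_scale=1.0)[0][0, 0].cpu().numpy()
```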
Again, any tips on how to improve this would be appreciated.
Unfortunately, I can't share audio samples as I don't have authorization to do so.
Thank you!