Tomiinek / Multilingual_Text_to_Speech

An implementation of Tacotron 2 that supports multilingual experiments with parameter-sharing, code-switching, and voice cloning.
MIT License

How effective is my training so far? #16

Closed michael-conrad closed 4 years ago

michael-conrad commented 4 years ago

What type of loss_total number, etc, should I be looking for to verify that things seem to be training correctly?

I'm currently at step 3.792k (3 hours 12 minutes in), total loss 0.2972.

michael-conrad commented 4 years ago

[attached images]

michael-conrad commented 4 years ago

und sich damit zu schützen, daß, sobald kein wirkliches bestimmtes verbrechen feststehe, dessen man ihn beschuldige,

(German training sample: "and to protect himself by pointing out that, as long as no actual, specific crime of which he was accused could be established,")

Predicted/forced, step 13

[attached image]

Predicted/generated, step 13

[attached image]

Target/eval, step 13

[attached image]

Tomiinek commented 4 years ago

The curves seem to be OK. You can expect a final MCD of around 3 or 4.
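For anyone calibrating expectations: mel-cepstral distortion (MCD) is usually reported in dB over aligned frames. Below is a minimal numpy sketch, not this repo's evaluation code; conventions differ on DTW alignment and on whether the energy coefficient c0 is excluded (it is excluded here):

```python
import numpy as np

# Standard MCD constant: converts the Euclidean cepstral distance to dB.
MCD_CONST = (10.0 / np.log(10.0)) * np.sqrt(2.0)

def mcd(ref_mcep: np.ndarray, syn_mcep: np.ndarray) -> float:
    """Mean mel-cepstral distortion in dB between two already-aligned
    (frames x coefficients) MCEP matrices; c0 (energy) is excluded."""
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]
    return float(MCD_CONST * np.mean(np.sqrt(np.sum(diff ** 2, axis=1))))

# Identical cepstra give an MCD of 0 dB.
frames = np.random.rand(100, 25)
print(mcd(frames, frames))  # → 0.0
```

In practice the reference and synthesized utterances have different lengths, so frames are first aligned with DTW before this distance is averaged.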

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

vince62s commented 2 years ago

I am trying to replicate the generated_switching config with CSS10 + Comvoi clean. The only thing I changed is the batch size, to 100, because I am using an RTX A6000.

MCD does not go below 5; could something be wrong?

[attached image]

Tomiinek commented 2 years ago

Hi, I would also increase the learning rate when increasing the batch size, but the MCD values are fine IMHO. Give it more time.

Here are my MCDs; you are interested in the last column. [attached image]
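A common heuristic for the learning-rate adjustment mentioned above is the linear scaling rule (grow the learning rate proportionally to the batch size). A hedged sketch; the baseline values below are illustrative placeholders, not this repo's defaults:

```python
def scaled_learning_rate(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule: scale the learning rate by the same factor
    as the batch size. A heuristic starting point, not a guarantee."""
    return base_lr * new_batch / base_batch

# Hypothetical example: doubling the batch size doubles the learning rate.
print(scaled_learning_rate(1e-3, 50, 100))  # → 0.002
```

Larger batches often also benefit from a longer warmup, so this is only a first guess to tune from.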

vince62s commented 2 years ago

1) Is this generated_switching or generated_training? Don't you have a similar graph for the eval set?

2) One unrelated question: when generating a wav file with a speaker ID that was not used in WaveRNN training, can I expect a decent result, or not at all? It seems to do something, but it is unclear whether it can reach decent quality. I am wondering whether it is just a matter of training WaveRNN on many more speakers.

Thanks for your insights; great work, btw.

Tomiinek commented 2 years ago
  1. This is generated_training. I don't, but you can expect something around 3-4, I think.

  2. You can check it out in the demo notebooks. For some voices it is OK, for some it is not. I am afraid that WaveRNN needs a lot of data per speaker to sound good ... but there is a more recent, state-of-the-art workaround. I would suggest using the pretrained vocoders from espnet. They are multi-speaker and sound great, but they also expect sharp spectrograms as input. To make the outputs of my model sharper, you can replace the old convolutional postnet with something fancier, such as a postnet based on normalizing flows (here is an open and probably working implementation).

Tomiinek commented 2 years ago

You can avoid the pain with vocoders for free by using espnet, and replacing the postnet should be a more or less copy-and-paste drop-in replacement plus the addition of one more loss term.
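Schematically, the "one more loss term" is the flow postnet's negative log-likelihood added to the existing Tacotron 2 objective. A hypothetical sketch (the names and the weighting scheme are assumptions, not code from this repo):

```python
def total_loss(mel_loss: float, stop_loss: float, flow_nll: float,
               flow_weight: float = 1.0) -> float:
    """Hypothetical combined objective: the usual Tacotron 2 terms
    (mel reconstruction + stop-token loss) plus a weighted negative
    log-likelihood term from a flow-based postnet."""
    return mel_loss + stop_loss + flow_weight * flow_nll

# Illustrative values only.
print(total_loss(mel_loss=1.0, stop_loss=0.5, flow_nll=2.0))  # → 3.5
```

The flow weight would typically be tuned so the NLL term neither dominates nor vanishes relative to the reconstruction losses.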