[Closed] michael-conrad closed this issue 4 years ago
[Attached plots at step 13: Predicted/forced, Predicted/generated, Target/eval]
The curves seem to be ok. You can expect the final MCD around 3 or 4.
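For reference, MCD (Mel Cepstral Distortion) is a distance in dB between target and predicted mel-cepstral coefficient frames. A minimal sketch of the usual formula, assuming the two sequences are already time-aligned and the 0th (energy) coefficient has been dropped (this is an illustration, not the exact evaluation code used in this repo):

```python
import numpy as np

def mcd(target_mcc: np.ndarray, predicted_mcc: np.ndarray) -> float:
    """MCD in dB between two aligned (frames, coeffs) MCC arrays."""
    diff = target_mcc - predicted_mcc
    # Standard MCD constant: 10 / ln(10) * sqrt(2)
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    # Per-frame Euclidean distance over coefficients, averaged over frames
    return const * float(np.mean(np.sqrt(np.sum(diff ** 2, axis=1))))

# Toy example with random "frames" standing in for real MCCs:
rng = np.random.default_rng(0)
target = rng.normal(size=(100, 13))
predicted = target + rng.normal(scale=0.1, size=(100, 13))
print(mcd(target, predicted))
```

So "final MCD around 3 or 4" means the average per-frame cepstral distance settles in that dB range.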
I am trying to replicate the generated_switching config with CSS10 + Comvoi clean. The only thing I changed is the batch size, to 100, because I am using an RTX A6000.
MCD does not go below 5. Could there be something wrong?
Hi, I would also increase the learning rate when increasing the batch size, but the MCD values are fine IMHO. Give it more time.
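One common heuristic for this is the linear scaling rule: scale the learning rate by the same factor as the batch size. A tiny sketch (the base values below are assumptions for illustration, not this repo's actual config):

```python
base_batch_size = 52   # assumed original batch size of the config
base_lr = 1e-3         # assumed original learning rate
new_batch_size = 100   # the batch size used on the RTX A6000

# Linear scaling rule: lr grows proportionally with batch size
new_lr = base_lr * new_batch_size / base_batch_size
print(f"new learning rate: {new_lr:.2e}")
```

Whether linear scaling is exactly right depends on the optimizer and schedule, but it is a reasonable starting point when roughly doubling the batch size.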
Here are my MCDs, you are interested in the last column.
1) Is this generated_switching or generated_training? Don't you have a similar graph for the eval set?
2) One unrelated question: when generating a wav file with a speaker ID that was not used in WaveRNN training, can I expect a decent result, or not at all? It seems to do something, but it is unclear whether it can reach decent quality. I am wondering whether it is just a matter of training WaveRNN on many more speakers.
Thanks for your insights, great work btw.
This is generated_training. I don't have a similar graph for the eval set, but you can expect something 3-4-ish, I think.
You can check it out in the demo notebooks. For some voices it is OK, for some it is not. I am afraid that WaveRNN needs a lot of data per speaker to sound good ... but there is a more recent, SOTA workaround. I would suggest using the pretrained vocoders from ESPnet. They are multi-speaker and sound great, but they also expect sharp spectrograms as input. To make the outputs of my model sharper, you can replace the old convolutional postnet with something fancier, like a postnet based on normalizing flows (there is an open and probably working implementation).
You can avoid the pain with vocoders for free by using ESPnet, and replacing the postnet should be a more or less copy-and-paste drop-in replacement plus the addition of one more loss term.
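To make the "one more loss term" part concrete, here is a hypothetical sketch of what the combined objective might look like: the original loss on the decoder's pre-postnet spectrogram plus a second loss on the new postnet's refined output. All names and the weight are illustrative assumptions, not this repo's actual code:

```python
import numpy as np

def l1(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute error between two mel spectrograms."""
    return float(np.mean(np.abs(a - b)))

def total_loss(target_mel: np.ndarray,
               decoder_mel: np.ndarray,
               postnet_mel: np.ndarray,
               postnet_weight: float = 1.0) -> float:
    # original term: decoder output before the postnet
    # added term: output after the (new) postnet refinement
    return l1(target_mel, decoder_mel) + postnet_weight * l1(target_mel, postnet_mel)
```

In training, both terms are backpropagated together, so the new postnet learns to sharpen the decoder's output toward the target spectrogram.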
What type of loss_total number, etc, should I be looking for to verify that things seem to be training correctly?
I'm currently at step: 3.792k, 3 hours 12 minutes, total loss 0.2972