Ahmer-444 opened this issue 1 year ago
Hi @Ahmer-444, sorry for the late reply, and thank you for your question! I have some concerns.
Thanks @MasayaKawamura for your response.
I was able to train and fix the above problems, and now I'm working with around 4 hours of audio data. The synthesized voice sounds pretty cool but still has a slightly robotic undertone. Do you have any suggestions to improve it?
You can listen to the synthesized voice here.
The attention and loss graphs look like this:
If the model uses a deterministic duration predictor, how about changing it to a stochastic duration predictor? You can switch the duration predictor by modifying this line.
Hi @MasayaKawamura !
Thanks for sharing your work with the community, really appreciate it.
I gave it a try (single-speaker setting) with around 2 hours of data. I was able to get some results after around 25K steps and kept it running up to around 75K steps, but there still isn't much improvement (issues like mispronunciation and some background noise).
Moreover, it generates some sentences from the training set well, but fails on others from the training or test set.
Any suggestions in terms of datasets/settings, or a way to move forward?
Some samples can be found here. Thanks