MasayaKawamura / MB-iSTFT-VITS

Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform
Apache License 2.0

Any Suggestions to Get Best Training Results With Small Datasets? #6

Open Ahmer-444 opened 1 year ago

Ahmer-444 commented 1 year ago

Hi @MasayaKawamura !

Thanks for sharing your work with the community, really appreciate it.

I gave it a try (single-speaker settings) with around 2 hrs of data. I was able to get some results after around 25K steps and kept it running up to around 75K steps, but there still isn't much improvement (issues like mispronunciation and some background noise).

Moreover, it's able to generate some sentences from the training set well, but fails on others from the training or test set.

Any suggestions in terms of datasets/settings or way to move forward?

Some samples can be found here. Thanks

MasayaKawamura commented 1 year ago

Hi @Ahmer-444, sorry for the late reply, and thank you for your question! I have some concerns.

Ahmer-444 commented 1 year ago

Thanks @MasayaKawamura for your response.

I was able to train and fix the above problems, and I'm now working with around 4 hrs of audio data. The synthesized voice sounds pretty cool but still has a slightly robotic undertone. Do you have any suggestions to improve it?

You can listen to the synthesized voice here.

The attention and loss graphs look like this:

(Images attached: attention graph and loss_audio plot.)

MasayaKawamura commented 1 year ago

If the model uses the deterministic duration predictor, how about changing it to the stochastic duration predictor? You can change the duration predictor by modifying this line.
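For context, in VITS-style code (which MB-iSTFT-VITS builds on), the choice between the two predictors is typically controlled by a `use_sdp` flag in the model section of the training config JSON. A minimal sketch, assuming this repo keeps that VITS convention (check the actual config shipped with the repo, e.g. the files under `configs/`, for the exact key name):

```json
{
  "model": {
    "use_sdp": true
  }
}
```

With `"use_sdp": true`, training uses the stochastic duration predictor; setting it to `false` falls back to the deterministic one. Note that switching the flag changes the model architecture, so a checkpoint trained with one setting generally cannot be resumed with the other.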