as-ideas / ForwardTacotron

⏩ Generating speech in a single forward pass without any attention!
https://as-ideas.github.io/ForwardTacotron/
MIT License

Bad Alignment #69

Open neuronx1 opened 2 years ago

neuronx1 commented 2 years ago

Hi @cschaefer26,

thanks for your great repository.

Unfortunately, I am getting really bad results, and I think the cause is bad alignment.

I am training the models on a German dataset of 900 samples, each between 5 and 30 seconds long. The audio is 16-bit mono at a sampling rate of 22050 Hz. I ran your preprocessing step. My TensorBoard looks like this (as you can see, there is no alignment): [image] [image] What is the reason for this, and how can I solve it? I would really appreciate any help!

Thanks in advance!

cschaefer26 commented 2 years ago

Hi, could you show the attention score? The generated attention does not matter; what is used for duration extraction is the ground-truth-aligned attention. 900 samples is quite few for building up attention with Tacotron. What language are the samples in, and are you using phonemes? For a dataset this small, one could try pretraining a Tacotron model on a different dataset until attention is built up, and then continue training on the smaller dataset. It could also make sense to set trim_long_silences=True and vad_max_silence_length=6 or so to shorten the silent parts in the audio, which helps attention build up.
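
To give a rough idea of what shortening long silences means in practice, here is a minimal sketch in plain Python. It is not the repository's preprocessing code: the function name, the 30 ms window, and the RMS threshold are assumptions made for illustration, and a simple energy check stands in for a real voice activity detector. Each silent stretch is cut down to at most max_silence_windows windows, which is roughly the role a setting like vad_max_silence_length=6 would play.

```python
import math


def trim_long_silences(samples, sample_rate, window_ms=30,
                       max_silence_windows=6, rms_threshold=0.01):
    """Shorten every silent run to at most `max_silence_windows` windows.

    Hypothetical helper: a window counts as silent when its RMS energy
    is below `rms_threshold` (an assumed value, not a repository default).
    """
    window = max(1, sample_rate * window_ms // 1000)
    kept = []
    silent_run = 0  # number of consecutive silent windows seen so far
    for start in range(0, len(samples), window):
        frame = samples[start:start + window]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= rms_threshold:
            silent_run = 0
            kept.extend(frame)
        else:
            silent_run += 1
            # Keep only the first few windows of a silent run.
            if silent_run <= max_silence_windows:
                kept.extend(frame)
    return kept


def demo():
    """Build tone + long pause + tone and return (original, trimmed) lengths."""
    sr = 16000
    window = sr * 30 // 1000  # 480 samples per 30 ms window
    tone = [0.5 * math.sin(2 * math.pi * 220 * n / sr)
            for n in range(20 * window)]   # 20 voiced windows
    silence = [0.0] * (30 * window)        # 30 silent windows
    audio = tone + silence + tone
    return len(audio), len(trim_long_silences(audio, sr))


if __name__ == "__main__":
    original, trimmed = demo()
    print(f"{original} samples before, {trimmed} samples after trimming")
```

In the demo, the 30-window pause between the two tones is reduced to 6 windows while the voiced parts are left untouched. Shorter pauses give the attention mechanism fewer frames that correspond to no input symbol.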