Open neuronx1 opened 2 years ago
Hi, could you show the attention score? The generated attention does not matter; what is used for duration extraction is the ground-truth-aligned one. 900 samples is quite few for building up attention with Tacotron. What language are the samples in, and are you using phonemes? For a dataset this small, you could try pretraining a Tacotron model on a different dataset until attention has built up, and then continue training on the smaller dataset. It could also make sense to set trim_long_silences=True and vad_max_silence_length=6 or so to shorten the silent parts in the audio, which helps attention build up.
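To illustrate the idea behind limiting silence length, here is a simplified sketch in plain Python. It is not the repository's preprocessing code: the function name `cap_silence_runs` and the per-frame voiced/silent flags are hypothetical, and the sketch assumes some voice activity detector has already labelled each frame. It only shows how long silent runs can be shortened to a maximum length while voiced frames are kept.

```python
from typing import List


def cap_silence_runs(voiced: List[bool], max_silence: int) -> List[bool]:
    """Return a keep-mask over frames (hypothetical helper).

    Voiced frames are always kept. Within each run of consecutive
    silent frames, only the first `max_silence` frames are kept.
    """
    keep = []
    run = 0  # length of the current silent run
    for is_voiced in voiced:
        if is_voiced:
            run = 0
            keep.append(True)
        else:
            run += 1
            keep.append(run <= max_silence)
    return keep


# Example: one voiced frame, four silent frames, one voiced frame.
flags = [True, False, False, False, False, True]
mask = cap_silence_runs(flags, max_silence=2)
kept_flags = [f for f, k in zip(flags, mask) if k]
print(mask)
print(kept_flags)
```

With shorter pauses the audio is more densely filled with speech, so there are fewer stretches where no text corresponds to the audio.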
Hi @cschaefer26,
thanks for your great repository.
Unfortunately, I get really bad results, and I think the reason is bad alignment.
I am training the models on a German dataset containing 900 samples, each between 5 and 30 seconds long. The sampling rate is 22050 Hz, and the files are 16-bit mono. I ran your preprocessing step. My TensorBoard looks like this (as you can see, there is no alignment). What is the reason for this, and how can I solve it? I would really appreciate any help!
Thanks in advance!