Choiuijin1125 closed this issue 3 years ago
I also suspect my dataset is the cause of this issue. I'm using lecture data rather than scripted data (lecture audio --> QuartzNet --> split audio by punctuation --> remove noisy audio --> TTS). Even though the dataset is somewhat low quality, it works very well except at the end of a frame: the model can't converge at the end of the sentence and synthesizes noise.
I just found that in training every sample is padded to the length of the largest mel. https://github.com/NVIDIA/tacotron2/issues/356.
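To make the padding concrete, here is a minimal sketch of padding a batch to its longest mel, plus a length mask that keeps the padded frames out of the loss. It is plain Python for illustration only, not code from Tacotron or any particular library; `pad_batch`, `length_mask`, and `masked_mse` are hypothetical names, and a mel is assumed to be a list of frames with `n_mels` values each.

```python
from typing import List, Tuple

Mel = List[List[float]]  # shape: frames x n_mels (assumed layout)


def pad_batch(mels: List[Mel], pad_value: float = 0.0) -> Tuple[List[Mel], List[int]]:
    """Pad every mel to the longest one in the batch; also return true lengths."""
    lengths = [len(m) for m in mels]
    max_len = max(lengths)
    n_mels = len(mels[0][0])
    padded = [
        [list(frame) for frame in m]
        + [[pad_value] * n_mels for _ in range(max_len - len(m))]
        for m in mels
    ]
    return padded, lengths


def length_mask(lengths: List[int], max_len: int) -> List[List[bool]]:
    """True for real frames, False for padded frames."""
    return [[t < n for t in range(max_len)] for n in lengths]


def masked_mse(pred: List[Mel], target: List[Mel], lengths: List[int]) -> float:
    """Mean squared error over real frames only; padded frames are ignored."""
    total, count = 0.0, 0
    for p, t, n in zip(pred, target, lengths):
        for frame_p, frame_t in zip(p[:n], t[:n]):
            for a, b in zip(frame_p, frame_t):
                total += (a - b) ** 2
                count += 1
    return total / count
```

With a mask like this, whatever the model predicts in the padded region does not contribute to the loss. Whether a given Tacotron implementation masks its loss this way is something to check in its own code.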
So how can I tune the model to fix the end of the frame, as shown in the image below?
@Choiuijin1125 I have a Korean dataset too. Can you guide me on how to train from scratch? Thanks!
Thank you for building a great open source project.
I'm trying to train Tacotron using Korean datasets.
While training, I noticed that the frame padding length changes depending on `max_duration`. The images below show the `val_target` mel spectrogram where `max_duration` is over 3.5 sec and below 3.5 sec. I think this makes `valid_prediction` unstable: as in the images below, Tacotron generates some noise at the end of a frame. Is there any way to tune this? Or is it normal behavior for Tacotron?