NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech).
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0
12.28k stars · 2.55k forks

Tune frame length while training Tacotron? #2782

Closed · Choiuijin1125 closed this issue 3 years ago

Choiuijin1125 commented 3 years ago

Thank you for building a great open source project.

I'm trying to train Tacotron on a Korean dataset.

While training, I noticed that the frame padding length changes depending on max_duration. The images below show the val_target mel spectrogram with max_duration above 3.5 s and below 3.5 s.

[image]

I think this makes valid_prediction unstable: as in the images below, Tacotron generates noise at the end of the frames.

[image]

Is there any way to tune this problem away, or is it normal behavior for Tacotron?
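The dependence of padded length on max_duration follows from frame arithmetic: the mel frame count of a clip grows linearly with its duration, so one long clip stretches the padded length of its whole batch. A minimal sketch, assuming a 22050 Hz sample rate and a hop length of 256 (both are assumptions, not values taken from this issue):

```python
# Sketch (assumed parameters): mel frame count scales linearly with clip
# duration, so raising max_duration raises the padded length of every
# batch that contains a long clip.
def n_mel_frames(duration_s, sample_rate=22050, hop_length=256):
    """Approximate mel-spectrogram frame count for a clip of duration_s seconds."""
    return int(duration_s * sample_rate / hop_length) + 1

short = n_mel_frames(3.5)   # clips capped at 3.5 s
long = n_mel_frames(10.0)   # a longer max_duration
# Every clip in a batch is padded up to the longest clip's frame count,
# so each short clip gains (long - short) frames of silence padding.
```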

Choiuijin1125 commented 3 years ago

I also suspect this issue comes from my dataset. I'm not using scripted data but lecture data (lecture audio → QuartzNet → split audio at punctuation → remove noisy clips → TTS). Even though the dataset is somewhat low quality, it works very well except at the end of the frames: the model can't converge at the end of a sentence and synthesizes noise.
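One common way to reduce the padding spread described above is to filter the training manifest by clip duration. A minimal sketch, assuming a NeMo-style JSON-lines manifest with "audio_filepath", "duration", and "text" fields (the bounds and field names here are illustrative assumptions):

```python
import json

# Hedged sketch: filter a NeMo-style JSON-lines manifest so that batches
# contain clips of similar length, reducing the amount of trailing padding.
# The duration bounds are assumptions; tune them for your dataset.
def filter_manifest(lines, min_duration=0.5, max_duration=3.5):
    """Keep only manifest entries whose duration lies inside the bounds."""
    kept = []
    for line in lines:
        entry = json.loads(line)
        if min_duration <= entry["duration"] <= max_duration:
            kept.append(entry)
    return kept

sample = [
    json.dumps({"audio_filepath": "a.wav", "duration": 2.1, "text": "..."}),
    json.dumps({"audio_filepath": "b.wav", "duration": 7.9, "text": "..."}),
]
print([e["audio_filepath"] for e in filter_manifest(sample)])  # ['a.wav']
```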

Choiuijin1125 commented 3 years ago

I just found that during training every sample is padded to the length of the longest mel: https://github.com/NVIDIA/tacotron2/issues/356.

Then how can I tune training to fix the end of the frames, as in the image below?

[image]
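At inference time, the noisy padded tail can also be cut off using the model's stop-token output. Tacotron 2 predicts a per-frame gate (stop-token) probability alongside the mel frames; truncating at the first frame whose gate crosses a threshold discards the tail. A minimal sketch, assuming the gate values are already sigmoided probabilities and using an assumed threshold of 0.5:

```python
import numpy as np

# Hedged sketch: truncate a predicted mel at the first frame where the
# gate (stop-token) probability crosses the threshold, dropping the
# padded tail where end-of-frame noise tends to appear.
# The 0.5 threshold is an assumption; tune it for your model.
def trim_by_gate(mel, gate, threshold=0.5):
    """Cut mel of shape (n_mels, T) at the first frame with gate >= threshold."""
    hits = gate >= threshold
    stop = int(np.argmax(hits)) if np.any(hits) else len(gate)
    return mel[:, :stop]

mel = np.random.randn(80, 6)
gate = np.array([0.0, 0.1, 0.2, 0.9, 0.95, 0.99])
print(trim_by_gate(mel, gate).shape)  # (80, 3)
```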

p0p4k commented 2 years ago

@Choiuijin1125 I have a Korean dataset too; could you guide me on how to train from scratch? Thanks!