NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech).
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Tacotron2 produces very deep voice #899

Closed · BoneGoat closed this 3 years ago

BoneGoat commented 4 years ago

I trained Tacotron2 on my own dataset in Swedish. The alignment looks terrible, but the model produces very good pronunciation. The issue I'm having is that the inference output is very low-pitched. The source is a male voice with a low pitch, but not as low as the inference results. I'm using the default config, only changing trim_silence to true and updating the labels to match my training data. The training audio is 22 kHz.

Is there something I can do to make the inference results sound more like the source?

Eval alignment:

[Image: nemo-eval-align]

Infer with the default WaveGlow model: https://soundcloud.com/user-839318192/nemo-waveglow?in=user-839318192/sets/nemo-tacotron2-swedish

Infer with Griffin-Lim: https://soundcloud.com/user-839318192/nemo-griffinlim-norm?in=user-839318192/sets/nemo-tacotron2-swedish

hubertsiuzdak commented 3 years ago

This happened to me too, and I found out the default Tacotron2 config was broken. With no fmax specified in the config, the highest frequency of the mel spectrograms produced by the preprocessor defaults to sample_rate / 2: https://github.com/NVIDIA/NeMo/blob/10ddab6ff48a5afb389a57604bff6c6681b3257f/nemo/collections/asr/parts/features.py#L292

So you actually trained the model to generate mel-scaled spectrograms covering frequencies from 0 to 11025 Hz. When you then synthesize audio with a vocoder that expects these features to span 0-8000 Hz, you get that pitch-shift effect. This seems to be resolved by #1959.
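The mismatch can be illustrated with a small mel-scale sketch. This is a pure-Python illustration, not NeMo code: it uses the HTK mel formula and hypothetical band counts, whereas NeMo's preprocessor uses librosa's Slaney-style filterbank by default, but the direction and rough size of the shift come out the same either way.

```python
import math

def hz_to_mel(f_hz):
    # HTK-style mel scale (an assumption for this sketch;
    # NeMo/librosa default to the slightly different Slaney variant).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_centers(n_mels, fmin, fmax):
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    return [mel_to_hz(lo + (hi - lo) * (i + 1) / (n_mels + 1))
            for i in range(n_mels)]

# Acoustic model trained with fmax unset -> fmax = sample_rate / 2 = 11025 Hz
trained = mel_band_centers(80, 0.0, 11025.0)
# Vocoder trained with fmax = 8000 Hz reads the same 80 bins differently
expected = mel_band_centers(80, 0.0, 8000.0)

# Every mel bin the vocoder reads maps to a lower physical frequency than
# the one the acoustic model wrote into it, so synthesized speech comes
# out pitch-shifted downward.
ratio = expected[40] / trained[40]
print(f"bin 40: written at {trained[40]:.0f} Hz, "
      f"read as {expected[40]:.0f} Hz (~{ratio:.2f}x lower)")
```

Setting fmax in the Tacotron2 preprocessor config to match the vocoder's training range (8000 Hz for the pretrained WaveGlow features described above) makes the two ends of the pipeline agree on what each mel bin means.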