NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech).
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Tacotron2 produces very deep voice #899

Closed · BoneGoat closed this 3 years ago

BoneGoat commented 4 years ago

I trained Tacotron2 on my own dataset in Swedish. The alignment looks terrible, but the model produces very good pronunciation. The issue I'm having is that the inference output is very low-pitched. The source is a male voice with a low pitch, but not as low as the inference results. I'm using the default config, only changing trim_silence to true and updating the labels to match my training data. The training audio is 22 kHz.

Is there something I can do to make the inference results sound more like the source?

Eval alignment:

[Image: nemo-eval-align]

Infer with the default WaveGlow model: https://soundcloud.com/user-839318192/nemo-waveglow?in=user-839318192/sets/nemo-tacotron2-swedish

Infer with Griffin-Lim: https://soundcloud.com/user-839318192/nemo-griffinlim-norm?in=user-839318192/sets/nemo-tacotron2-swedish

hubertsiuzdak commented 3 years ago

This happened to me too, and I found out the default Tacotron2 config was broken. With no fmax specified in the config, the highest frequency of the mel spectrograms produced by the preprocessor defaults to sample_rate / 2: https://github.com/NVIDIA/NeMo/blob/10ddab6ff48a5afb389a57604bff6c6681b3257f/nemo/collections/asr/parts/features.py#L292

So you actually trained the model to generate mel-scaled spectrograms covering frequencies from 0 to 11025 Hz. When you then synthesize audio with a vocoder that expects these features to span 0-8000 Hz, you get that pitch-shift effect. This seems to be resolved by #1959.
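The mismatch can be illustrated with a small mel-scale sketch. This is a pure-Python illustration, not NeMo code: it uses the HTK mel formula and hypothetical band counts, whereas NeMo's preprocessor uses librosa's Slaney-style filterbank by default, but the direction and rough size of the shift come out the same either way.

```python
import math

def hz_to_mel(f_hz):
    # HTK-style mel scale (an assumption for this sketch;
    # NeMo/librosa default to the slightly different Slaney variant).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_centers(n_mels, fmin, fmax):
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    return [mel_to_hz(lo + (hi - lo) * (i + 1) / (n_mels + 1))
            for i in range(n_mels)]

# Acoustic model trained with fmax unset -> fmax = sample_rate / 2 = 11025 Hz
trained = mel_band_centers(80, 0.0, 11025.0)
# Vocoder trained with fmax = 8000 Hz reads the same 80 bins differently
expected = mel_band_centers(80, 0.0, 8000.0)

# Every mel bin the vocoder reads maps to a lower physical frequency than
# the one the acoustic model wrote into it, so synthesized speech comes
# out pitch-shifted downward.
ratio = expected[40] / trained[40]
print(f"bin 40: written at {trained[40]:.0f} Hz, "
      f"read as {expected[40]:.0f} Hz (~{ratio:.2f}x lower)")
```

Setting fmax in the Tacotron2 preprocessor config to match the vocoder's training range (8000 Hz for the pretrained WaveGlow features described above) makes the two ends of the pipeline agree on what each mel bin means.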