Closed BoneGoat closed 3 years ago
It happened to me too, and I found that the default Tacotron2 config was broken. With no fmax specified in the config, the highest frequency of the mel spectrograms produced by the preprocessor defaults to sample_rate / 2: https://github.com/NVIDIA/NeMo/blob/10ddab6ff48a5afb389a57604bff6c6681b3257f/nemo/collections/asr/parts/features.py#L292

So you actually trained the model to generate mel-scaled spectrograms with frequencies ranging from 0 to 11025 Hz. When you then synthesize audio with a vocoder that expects these features to span 0-8000 Hz, you get that pitch-shift effect. This appears to be resolved by #1959.
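To see why the mismatch shows up as a pitch shift, here is a rough sketch of the two filterbanks' band centers. It uses the HTK mel formula in plain NumPy as an approximation (NeMo's preprocessor actually builds its filterbank via librosa, whose default mel scale differs slightly), so the exact frequencies are illustrative, not what NeMo computes:

```python
import numpy as np

def hz_to_mel(f):
    # HTK mel scale: m = 2595 * log10(1 + f/700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_centers(fmin, fmax, n_mels):
    # Band centers spaced uniformly on the mel scale between fmin and fmax
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels)
    return mel_to_hz(mels)

sample_rate = 22050
n_mels = 80  # typical Tacotron2 setting

# Training config left fmax unset, so it defaulted to sample_rate / 2 = 11025 Hz
trained = mel_band_centers(0.0, sample_rate / 2, n_mels)

# The pretrained vocoder assumes bands spanning 0-8000 Hz
expected = mel_band_centers(0.0, 8000.0, n_mels)

# The vocoder interprets band i as sitting at expected[i] Hz, but the
# model actually placed it at trained[i] Hz (higher), so every harmonic
# is rendered lower than it should be -- the pitch-shift effect.
print(trained[-1], expected[-1])  # top band: 11025 Hz vs 8000 Hz
```

This is only a model of the mismatch; the real fix is setting fmax in the preprocessor config so training and vocoder agree.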
I trained Tacotron2 with my own dataset in Swedish. The alignment looks terrible, but it produces very good pronunciation. The issue I'm having is that the inference results are very low-pitched. The source is a male voice with a low pitch, but not as low as the inference results. I'm using the default config, only setting trim_silence to true and updating the labels to match my training data. The training audio is 22 kHz.
Is there something I can do to make the inference results sound more like the source?
Eval alignment:
Infer with the default WaveGlow model: https://soundcloud.com/user-839318192/nemo-waveglow?in=user-839318192/sets/nemo-tacotron2-swedish

Infer with Griffin-Lim: https://soundcloud.com/user-839318192/nemo-griffinlim-norm?in=user-839318192/sets/nemo-tacotron2-swedish