NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

Check Tacotron output w/o trained vocoder #158

Closed BartekRoszak closed 5 years ago

BartekRoszak commented 5 years ago

I am training a Tacotron model on a custom dataset. The inference.py script lets me check how well the model is doing at the moment, but it requires a WaveGlow model to create the waveform, and I don't have the compute to train two models in parallel (Tacotron & WaveGlow). So right now I cannot check how well Tacotron is doing because I cannot create a waveform. Is there any option to create a waveform directly from Tacotron, without WaveGlow?

tugstugi commented 5 years ago

You can use the pretrained LJSpeech waveglow for any language. It will even work for a male voice.
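The suggestion above in code form: a minimal sketch of feeding a Tacotron 2 mel output into a pretrained WaveGlow's infer method, loosely following the repo's inference notebook. The DummyWaveGlow stand-in is an assumption used here only so the sketch runs without the real checkpoint; the commented torch.load line shows the usual way the real model would be loaded.

```python
import torch

# Stand-in so this sketch runs without a checkpoint. With the real model
# you would load it roughly like this (path is hypothetical):
#   waveglow = torch.load("waveglow_ljs.pt", map_location="cpu")["model"].eval()
class DummyWaveGlow(torch.nn.Module):
    hop_length = 256  # STFT hop in the default Tacotron 2 hparams

    def infer(self, mel, sigma=0.666):
        # The real WaveGlow upsamples the mel to a waveform; here we just
        # return noise of the matching length to show the tensor shapes.
        batch, n_mels, frames = mel.shape
        return sigma * torch.randn(batch, frames * self.hop_length)

waveglow = DummyWaveGlow()

# mel_outputs_postnet from Tacotron 2 inference: [batch, 80 mel bins, frames]
mel = torch.randn(1, 80, 120)
with torch.no_grad():
    audio = waveglow.infer(mel, sigma=0.666)
print(audio.shape)  # one waveform of frames * hop_length samples
```

Because WaveGlow only sees the mel spectrogram, a vocoder trained on LJSpeech can be driven by mels from a Tacotron trained on another speaker or language, which is why the cross-language trick above works at all.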

tugstugi commented 5 years ago

Here is a synthesized example for a Mongolian male voice using the LJSpeech trained waveglow:

10k_mongolian_ljspeech_vocoder.zip

BartekRoszak commented 5 years ago

Unfortunately, I have a female voice.

tugstugi commented 5 years ago

Then it will work even better. LJSpeech is a female voice.

BartekRoszak commented 5 years ago

It does not sound well. You can hear the real voice in the background, but the noise is awful. I feed it the mel-spectrogram created directly from the wav file by the get_mel method in TextMelLoader, using the default hparams.

I'm attaching the original wav and the preprocessed mel-spectrogram after the pretrained WaveGlow.

original.wav.zip pretrained_waveglow.wav.zip

delgerdalai commented 5 years ago

Your wav file is 32-bit float. You have to change the normalization code.

(screenshot of the suggested normalization change attached)

generated_pretrained_LJS.wav.zip was generated by the pretrained LJS WaveGlow. It sounds pretty good.
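The bug discussed above can be sketched as follows: the loader divides samples by max_wav_value, which is correct for 16-bit integer wavs but wrong for 32-bit float wavs that are already in [-1, 1]. A minimal dtype-aware normalization (the helper name is mine, not the repo's):

```python
import numpy as np

def normalize_audio(audio: np.ndarray, max_wav_value: float = 32768.0) -> np.ndarray:
    """Scale raw wav samples into [-1, 1].

    int16 wavs need division by max_wav_value; float32 wavs are usually
    already in [-1, 1] and must NOT be divided again (doing so makes the
    signal ~32768x too quiet, which matches the noisy output reported above).
    """
    if audio.dtype == np.int16:
        return audio.astype(np.float32) / max_wav_value
    if audio.dtype == np.float32:
        return np.clip(audio, -1.0, 1.0)
    raise ValueError(f"unsupported wav dtype: {audio.dtype}")

# int16 input: half-scale sample maps to 0.5, full negative scale to -1.0
int_audio = np.array([0, 16384, -32768], dtype=np.int16)
print(normalize_audio(int_audio))   # 0.0, 0.5, -1.0

# float32 input in range: passed through unchanged
f_audio = np.array([0.0, 0.25, -0.9], dtype=np.float32)
print(normalize_audio(f_audio))
```

An alternative fix is simply to convert the source wavs to 16-bit PCM up front so the repo's default loader is correct as-is.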

BartekRoszak commented 5 years ago

@delgerdalai Thanks! It sounds much better. But won't it be a problem that each audio file is normalized by a different value? Or should I use some constant value that fits the whole dataset?

delgerdalai commented 5 years ago

I think the goal is to convert the audio into the [-1, 1] range.

I don't know the 32-bit float wav format well. Maybe you can find the maximum value over the whole dataset and use it as a constant.

Or just audio_norm = audio / audio.max() might be better.

For a 16-bit int wav file:

audio_norm = audio / (16-bit integer maximum). The LJS dataset wavs are 16-bit integer, therefore hparams.max_wav_value is 32768.
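The two options mentioned above trade off differently, which bears on the question about per-file values. A short sketch (function names are mine) comparing per-file peak normalization against a single dataset-wide constant:

```python
import numpy as np

def norm_per_file(audio: np.ndarray) -> np.ndarray:
    """Divide each clip by its own peak: always lands exactly in [-1, 1],
    but discards relative-loudness differences between files."""
    peak = np.abs(audio).max()
    return audio / peak if peak > 0 else audio.astype(np.float32)

def norm_constant(audio: np.ndarray, max_val: float = 32768.0) -> np.ndarray:
    """Divide every clip by one dataset-wide constant (32768 for 16-bit
    PCM, matching hparams.max_wav_value): preserves relative loudness."""
    return audio.astype(np.float32) / max_val

quiet = np.array([0, 3277, -3277], dtype=np.int16)   # ~10% of full scale
print(norm_per_file(quiet).max())    # 1.0  — loudness information lost
print(norm_constant(quiet).max())    # ~0.1 — loudness preserved
```

Per-file peaks make a quiet recording and a loud one look identical to the model, so a single constant (a format maximum or a dataset-wide maximum) is usually the safer choice for training data.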

BartekRoszak commented 5 years ago

Thank you. It looks like my files are already normalized and the values are in the [-1, 1] range, so no normalization is needed in my case :)