NVIDIA / waveglow

A Flow-based Generative Network for Speech Synthesis
BSD 3-Clause "New" or "Revised" License

How to train with GTA mels/audios #85

Closed: alexdemartos closed this issue 5 years ago

alexdemartos commented 5 years ago

I am wondering how I would train this model using GTA mels generated by a Tacotron 2 model. Any advice? Thank you!

Yeongtae commented 5 years ago

Check my forked repositories. I've already implemented it:

https://github.com/Yeongtae/tacotron2/blob/master/GTA.py
https://github.com/Yeongtae/waveglow/blob/master/mel2samp.py
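
For readers who want the gist without digging through the fork: the mel2samp.py change amounts to pairing each ground-truth waveform with a precomputed GTA mel, instead of computing the mel from the audio on the fly. Below is a minimal, hedged sketch of that idea; the .npy naming scheme, segment size, and hop length are illustrative assumptions, not Yeongtae's exact code.

    # Sketch: load a (GTA mel, ground-truth audio) pair for WaveGlow training.
    # Assumes each "foo.wav" has a matching "foo.npy" mel of shape (n_mel, T)
    # saved by a GTA synthesis step; all names here are hypothetical.
    import numpy as np
    import torch
    from scipy.io.wavfile import read

    MAX_WAV_VALUE = 32768.0

    def load_gta_pair(wav_path, mel_path, segment_frames=80, hop_length=256):
        """Return an aligned (mel, audio) segment for one training example."""
        _, audio = read(wav_path)
        audio = torch.from_numpy(audio.astype(np.float32)) / MAX_WAV_VALUE
        mel = torch.from_numpy(np.load(mel_path))  # (n_mel, T)
        # Crop a random segment, keeping mel frames and samples aligned.
        max_start = max(mel.size(1) - segment_frames, 1)
        start = torch.randint(0, max_start, (1,)).item()
        mel = mel[:, start:start + segment_frames]
        audio = audio[start * hop_length:(start + segment_frames) * hop_length]
        return mel, audio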

feddybear commented 5 years ago

@Yeongtae Thanks for making this easier. I understand that WaveGlow's authors think that training with GTA isn't necessary, but after reading the Tacotron 2 paper, it was natural to ask for this. I think it might work better for low-resource training datasets.

So I tried using your waveglow repo's GTA.py to generate the aligned mel outputs from my trained Tacotron 2 model. When I trained my text-to-mel model, I used exactly the same parameters as in your version of hparams.py, except for the batch size (because of GPU memory limitations). However, I am getting this error:

Traceback (most recent call last):
  File "GTA.py", line 184, in <module>
    GTA_Synthesis(args.output_directory, args.checkpoint_path, args.n_gpus, args.rank, args.group_name, hparams)
  File "GTA.py", line 139, in GTA_Synthesis
    _, mel_outputs_postnet, _, _ = model(x)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/model.py", line 526, in forward
    output_lengths)
  File "/model.py", line 502, in parse_output
    outputs[0].data.masked_fill_(mask, 0.0)
RuntimeError: The expanded size of the tensor (1562) must match the existing size (1225) at non-singleton dimension 2.  Target sizes: [4, 80, 1562].  Tensor sizes: [4, 80, 1225]

Where can I fix this dimension mismatch? Many thanks in advance!
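
For context while reading the replies below: the failure is in Tacotron 2's parse_output, where a mask built from output_lengths (max 1562 frames here) is applied to a decoder output that has only 1225 frames. The following is a hypothetical diagnostic, not a patch to either repo, that makes the mismatch visible before the masked_fill_ call:

    # Hypothetical check: compare the decoder's frame count against the
    # lengths the mask will be built from. A mismatch usually means the
    # audio/STFT hyperparameters no longer match the checkpoint.
    def check_mel_mask(mel_outputs_postnet, output_lengths):
        n_frames = mel_outputs_postnet.size(2)       # 1225 in the traceback
        expected = int(output_lengths.max().item())  # 1562 in the traceback
        if n_frames != expected:
            raise RuntimeError(
                f"decoder produced {n_frames} frames but lengths imply "
                f"{expected}; check hop_length, filter_length, sampling_rate "
                "and n_frames_per_step in hparams.py against the checkpoint, "
                "and how the batch is split under DataParallel")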

Yeongtae commented 5 years ago

@feddybear My guess is that it will be solved if you use your own hparams.py for the text-to-mel model, or if you attach GTA.py to the original repo you used for the text-to-mel model.

I'm not using GTA.py myself, for the following reason.

Why do we use GTA synthesis in the first place?

There is a more suitable solution: normalizing the log mel-spectrograms, e.g., to the range [-4, 4].

Here is my implementation of normalization to [-4, 4]:

https://github.com/Yeongtae/tacotron2/blob/master/layers.py
https://github.com/Yeongtae/tacotron2/blob/master/audio_processing.py
https://github.com/Yeongtae/tacotron2/blob/master/inference.py

I referred to the code of https://github.com/Rayhane-mamah/Tacotron-2
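
For reference, here is a minimal sketch of that [-4, 4] normalization in the style of Rayhane-mamah's Tacotron-2. The constants are that repo's defaults, stated here as assumptions rather than values confirmed in this thread; the vocoder would then have to be trained on (and fed) mels passed through the same transform.

    # Sketch of symmetric log-mel normalization to [-4, 4], following the
    # approach in Rayhane-mamah/Tacotron-2 (constants are assumptions).
    import numpy as np

    MIN_LEVEL_DB = -100.0   # assumed floor of the dB-scale mel
    MAX_ABS_VALUE = 4.0     # set to 1.0 for a [-1, 1] range instead

    def normalize(mel_db):
        """Map a dB-scale mel from [MIN_LEVEL_DB, 0] to [-4, 4], clipped."""
        scaled = (mel_db - MIN_LEVEL_DB) / -MIN_LEVEL_DB          # -> [0, 1]
        return np.clip(2 * MAX_ABS_VALUE * scaled - MAX_ABS_VALUE,
                       -MAX_ABS_VALUE, MAX_ABS_VALUE)

    def denormalize(mel_norm):
        """Invert normalize() before a vocoder trained on raw dB mels."""
        scaled = (np.clip(mel_norm, -MAX_ABS_VALUE, MAX_ABS_VALUE)
                  + MAX_ABS_VALUE) / (2 * MAX_ABS_VALUE)          # -> [0, 1]
        return scaled * -MIN_LEVEL_DB + MIN_LEVEL_DB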

feddybear commented 5 years ago

@Yeongtae thank you for your response!

Regarding GTA, I actually cloned your repo and used it instead of the original, so there's no difference between your settings and mine except for the batch size and the number of symbols (since I'm working on a different language). I will stop pursuing the GTA training direction, then.

Anyway, thank you for recommending mel-spectrogram normalization instead. I assume we also need to train WaveGlow on these normalized mel spectra?

Yeongtae commented 5 years ago

@feddybear

I haven't checked the results after using GTA.py, since I'm not currently using this code.

GTA.py is just inference code that generates GTA mels from the training data using a trained text-to-mel model.
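
Concretely, teacher-forced GTA synthesis looks roughly like the sketch below. parse_batch and the four-tuple returned by forward follow NVIDIA's Tacotron 2; the loop structure and the per-batch .npy saving scheme are illustrative assumptions, not the exact GTA.py.

    # Sketch: run the trained model in teacher-forced mode over the
    # *training* data and save the post-net mels. Because the decoder is
    # conditioned on the true mels, the outputs stay time-aligned with
    # the ground-truth audio.
    import numpy as np
    import torch

    @torch.no_grad()
    def gta_synthesis(model, train_loader, out_dir):
        model.eval()
        for i, batch in enumerate(train_loader):
            x, _ = model.parse_batch(batch)   # x includes the true mels
            _, mel_postnet, _, _ = model(x)   # teacher-forced forward pass
            np.save(f"{out_dir}/gta_batch{i}.npy", mel_postnet.cpu().numpy())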

Yeongtae commented 5 years ago

@feddybear

[screenshot] I just tested the GTA.py code. There is no problem in my case.

Check the detailed dimensions of your model.

I'm sorry I cannot help you further.

sharathadavanne commented 4 years ago

Hi @Yeongtae, regarding your comment on normalizing the mel features to the [-4, 4] range: I was wondering, why this range, and why not the usual zero-mean, unit-variance style range of [-1, 1]? Any insight would be helpful.

Yeongtae commented 4 years ago

My guess is that [-1, 1] is also fine. I use [-4, 4] because Rayhane-mamah uses this range in https://github.com/Rayhane-mamah/Tacotron-2.

sharathadavanne commented 4 years ago

Thanks for your quick comment @Yeongtae

begeekmyfriend commented 4 years ago

Here is my own implementation: https://github.com/begeekmyfriend/tacotron2/commit/0357bf763ec8e5aebaffa38ae7ef4adf997ed237

Just type

bash scripts/gta_synth.sh

It also supports a reduction factor, and the preprocessing is compatible with Rayhane Mamah's version.

tshmak commented 3 years ago

@Yeongtae, I wonder what made you decide to implement training with GTA mels, and why @rafaelvalle did not think it necessary? I tried generating audio with NVIDIA's Tacotron 2 using the pretrained models, and the sound quality was great. However, when I trained the pretrained Tacotron 2 model on the LJSpeech data for another few epochs or so, the quality deteriorated. My hypothesis is that this happens because the WaveGlow model is no longer fine-tuned on the exact Tacotron model used for inference. However, it seems the pretrained WaveGlow model was not trained on generated aligned mels either. What are your thoughts?

Yeongtae commented 3 years ago

The mismatch problem is one of the most important topics in vocoder research, so recent vocoder work adopts a fine-tuning strategy using synthesized mels.

However, even without these efforts, HiFi-GAN produces good-quality synthesized speech.