Check my forked repositories. I've already implemented it:
https://github.com/Yeongtae/tacotron2/blob/master/GTA.py
https://github.com/Yeongtae/waveglow/blob/master/mel2samp.py
@Yeongtae Thanks for making this easier. I understand that WaveGlow's authors think that training with GTA isn't necessary, but after reading the Tacotron 2 paper it was natural to ask for this. I think it might work better for other low-resource training datasets.
So I tried using your repo's GTA.py to generate the aligned mel outputs from my trained Tacotron 2 model. When I trained my text-to-mel model I used exactly the same parameters as your version of hparams.py, except for the batch size (because of GPU memory limitations). However, I am getting this error:
```
Traceback (most recent call last):
  File "GTA.py", line 184, in <module>
    GTA_Synthesis(args.output_directory, args.checkpoint_path, args.n_gpus, args.rank, args.group_name, hparams)
  File "GTA.py", line 139, in GTA_Synthesis
    _, mel_outputs_postnet, _, _ = model(x)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/model.py", line 526, in forward
    output_lengths)
  File "/model.py", line 502, in parse_output
    outputs[0].data.masked_fill_(mask, 0.0)
RuntimeError: The expanded size of the tensor (1562) must match the existing size (1225) at non-singleton dimension 2. Target sizes: [4, 80, 1562]. Tensor sizes: [4, 80, 1225]
```
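For reference, line 502 of model.py sits in parse_output, which builds a padding mask from output_lengths. Condensed from NVIDIA's tacotron2 (utils.py / model.py, lightly edited): the mask's time dimension comes from the dataset's mel lengths, while the tensor being filled is the decoder's actual output:

```python
import torch

# Condensed from NVIDIA tacotron2's utils.py / model.py (lightly edited).
def get_mask_from_lengths(lengths):
    # True for valid frames per utterance; width = max(output_lengths)
    max_len = torch.max(lengths).item()
    ids = torch.arange(0, max_len, device=lengths.device)
    return ids < lengths.unsqueeze(1)

def parse_output(self, outputs, output_lengths=None):
    if self.mask_padding and output_lengths is not None:
        mask = ~get_mask_from_lengths(output_lengths)          # [B, 1562] here
        mask = mask.expand(self.n_mel_channels, *mask.size())  # [80, B, 1562]
        mask = mask.permute(1, 0, 2)                           # [B, 80, 1562]
        # outputs[0] is the decoder's mel output, here [4, 80, 1225]:
        # masked_fill_ raises the size-mismatch error above when the
        # two time dimensions disagree.
        outputs[0].data.masked_fill_(mask, 0.0)                # line 502
    return outputs
```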
I wonder where I can fix this dimension mismatch? Many thanks in advance!
@feddybear My guess is that it will be solved if you use your own hparams.py for the text-to-mel model, or if you attach GTA.py to the original repo you trained your text-to-mel model with.
I'm not using GTA.py myself, for the following reason:
Why use GTA synthesis at all? There is a more suitable solution: normalizing the log mel-spectrograms, e.g., to a range of [-4, 4].
Here is my implementation of normalization to [-4, 4]:
https://github.com/Yeongtae/tacotron2/blob/master/layers.py
https://github.com/Yeongtae/tacotron2/blob/master/audio_processing.py
https://github.com/Yeongtae/tacotron2/blob/master/inference.py
I referred to the code of https://github.com/Rayhane-mamah/Tacotron-2
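For concreteness, here is a minimal sketch of that symmetric normalization scheme in isolation. The constants follow Rayhane-mamah's defaults (min_level_db = -100, max_abs_value = 4); treat them as assumptions and check hparams.py in that repo for the exact values:

```python
import numpy as np

# Assumed constants, following Rayhane-mamah's Tacotron-2 defaults.
MIN_LEVEL_DB = -100.0
MAX_ABS_VALUE = 4.0  # the [-4, 4] range discussed above

def normalize(mel_db):
    """Map a log-mel spectrogram in dB (roughly [MIN_LEVEL_DB, 0])
    to the symmetric range [-MAX_ABS_VALUE, MAX_ABS_VALUE]."""
    scaled = (mel_db - MIN_LEVEL_DB) / -MIN_LEVEL_DB  # -> [0, 1]
    return np.clip(2.0 * MAX_ABS_VALUE * scaled - MAX_ABS_VALUE,
                   -MAX_ABS_VALUE, MAX_ABS_VALUE)

def denormalize(mel_norm):
    """Inverse mapping back to dB, for the vocoder / inversion side."""
    scaled = (np.clip(mel_norm, -MAX_ABS_VALUE, MAX_ABS_VALUE)
              + MAX_ABS_VALUE) / (2.0 * MAX_ABS_VALUE)  # -> [0, 1]
    return scaled * -MIN_LEVEL_DB + MIN_LEVEL_DB
```

Setting MAX_ABS_VALUE = 1 gives the [-1, 1] variant discussed below; whichever range is used, the vocoder has to be trained on mels passed through the same normalize().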
@Yeongtae thank you for your response!
Regarding GTA, I actually cloned your repo and used it instead of the original, so there's no difference between your settings and mine except for the batch size and the number of symbols (since I'm working on a different language). I will stop pursuing the GTA direction, then.
Anyway, thank you for recommending mel-spectrogram normalization instead. I assume we also need to train WaveGlow on these normalized mel spectrograms?
@feddybear
I haven't checked the results after using GTA.py, since I'm not currently using this code.
GTA.py is just inference code that generates GTA mels from the training data using a trained text-to-mel model.
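In essence it does something like the following (a minimal sketch assuming NVIDIA tacotron2's parse_batch/forward interface; save_mel is a hypothetical per-utterance writer, and the real GTA.py also handles checkpoint loading, file naming, and trimming to the true lengths):

```python
import torch

@torch.no_grad()
def generate_gta_mels(model, train_loader, save_mel):
    """Sketch: run the trained text-to-mel model over the *training* set
    with teacher forcing, so the predicted mels stay frame-aligned to the
    ground truth, and save them as targets for vocoder training."""
    model.eval()
    for batch in train_loader:
        x, _ = model.parse_batch(batch)           # NVIDIA tacotron2 interface
        _, mel_outputs_postnet, _, _ = model(x)   # teacher-forced forward pass
        for i, mel in enumerate(mel_outputs_postnet):
            save_mel(mel.cpu(), i)                # hypothetical callback
```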
@feddybear
I just tested the GTA.py code. There is no problem in my case.
Check the exact dimensions of your model.
I'm sorry I can't help you further.
Hi @Yeongtae, regarding your comment on normalizing the mel features to the [-4, 4] range: I was wondering why this range, and not the usual zero-mean, unit-variance-style range of [-1, 1]? Any insight would be helpful.
My guess is that [-1, 1] would also be fine. I chose [-4, 4] because Rayhane-mamah uses that range in https://github.com/Rayhane-mamah/Tacotron-2.
Thanks for your quick comment @Yeongtae
Here is my own implementation: https://github.com/begeekmyfriend/tacotron2/commit/0357bf763ec8e5aebaffa38ae7ef4adf997ed237. Just type
`bash scripts/gta_synth.sh`
Anyway, it also supports a reduction factor, and the preprocessing is compatible with Rayhane-mamah's version.
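Roughly, reduction-factor support means each target mel must be padded along time to a multiple of r = n_frames_per_step before teacher forcing. A minimal sketch, where the edge-padding choice is an assumption rather than necessarily what the commit above does:

```python
import numpy as np

def pad_to_reduction_factor(mel, r):
    """Pad mel ([n_mels, T]) along time to the next multiple of the
    reduction factor r, so the decoder can emit r frames per step."""
    pad = (-mel.shape[1]) % r
    if pad:
        mel = np.pad(mel, ((0, 0), (0, pad)), mode="edge")
    return mel
```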
@Yeongtae, I wonder what made you decide to implement training with GTA mels, and why @rafaelvalle did not think it is necessary? I tried generating audio with NVIDIA's Tacotron 2 using the pretrained models, and the sound quality was great. However, when I trained the pretrained Tacotron 2 model on the LJSpeech data for another few epochs or so, the quality deteriorated. My hypothesis was that this is because the WaveGlow model is no longer fine-tuned on the exact Tacotron model used for inference. However, it seems the pretrained WaveGlow model was not trained on the generated aligned mels either. What are your thoughts?
The train/inference mismatch problem is one of the most important topics in vocoder research, so recent vocoder work adopts a fine-tuning strategy using synthesized mels.
However, even without these efforts, HiFi-GAN produces good-quality synthesized speech.
I am wondering how I would train this model using the GTA mels generated from a Tacotron 2 model. Any advice? Thank you!