MorganCZY closed this issue 4 years ago.
Well, it is strange for both of us. If you enabled amp_run when parsing arguments, please set the amp opt level to O0 to raise the training precision. If not, you might try the old Mel spectrogram extraction method, which loads Mel spectrograms from numpy files as this code shows. Let us make some comparisons.
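The numpy-loading path mentioned above amounts to something like the following sketch (the filename and the 80×100 shape are made up for illustration; the actual repo reads precomputed mels per utterance in its dataloader):

```python
import os
import tempfile

import numpy as np

# Stand-in for a precomputed mel saved during preprocessing
# (hypothetical path; shape [n_mel_channels, frames]).
tmp = os.path.join(tempfile.mkdtemp(), "sample_0001.npy")
np.save(tmp, np.random.rand(80, 100).astype(np.float32))

# What the load_mel_from_disk=True path would do per utterance:
mel = np.load(tmp)
print(mel.shape)  # (80, 100)
```

The point of this mode is that the mel statistics (including the lowest value used for padding) are fixed by whatever preprocessing produced the .npy files, rather than by the on-the-fly stft.py pipeline.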
I didn't activate any options for amp or distributed_run. I'm going to compare the two mechanisms of calculating the mel spectrogram. Meanwhile, if you find any clues, please share them asap. Thanks in advance!
I retrained this repo with the option "load_mel_from_disk", but the alignment is still not correct after 30 epochs.
I forgot to tell you that you need to change mel_pad_val back to -4 or -5, which should be the lowest Mel value.
I noticed this param, but I have a question. This param is just used as the value of silence when padding. To my knowledge, a value that doesn't appear in the numerical range of the mel is usually chosen. Then why would -11.5129, which is far smaller than -4 or -5, cause an alignment issue? (Btw, I will immediately retrain this repo with this param changed to -4 to check.)
After carefully checking your code, I suppose I should simply change mel_pad_val to -4 (or whatever the lowest mel value is). In your repo, when "load_mel_from_disk=False", the mel-spec is normalized by the function "dynamic_range_compression", after which the minimum of the mel is -11.5129 (equal to torch.log(1e-5)). Thus, the final mel-specs fed into tacotron2 lie in the numerical range [-11.5129, torch.log(mel.max())], which is not a symmetrical interval. I'm not sure whether that is what ruins the training. I wonder: when training this tacotron2, is a symmetrical interval of mel values needed?
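The -11.5129 floor described above follows directly from the log compression: magnitudes are clamped to 1e-5 before taking the log. A minimal sketch (the repo's version operates on torch tensors via torch.log(torch.clamp(x, min=1e-5)); plain math here for brevity):

```python
import math

def dynamic_range_compression(x, clip_val=1e-5):
    # Floor the magnitude at clip_val before the natural log, so the
    # smallest possible output is log(1e-5) ≈ -11.5129.
    return math.log(max(x, clip_val))

lowest = dynamic_range_compression(0.0)  # a fully silent magnitude bin
print(lowest)  # -11.512925464970229, i.e. log(1e-5)
```

So every silent frame bottoms out at about -11.51, while loud frames sit near 0 or slightly above: the interval is asymmetric by construction.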
I have not tested your idea closely, but the implementation in stft.py is used in many other TTS projects, like MelGAN, WaveGlow and so on, where the Mel values are not symmetrical and the lowest value is -11.5129. But I have no idea whether it ruins the alignment in TTS training.
I retrained the repo with mel-specs calculated in the "load_mel_from_disk=False" mode, which calls this line. The alignment is then much more normal. However, most alignments look like the first picture below; only some are perfect like the second. I can't understand the reason for the first kind of alignment. I'd really appreciate your suggestions!
It confuses me too. But the numpy file loading approach leads to an uncertain lowest value of the GTA Mel spectrograms, due to the ref_min_db bias in Mel extraction. Therefore it also makes the vocoder's Mel padding uncertain across different anchor corpora.
There is still a way worth trying for the approach in stft.py: you can comment out the preemphasis call in wav loading as well as the deemphasis call in the griffin-lim function, both of which I added myself. I am not sure if it works. By the way, I did not modify any code in stft.py itself.
But the numpy file loading approach leads to an uncertain lowest value of the GTA Mel spectrograms, due to the ref_min_db bias in Mel extraction. Therefore it also makes the vocoder's Mel padding uncertain across different anchor corpora.
The lowest value of the GTA mel specs is -4.0, determined by the code here. Thus, I set "mel_pad_val=-4".
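For context on where a -4.0 floor can come from: many Tacotron-family preprocessing forks rescale the mel (in dB) into [-max_abs_value, max_abs_value] with max_abs_value=4 and clip. The sketch below is a hypothetical reconstruction of that common pattern, not the linked repo's actual code; the function and parameter names are made up:

```python
def normalize_mel(mel_db, min_level_db=-100.0, max_abs_value=4.0):
    # Map [min_level_db, 0] dB linearly into [-max_abs_value, max_abs_value],
    # then clip, so pure silence can never go below -4.0.
    x = (2 * max_abs_value) * ((mel_db - min_level_db) / -min_level_db) - max_abs_value
    return max(-max_abs_value, min(max_abs_value, x))

print(normalize_mel(-120.0))  # -4.0 (clipped)
print(normalize_mel(0.0))     # 4.0
```

Under a scheme like this, the value range is symmetric and bounded, so padding with the exact minimum (-4.0) is unambiguous, unlike the open-ended log scale where the floor is -11.5129.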
You will know that it is not correct when you generate GTA Mel spectrograms.
I've figured out why the alignment appears so strange. When plotting the alignment, a random index is first generated to decide which alignment out of the batch to plot. In the training stage, one batch of data, including texts and mels, is padded to the corresponding max lengths. A random alignment (e.g. with shape [128, 151]) out of one batch therefore has (128 - seq_len) rows of zeros. Meanwhile its second dim, corresponding to decoding steps, also contains some meaningless values because of the padding operation here. I modified the relevant code as follows, and now the alignment shows correctly.
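The fix described above amounts to cropping the padded rows and columns before plotting. A hypothetical sketch of such a helper (not the thread author's actual code, which is not shown here): the first dim is the padded text axis, the second the padded decoder axis, and only the valid region is kept:

```python
import numpy as np

def trim_alignment(alignment, text_len, mel_len):
    # alignment: [max_text_len, max_decoder_steps] for one batch item.
    # Drop the (max_text_len - text_len) zero rows and the padded
    # decoder steps, keeping only the meaningful region.
    return alignment[:text_len, :mel_len]

align = np.random.rand(128, 151)              # padded alignment from a batch
trimmed = trim_alignment(align, text_len=96, mel_len=140)
print(trimmed.shape)  # (96, 140)
```

The per-sample text_len and mel_len would come from the same length tensors the collate function already builds for padding.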
Good job! I have an idea: you could try replacing spectral_normalize with amp_to_db to check whether the alignment improves.
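For reference, the two compressions being compared differ in scale and floor. A sketch using the definitions common in TTS codebases (scalar versions for brevity; this repo's spectral_normalize wraps the torch-based dynamic_range_compression, and amp_to_db definitions vary slightly between forks):

```python
import math

def spectral_normalize(x, clip_val=1e-5):
    # Natural-log compression as in stft.py; floor ≈ -11.5129.
    return math.log(max(x, clip_val))

def amp_to_db(x, min_level=1e-5):
    # Decibel-scale alternative; floor is 20*log10(1e-5) = -100 dB,
    # usually followed by a normalization step into a fixed range.
    return 20.0 * math.log10(max(x, min_level))

print(spectral_normalize(0.5))  # -0.6931...
print(amp_to_db(0.5))           # -6.0205...
```

The dB version is typically paired with the [-4, 4] normalization discussed earlier, which is why swapping it in changes the effective mel range the model sees.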
Actually, in my project, I have completely replaced the data processing with the functions in common/audio.py and set the option "load_mel_from_disk=True". It seems easier and more stable for training, at least for me.
I trained this repo on the BiaoBei dataset with no parameters modified. However, it didn't reach a perfect alignment even after 278 epochs. Compared to your training runs, could you please give me some suggestions on improving the alignment?