How to train on the Biaobei (标贝) dataset?

xinzheshen commented 5 years ago

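@begeekmyfriend Hello, have you trained this model on the Biaobei dataset? Also, apart from text-cleaners, are there any notable differences between the training parameters for Chinese and English? I trained on the Biaobei dataset with the original NVIDIA repo, setting text-cleaners to transliteration_cleaners and batch_size = 32, and leaving everything else at its defaults, including epochs=1501. After training, I fed the mel spectrograms output by tacotron2 into NVIDIA's pretrained waveglow, but the synthesis quality was very poor. When I feed waveglow mel spectrograms computed directly from Chinese speech, the result is normal, which suggests the vocoder itself is fine. I don't know whether the number of training epochs is insufficient or some parameters are unsuitable, so I would like to ask for your advice. Many thanks. [Attachment: Screenshot from 2019-11-29 09-41-41]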

begeekmyfriend commented 5 years ago

Please allow me to answer your question in English so that everybody can understand it.

My mel spectrograms range over [-4, 4], which is compatible with the hyperparameters of Rayhane Mamah's Tacotron-2. That is why I have to set the mel_pad_val hyperparameter in my repo. You may need to apply a linear transform to the mel output to make the WaveGlow vocoder work well, something like (mel + 5) / 2 * 5. Of course, you can use mel.min and mel.max to verify this.
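The range check and linear rescaling suggested above can be sketched as follows. This is a minimal illustration, not code from either repo: `rescale_mel` is a hypothetical helper, the random array only stands in for a real Tacotron 2 output, and the destination range is a placeholder that you would replace with the range actually observed (via `mel.min()` and `mel.max()`) on mels the vocoder was trained on.

```python
import numpy as np

def rescale_mel(mel, src_range, dst_range):
    """Linearly map values from src_range to dst_range (hypothetical helper)."""
    src_lo, src_hi = src_range
    dst_lo, dst_hi = dst_range
    scale = (dst_hi - dst_lo) / (src_hi - src_lo)
    return (mel - src_lo) * scale + dst_lo

# Stand-in for a Tacotron 2 output normalized to [-4, 4] (80 mel bins, 100 frames).
mel = np.random.default_rng(0).uniform(-4.0, 4.0, size=(80, 100))
print("before:", mel.min(), mel.max())

# Placeholder target range (an assumption); measure the real one on the
# vocoder's own training mels before relying on it.
target = (-12.0, 2.0)
rescaled = rescale_mel(mel, src_range=(-4.0, 4.0), dst_range=target)
print("after:", rescaled.min(), rescaled.max())
```

If the printed range after rescaling does not match what the vocoder saw during training, the mismatch alone may explain poor audio, independent of how well the acoustic model has converged.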

Besides, I have provided a Griffin-Lim (G&L) method to convert mel outputs into audio for evaluation. Why not run bash scripts/inference_tacotron2.sh to verify it?

begeekmyfriend commented 4 years ago

Sorry, I should add that you can increase the number of data loader workers to speed up training: https://github.com/begeekmyfriend/tacotron2/commit/35e7f455372707b3e7146ddda7bdb45f7de5daa5

begeekmyfriend commented 4 years ago

My fault https://github.com/begeekmyfriend/tacotron2/issues/5

xinzheshen commented 4 years ago

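@begeekmyfriend Thank you for your reply. I am actually using NVIDIA's code (commit 0970653) to train the tacotron2 model on the Biaobei dataset. I only changed text-cleaners to basic_cleaner and set the batch size to 64; everything else uses the defaults in train_tacotron2.sh. Training has now reached step 1900 and the loss has plateaued. Using the pretrained waveglow provided by NVIDIA as the vocoder, the resulting audio quality is not particularly good (the audio content is: 长城是古代中国, "The Great Wall is ancient China"). I'm not sure how to adjust things next and would like to hear your suggestions. Is there anything that needs special attention when training on Chinese compared with English? Sorry to bother you again, and thank you. [Attachments: Screenshot from 2019-12-11 10-38-35, align_1981_151, audio_2.wav.tar.gz]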

begeekmyfriend commented 4 years ago

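I still recommend using my version. The figures below are the alignment plots at Epoch 9 (2286 iterations) and Epoch 20 (5080 iterations) respectively (results on Biaobei should be similar), and the results have stayed stable since. In my tests, training is somewhat faster than Tacotron-2 (the TensorFlow version). GPU memory usage is a little higher, of course; you can increase the reduction factor or reduce the batch size. It has not yet been optimized with apex. [Attachments: align_0009_2286, align_0020_5080]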

xinzheshen commented 4 years ago

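@begeekmyfriend Thanks. Roughly how many epochs does your version need before it stabilizes? I don't have much time left, haha. Also, which vocoder do you use? Is there a vocoder that matches your version? I'm not sure my thinking is right, but the reason I wanted to train NVIDIA's version is that it connects seamlessly with their pretrained waveglow; otherwise I would have to train a vocoder myself to match another version.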

begeekmyfriend commented 4 years ago

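The vocoder options are here; pick one of the two. You can also train WaveGlow if you like, but I personally recommend WaveRNN. As for the number of epochs, forgive me for not providing tensorboard visualization. In my experience it is enough for the stop token loss to drop to 0, which takes roughly 50-odd epochs. You can run bash scripts/inference_tacotron2.sh to check with Griffin-Lim.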

xinzheshen commented 4 years ago

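@begeekmyfriend Thank you very much for your patient reply. Is it usable after only 200 epochs? The other pretrained models I have seen often run to tens or hundreds of k; does that refer to the total number of iterations? Also, I see that the STFT parameters in your wavernn differ from those in the current tacotron2. Doesn't that matter?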

begeekmyfriend commented 4 years ago

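Everyone's corpus size is different; for some, a single epoch runs to over a thousand samples (multi-speaker data, for example), so you can work it out yourself and set it in scripts/train_tacotron2.sh. Also, the WaveRNN hyperparameters have already been adapted; just put the GTA results in the root directory and you can train: https://github.com/begeekmyfriend/WaveRNN/commit/f6cb1a3d6d6b58dbbba5301a88d05c1beb9230af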
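The epoch-versus-iteration arithmetic mentioned above can be worked through with the figures already quoted in this thread (epoch 9 at 2286 iterations, epoch 20 at 5080 iterations). The sketch below is only illustrative: `iterations_per_epoch` is a hypothetical helper, and the 10000-utterance corpus with batch size 64 is an assumed example, not a measurement from either repo.

```python
import math

def iterations_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of optimizer steps in one pass over the corpus (hypothetical helper)."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

# Figures quoted earlier in the thread: both alignment plots imply the same rate.
iters_per_epoch = 5080 // 20
assert iters_per_epoch == 2286 // 9  # 254 iterations per epoch in both cases

# Roughly 50 epochs at that rate:
print(50 * iters_per_epoch)  # 12700

# Assumed example: a 10000-utterance corpus with batch size 64.
print(iterations_per_epoch(10000, 64))  # 157
```

Step counts in the tens or hundreds of thousands, as quoted for other pretrained models, are therefore totals of iterations and only become comparable once divided by the iterations per epoch of the corpus in question.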

xinzheshen commented 4 years ago

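@begeekmyfriend Thank you very, very much. May I ask a basic question? I often see the term GTA in speech synthesis code but don't understand what it means. Does it mean generating mel spectrograms with the trained tacotron2 and passing them to wavernn for training? Why do it this way? Mainly, I don't understand how to prepare data for your version of wavernn. Please advise. Thanks again.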

begeekmyfriend commented 4 years ago

Ground truth aligned (GTA) means the mel spectrograms are evaluated by T2 from the training data. At synthesis time the vocoder's final output is inferred from T2-evaluated mel spectrograms, so those are what we feed it as inputs during training as well. In my experience, the WaveRNN data directory should be structured as follows:

wavernn
└── data
    └── voc_mol
        ├── gta/*.npy
        └── quant/*.npy

Then you can start training with this command:

python train_wavernn.py --gta

As for the ./data/voc_mol/quant directory, just rename the training_data/audio directory generated by preprocess.py to it.

begeekmyfriend commented 4 years ago

By the way, here is a small length-matching script to check the alignment between wav samples and mel hops.

import os
import numpy as np

hop = 256  # hop length in samples; must match the Tacotron 2 STFT settings
gta_dir = 'gta'
quant_dir = 'quant'
mins = []
maxs = []
for f in os.listdir(gta_dir):
    mel = np.load(os.path.join(gta_dir, f))
    wav = np.load(os.path.join(quant_dir, f))
    # Each mel frame (axis 1) must correspond to exactly `hop` wav samples.
    assert mel.shape[1] * hop == wav.shape[0], f
    mins.append(mel.min())
    maxs.append(mel.max())

# Overall value range of the GTA mels.
print(min(mins), max(maxs))
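The length check above assumes every file in gta has a counterpart with the same name in quant; if one is missing, np.load stops with a file-not-found error. A minimal pre-check, using only the standard library, might look like the sketch below. `unmatched_files` is a hypothetical helper, and the data/voc_mol paths simply follow the directory layout described earlier in the thread.

```python
import os

def unmatched_files(gta_dir, quant_dir):
    """Return (only_in_gta, only_in_quant) as sorted lists of .npy file names."""
    gta = {f for f in os.listdir(gta_dir) if f.endswith('.npy')}
    quant = {f for f in os.listdir(quant_dir) if f.endswith('.npy')}
    return sorted(gta - quant), sorted(quant - gta)

if __name__ == '__main__':
    gta_dir = os.path.join('data', 'voc_mol', 'gta')
    quant_dir = os.path.join('data', 'voc_mol', 'quant')
    # Only run the report when both directories actually exist.
    if os.path.isdir(gta_dir) and os.path.isdir(quant_dir):
        only_gta, only_quant = unmatched_files(gta_dir, quant_dir)
        print('missing from quant:', only_gta)
        print('missing from gta:', only_quant)
```

Both lists being empty means the two directories pair up one to one, so the length check can run over every file.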

xinzheshen commented 4 years ago

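@begeekmyfriend Thank you for your generous help. Understood: the gta directory holds the .npy mel files generated in GTA mode by the trained tacotron2, and the quant directory holds the .npy files of the original audio. I'll try it once my tacotron2 is trained. Thanks again.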

xinzheshen commented 4 years ago

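@begeekmyfriend Hello. With your guidance my speech synthesis has made great progress; thanks again. wavernn is still training, but the loss quickly dropped to about 2.7 and then plateaued. The results are decent, just a little unstable. So I have a question: why does the same text produce different synthesis results each time? In theory all parameter values are fixed at inference, so the same input should give the same output. I then compared the mel spectrograms from two runs and found that they are not exactly identical. I'm curious what causes this; do you know?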

begeekmyfriend commented 4 years ago

T2 has dropout; please check with Griffin-Lim first. Also, if https://github.com/begeekmyfriend/tacotron2/issues/4#issuecomment-564927028 checks out, please wait until 600K before testing.

bash scripts/inference_tacotron2.sh

xinzheshen commented 4 years ago

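@begeekmyfriend Thanks. My wavernn has now trained to 640k, and I have synthesized audio with both GL and wavernn. My understanding is that once model.eval() is called during inference, the model no longer applies dropout. Isn't that right?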

begeekmyfriend commented 4 years ago

No, not unless you set the dropout rate to zero.

xinzheshen commented 4 years ago

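@begeekmyfriend OK. That's what everyone says online, haha. I just skimmed the source, and I think it's a problem in the code. Using nn.Dropout instead of functional.dropout() would probably avoid this. I'll test it tomorrow.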

begeekmyfriend commented 4 years ago

I think the two are just wrappers of one another; the interfaces differ, but there is no essential difference.

xinzheshen commented 4 years ago

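@begeekmyfriend Sorry, I misread the arguments passed to functional.dropout() in the code. However, it is correct that once model.eval() is called during inference the model no longer applies dropout. The outputs differ between runs because no random seed is set: at inference time the prenet samples from a Bernoulli distribution.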

begeekmyfriend commented 4 years ago

The model.eval method is called here. The inference prenet implementation is derived from the NVIDIA source.

xinzheshen commented 4 years ago

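I saw that model.eval() is called, which is why I was puzzled that the output differed each time. After setting the random seed, the outputs are now identical.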
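The behaviour discussed above can be reproduced in isolation. The sketch below is a NumPy stand-in, not the repo's PyTorch prenet: `prenet_like` is a hypothetical function that applies a Bernoulli(0.5) mask on every call, as an always-on dropout would, and it shows that two runs agree only when the random generator is seeded identically.

```python
import numpy as np

def prenet_like(x, rng, keep_prob=0.5):
    """Apply a random Bernoulli mask, rescaled to keep the expected value
    (hypothetical stand-in for an always-on dropout layer)."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

x = np.ones((1, 256))

# Unseeded generators: each run draws its own mask, so outputs generally differ.
a = prenet_like(x, np.random.default_rng())
b = prenet_like(x, np.random.default_rng())
print("unseeded runs equal:", np.array_equal(a, b))

# Same seed: identical masks, hence identical outputs.
c = prenet_like(x, np.random.default_rng(1234))
d = prenet_like(x, np.random.default_rng(1234))
print("seeded runs equal:", np.array_equal(c, d))
```

In PyTorch the analogous step would be a call to `torch.manual_seed` before inference; where exactly to place it in the inference script is left open here.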

begeekmyfriend commented 4 years ago

Your PR would be appreciated.

xinzheshen commented 4 years ago

@begeekmyfriend Thank you very much for your help.

leijue222 commented 4 years ago

@xinzheshen Hello, I'd like to ask you a few questions.

Is the mel output of your taco2 within the range [-4, 4]?
When synthesizing wav with wavernn, why do you apply mel += hp.mel_bias, and why set hp.mel_bias = 2?

begeekmyfriend / tacotron2

How to train on the Biaobei (标贝) dataset? #4