cnlinxi / style-token_tacotron2

style token with tacotron2
MIT License

The trained model generates different wavs with the same text and reference audio #10

Open MorganCZY opened 4 years ago

MorganCZY commented 4 years ago

When running tests, I found that each time I ran synthesize.py (with the same text and reference audio), I got different results (namely, different synthesized wavs). After looking through the code, I did not find any random operations in the synthesis path. Could you give me an explanation?

cnlinxi commented 4 years ago

Please specify the reference audio's path in 'tacotron_style_reference_audio' in hparams.py, then synthesize. Feel free to raise more questions.
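
For concreteness, a minimal sketch of that setting (the wav path is a placeholder, and the surrounding HParams call is assumed from the usual Tacotron-2 hparams.py layout, not quoted from this repo):

from tensorflow.contrib.training import HParams  # TF 1.x, as this repo uses

hparams = HParams(
    # Point the GST reference encoder at a fixed utterance:
    tacotron_style_reference_audio='path/to/reference.wav',
)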

MorganCZY commented 4 years ago

Yes, I have specified the reference audio path in hparams.py.

cnlinxi commented 4 years ago

In hparams.py:

tacotron_style_alignment=None,

you can manually specify the style token alignment weights instead of deriving them from the reference audio.

Do you mean this?
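
For illustration, a manual setting might look like the line below; the token count and the weight values are hypothetical, not taken from this repo:

tacotron_style_alignment=[0.0, 0.5, 0.0, 0.3, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0],  # hypothetical weights over 10 style tokens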

MorganCZY commented 4 years ago

[screenshot of hparams settings]

Here are my hparams settings. I specify a reference audio path, which is fed to the GST module (namely, the reference encoder). For a trained model, the weights of the encoder, decoder, attention, and GST are all fixed. So I fundamentally cannot understand why I get different wavs with the same text and the same reference audio as input, given that there appear to be no random operations in the code.

cnlinxi commented 4 years ago

@MorganCZY In the original Tacotron-2, dropout in the prenet is kept on even during inference, and this implementation follows that. So every time you generate, the wav will be different.
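
A minimal sketch of the mechanism (plain TF 1.x, not code from this repo): with training=True, tf.layers.dropout samples a fresh mask on every evaluation, so identical inputs yield different outputs.

import numpy as np
import tensorflow as tf  # TF 1.x, as the repo uses

x = tf.ones([1, 8])
y = tf.layers.dropout(x, rate=0.5, training=True)  # mask resampled on each run
with tf.Session() as sess:
    a, b = sess.run(y), sess.run(y)
print(np.array_equal(a, b))  # almost surely False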

CathyW77 commented 4 years ago

I would like to ask about this issue as well: if I want the generated audio to be identical every time for the same tacotron_style_reference_audio, what should I do?

cnlinxi commented 4 years ago

@CathyW77 Disabling the dropout in the prenet at generation time should be enough. In the Prenet class in tacotron/models/modules.py, there is:

x = tf.layers.dropout(dense, rate=self.drop_rate, training=True, name='dropout_{}'.format(i + 1) + self.scope)

Set the training argument of tf.layers.dropout() to False when generating.
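
One way to wire this up, as a sketch (it assumes the Prenet instance exposes an is_training flag that is False at synthesis time, which the reply below also tries):

x = tf.layers.dropout(dense, rate=self.drop_rate, training=self.is_training, name='dropout_{}'.format(i + 1) + self.scope)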

MorganCZY commented 4 years ago

After searching the whole repo, it is indeed the only random operation in the synthesis process. But with either training=False or training=self.is_training in the prenet, the model no longer generates correct wavs.
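
An alternative worth sketching, though not tried in this thread: keep dropout active (training=True) but pin the random seeds, so that separate runs of synthesize.py, each building a fresh graph, sample the same masks and therefore produce the same wav.

import numpy as np
import tensorflow as tf  # TF 1.x

tf.set_random_seed(1234)  # graph-level seed
# Op-level seed; together the two make the mask sequence repeatable across sessions.
x = tf.layers.dropout(tf.ones([1, 8]), rate=0.5, training=True, seed=5678)

with tf.Session() as s1, tf.Session() as s2:
    print(np.array_equal(s1.run(x), s2.run(x)))  # True: each fresh session replays the same sequence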

cnlinxi commented 4 years ago

@MorganCZY What do you mean by correct wav? Does it fail to generate audio at all?

MorganCZY commented 4 years ago

samples.zip
- true.wav → training=True
- self_is_training.wav → training=self.is_training
- false.wav → training=False

cnlinxi commented 4 years ago

@MorganCZY This completely failed. Can you show a sample of your training corpus and the alignment during training?

MorganCZY commented 4 years ago

I trained this model on THCHS-30. alignment.zip Here are the latest three alignment graphs, corresponding to 60k, 65k, and 70k steps.

CathyW77 commented 4 years ago

@cnlinxi After I set it to False, all of the generated audio was broken and not a single word came out; changing it back to True generates normally again.

cnlinxi commented 4 years ago

@CathyW77 Huh, why would that happen? That is really strange. Admittedly, I have never tried turning off this dropout myself.

cnlinxi commented 4 years ago

I trained this model on THCHS-30. alignment.zip Here are the latest three alignment graphs, corresponding to 60k, 65k, and 70k steps.

@MorganCZY

This is a bit strange. I am sorry, I do not know what happened. The alignment looks good, so you should check your synthesis code and inputs.