Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License
2.27k stars · 905 forks

Speaker adaptation - Fine tuning #279

Open Clouxie opened 5 years ago

Clouxie commented 5 years ago

Hi there, I'm looking for answers on how to do some kind of fine-tuning like this: https://github.com/Kyubyong/speaker_adapted_tts in Rayhane's Tacotron solution. Does anyone know how? Is it possible?

m-toman commented 5 years ago

Generally, just stopping at some point, swapping out the data, and continuing the training works quite well in my experience.

Clouxie commented 5 years ago

Yep, but as a result, at the "eval / save" step I get an error (division by zero exception).

m-toman commented 5 years ago

Interesting. I didn't try it with the latest version (and not with WaveNet either), so perhaps something changed there.

Clouxie commented 5 years ago

Could you please tell me which version you used? Also, I've set the eval and checkpoint steps to 100, and in hparams I set start_decay to the step count my model has already been trained for. Is that okay?

m-toman commented 5 years ago

I'm not doing anything special. It worked for me in this repo (https://github.com/m-toman/tacorn/tree/fatchord_model - please note the branch; I'm working on the master branch). So it's more or less the default hparams: https://github.com/m-toman/Tacotron-2/blob/master/hparams.py

But I only adapted the Tacotron part, not WaveNet (although I adapted a speaker using r9y9's WaveNet and that worked as well).

Rayhane-mamah commented 5 years ago

The division-by-zero bug comes from having 0 batches of eval data. I assume you have very few fine-tuning samples, so 5% of them rounds down to 0 batches. Supposing you use batch_size=32, you have around 600 fine-tuning samples overall?

To overcome that, set "test_size" to None and set "test_batches=10", for example, or whatever number of batches you want to use for validation. That should do it.

Let me know if the issue persists :)
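The rounding Rayhane describes can be sketched in a few lines. This is an illustrative model of the split logic, not the repo's actual code; the function and parameter names are hypothetical:

```python
# Illustrative sketch: with few fine-tuning samples, a 5% eval split
# rounds down to zero batches, and averaging over zero eval batches
# later raises the division-by-zero error reported above.
def eval_batches(n_samples, batch_size=32, test_size=0.05, test_batches=None):
    if test_batches is not None:
        return test_batches  # explicit batch count, as Rayhane suggests
    return int(n_samples * test_size) // batch_size

print(eval_batches(600))                   # 30 eval samples -> 0 batches
print(eval_batches(600, test_batches=10))  # forced to 10 batches
```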

On Thu, 29 Nov 2018, 09:36 Clouxie <notifications@github.com> wrote:

I'm still getting the division-by-zero exception. This is the link to my taco files and fine-tuning training corpus (it has some silence at the beginning and end). Could you please check whether you can train and eval it without any exception? https://www.dropbox.com/sh/rpazj5ll8ahasr7/AAD2M25jsPTsbeViZdF_UBaba?dl=0


Clouxie commented 5 years ago

I have around 3 hours of new data...

Clouxie commented 5 years ago

Okay, now it's saving and doing eval fine. We'll see if the tuning goes well.

hyzhan commented 5 years ago

@m-toman @Rayhane-mamah Fine-tuning by swapping out the data seems to produce a voice that differs somewhat from the fine-tuning data. How can I solve this problem if I don't have enough data?

Clouxie commented 5 years ago

In my opinion you need at least 15-20 minutes of data; more isn't even needed. I'm doing some experiments on my setup and I'll let you know which works best. I think that too much data or too long a fine-tuning run can destroy the language information, so I'm training for at most 1-2k steps.
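To put the 15-20 minute estimate in perspective, here is a rough back-of-the-envelope calculation. The ~5 s average clip length and batch_size of 32 are assumptions, not values stated in this thread:

```python
# Back-of-the-envelope: how many optimizer steps one epoch over a small
# fine-tuning set takes (clip length and batch size are assumptions).
def steps_per_epoch(minutes_of_audio, avg_clip_sec=5.0, batch_size=32):
    n_clips = int(minutes_of_audio * 60 / avg_clip_sec)
    return max(1, n_clips // batch_size)

print(steps_per_epoch(20))  # 240 clips -> 7 steps per epoch
```

At 7 steps per epoch, 1-2k fine-tuning steps is already roughly 140-290 passes over the data, which helps explain why longer runs overfit the small set.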

vito11 commented 5 years ago

From my experiments, too many steps result in an overfitting problem (knowledge loss), but too few steps won't produce a similar sound. Hope you can find the best way of fine-tuning. BTW, maybe https://google.github.io/tacotron/publications/speaker_adaptation/ is a better solution.

ryhorv commented 5 years ago

Hey @Rayhane-mamah and @begeekmyfriend! Have you tried to fine-tune a pretrained model on a different voice? How much data did you use for it? How many steps did you train the pretrained model for? And what learning rate decay did you use?

I pretrained Tacotron on 25 hours of data for 120k steps and then tried to fine-tune on 2.5 hours with a constant lr = 1e-5. After ~30k steps my model starts to overfit and I stop the training. But the quality of the new voice is not good enough: some words and endings are skipped, and there are large pauses between words, though there are no problems with alignment:

[attention alignment plot attached]
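For comparison with the constant lr = 1e-5 described above, here is a hedged sketch of the kind of exponential learning-rate decay schedule Tacotron-2 implementations typically use. The start_decay, decay_steps, and decay_rate values below are illustrative, not this repo's exact hparams:

```python
# Illustrative exponential decay: constant at init_lr until start_decay,
# then halved every decay_steps, floored at min_lr.
def decayed_lr(step, init_lr=1e-3, start_decay=50_000, decay_steps=50_000,
               decay_rate=0.5, min_lr=1e-5):
    if step < start_decay:
        return init_lr
    lr = init_lr * decay_rate ** ((step - start_decay) / decay_steps)
    return max(lr, min_lr)

print(decayed_lr(120_000))  # learning rate at the 120k-step mark under this schedule
```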

begeekmyfriend commented 5 years ago

You may add a guided attention loss to this model without any change to your attention mechanism:

def initialize(self):
    # Grab alignments from the final decoder state
    self.alignments = final_decoder_state.alignment_history.stack()
    # Transpose to [batch_size, encoder_steps, decoder_steps]
    self.alignments = tf.transpose(self.alignments, [1, 2, 0])
...
def add_loss(self):
    N = 400   # max encoder steps covered by the mask
    T = 1000  # max decoder steps covered by the mask
    # Pad alignments out to a fixed [batch, N, T] shape, marking padding with -1
    A = tf.pad(self.alignments, [(0, 0), (0, N), (0, T)], mode="CONSTANT", constant_values=-1.)[:, :N, :T]
    attention_masks = tf.to_float(tf.not_equal(A, -1))
    # guided_attention(N, T) builds the soft diagonal penalty matrix
    gts = tf.convert_to_tensor(guided_attention(N, T))
    attention_loss = tf.reduce_sum(tf.abs(A * gts) * attention_masks)
    mask_sum = tf.reduce_sum(attention_masks)
    attention_loss /= mask_sum
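The guided_attention(N, T) helper referenced in that snippet is not shown in the thread. A hedged NumPy sketch, following the soft diagonal mask of Tachibana et al. (DC-TTS), could look like this; the sharpness parameter g=0.2 is an assumed value, not taken from begeekmyfriend's code:

```python
import numpy as np

def guided_attention(N, T, g=0.2):
    """Soft diagonal penalty matrix: ~0 near the diagonal, ~1 far from it."""
    W = np.zeros((N, T), dtype=np.float32)
    for n in range(N):
        for t in range(T):
            W[n, t] = 1.0 - np.exp(-((t / T - n / N) ** 2) / (2 * g * g))
    return W
```

Multiplying the alignment matrix element-wise by this penalty, as in the add_loss snippet above, punishes attention weights that stray far from the diagonal, which tends to speed up and stabilise alignment on small fine-tuning sets.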
ryhorv commented 5 years ago

I will try. Thank you!

joan126 commented 4 years ago

> Hey @Rayhane-mamah and @begeekmyfriend! Have you tried to fine-tune a pretrained model on a different voice? How much data did you use for it? How many steps did you train the pretrained model for? And what learning rate decay did you use?
>
> I pretrained Tacotron on 25 hours of data for 120k steps and then tried to fine-tune on 2.5 hours with a constant lr = 1e-5. After ~30k steps my model starts to overfit and I stop the training. But the quality of the new voice is not good enough: some words and endings are skipped, and there are large pauses between words, though there are no problems with alignment.

Hi, did you freeze the encoder when you fine-tuned on the new voice dataset? And do you get good quality on the new voice now?
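On the encoder-freezing question: one common approach in TF1-style code like this repo is to pass only the non-encoder variables to the optimizer. A minimal sketch of the variable filtering follows; the scope name is an assumption about the repo's naming, not verified against its graph:

```python
# Hypothetical helper: keep only variables outside the frozen encoder scope,
# then pass the result as var_list to optimizer.minimize(loss, var_list=...).
def trainable_for_finetune(var_names, frozen_scope="Tacotron_model/inference/encoder"):
    return [name for name in var_names if not name.startswith(frozen_scope)]

names = [
    "Tacotron_model/inference/encoder/lstm/kernel",
    "Tacotron_model/inference/decoder/attention/kernel",
]
print(trainable_for_finetune(names))  # only the decoder variable remains
```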