NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

Dimensions mismatch when using pretrained model #44

Closed · mathigatti closed this issue 4 years ago

mathigatti commented 4 years ago

Hi, thanks for sharing this great project.

I get the following size-mismatch errors when trying to fine-tune the pretrained model you provided.

size mismatch for embedding.weight: copying a param with shape torch.Size([148, 512]) from checkpoint, the shape in current model is torch.Size([185, 512]).

size mismatch for decoder.attention_rnn.weight_ih: copying a param with shape torch.Size([4096, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1281]).

size mismatch for decoder.attention_layer.memory_layer.linear_layer.weight: copying a param with shape torch.Size([128, 512]) from checkpoint, the shape in current model is torch.Size([128, 1024]).

size mismatch for decoder.decoder_rnn.weight_ih: copying a param with shape torch.Size([4096, 1536]) from checkpoint, the shape in current model is torch.Size([4096, 2048]).

size mismatch for decoder.linear_projection.linear_layer.weight: copying a param with shape torch.Size([80, 1536]) from checkpoint, the shape in current model is torch.Size([80, 2048]).

size mismatch for decoder.gate_layer.linear_layer.weight: copying a param with shape torch.Size([1, 1536]) from checkpoint, the shape in current model is torch.Size([1, 2048]).

Maybe the pretrained model was trained with a different version of the code? For example, n_symbols causes some of the problems; after replacing it with the value from Tacotron 2, that one is solved. Do you know how I could solve the other mismatches?
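In case it helps, this is a minimal sketch of how I am comparing the checkpoint's tensor shapes against the current model to locate the mismatches (the checkpoint path and the `create_hparams`/`load_model` imports in the usage comment are my assumptions about the repo layout):

```python
import torch

def report_shape_mismatches(model, checkpoint_path):
    """Print every parameter whose shape differs between the saved
    checkpoint and the model built from the current hparams."""
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    # Tacotron 2 / Mellotron checkpoints keep the weights under "state_dict";
    # fall back to the raw dict in case a bare state_dict was saved.
    saved_state = checkpoint.get("state_dict", checkpoint)
    model_state = model.state_dict()
    for name, saved in saved_state.items():
        current = model_state.get(name)
        if current is not None and tuple(current.shape) != tuple(saved.shape):
            print(f"{name}: checkpoint {tuple(saved.shape)} "
                  f"vs current model {tuple(current.shape)}")

# Usage (assuming the repo layout; adjust paths and names as needed):
# from hparams import create_hparams
# from train import load_model
# model = load_model(create_hparams())
# report_shape_mismatches(model, "models/mellotron_libritts.pt")
```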

Thanks again :)

mathigatti commented 4 years ago

My mistake! I used the tacotron2 checkpoint.
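For anyone who hits the same errors: a quick sanity check is to inspect one of the mismatched tensors in the checkpoint you downloaded. A minimal sketch, with a placeholder path:

```python
import torch

# Look at the text-embedding table of the downloaded checkpoint.  The
# Tacotron 2 checkpoint I was loading has 148 symbols, while the model
# built from this repo expects 185 (the first mismatch in the error above).
ckpt = torch.load("models/checkpoint.pt", map_location="cpu")  # placeholder path
state = ckpt.get("state_dict", ckpt)
print(state["embedding.weight"].shape)
# torch.Size([148, 512]) -> Tacotron 2 checkpoint (wrong one)
# torch.Size([185, 512]) -> shape the current Mellotron model expects
```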