NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

ReferenceEncoder did not use the actual mel lengths #71

Closed by hubeibei007 4 years ago

hubeibei007 commented 4 years ago

The ReferenceEncoder takes only the mels as input. During training or batch inference the mels are padded, so it seems the ReferenceEncoder should use the actual mel lengths to obtain the last GRU hidden state. Otherwise, for the shorter items in a batch, the final state is computed after the GRU has also run over the padding frames. Please correct me if I have misunderstood.

Adding my test results: when the ReferenceEncoder obtains the last hidden state via pack_padded_sequence, the style-token clusters on my dataset are clearer than without pack_padded_sequence.
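
The effect described above can be reproduced in isolation. The following is a minimal sketch, not Mellotron's code: the layer sizes, sequence lengths, and variable names are assumptions chosen only for illustration. It compares the final GRU state of a zero-padded batch with and without pack_padded_sequence, using the short sequence run on its own as the reference.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
feat_dim, hidden_dim = 4, 3
gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

# Two sequences with true lengths 5 and 2, zero-padded to length 5.
lengths = torch.tensor([5, 2])
batch = torch.randn(2, 5, feat_dim)
batch[1, 2:] = 0.0  # padding frames of the shorter sequence

with torch.no_grad():
    # Padded input: the GRU keeps updating its state on the padding frames.
    _, h_padded = gru(batch)

    # Packed input: the GRU stops at each sequence's true length.
    packed = nn.utils.rnn.pack_padded_sequence(
        batch, lengths, batch_first=True, enforce_sorted=False)
    _, h_packed = gru(packed)

    # Reference: the short sequence on its own, with no padding at all.
    _, h_short = gru(batch[1:2, :2])

# The packed state should match the unpadded reference; the padded one should not.
same_as_unpadded = torch.allclose(h_packed[0, 1], h_short[0, 0], atol=1e-6)
padded_differs = not torch.allclose(h_padded[0, 1], h_short[0, 0], atol=1e-6)
print(same_as_unpadded, padded_differs)
```

In the padded case the short sequence's final state has been updated three more times on all-zero frames, so it no longer summarizes only the real frames.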

rafaelvalle commented 4 years ago

I assume you modified this code to something of this sort. Can you submit a pull request?

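# Each stride-2 conv roughly halves the time axis, so scale the mel lengths
# down by 2 ** (number of conv layers) to match the conv output.
lens = (lens.cpu().numpy() / 2 ** len(self.convs))
lens = lens.round().astype(int)
# Pack so the GRU stops at each sequence's true (downsampled) length.
out = nn.utils.rnn.pack_padded_sequence(
    out, lens, batch_first=True, enforce_sorted=False)
self.gru.flatten_parameters()
# The final hidden state now ignores the padded frames.
_, out = self.gru(out)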
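
One detail worth checking in a snippet of this sort is how the scaled lengths are rounded. The sketch below is an illustration under stated assumptions, not the repository's code: it assumes each conv layer uses kernel size 3, stride 2, and padding 1, and the helper names conv_out_len, downsampled_len, and rounded_len are hypothetical. Under those assumptions the exact output length after K layers equals ceil(length / 2 ** K), whereas rounding can under-count and can even give 0 for short inputs, a length that, as far as I know, pack_padded_sequence rejects.

```python
import math


def conv_out_len(length, kernel=3, stride=2, padding=1):
    """Standard conv output-length formula along one axis (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def downsampled_len(length, num_convs):
    """Exact length after num_convs stacked convs with the assumed settings."""
    for _ in range(num_convs):
        length = conv_out_len(length)
    return length


def rounded_len(length, num_convs):
    """Length obtained by dividing by 2 ** num_convs and rounding."""
    return int(round(length / 2 ** num_convs))


num_convs = 6  # assumed number of conv layers, for illustration only
for n in (20, 70, 100):
    print(n, downsampled_len(n, num_convs),
          math.ceil(n / 2 ** num_convs), rounded_len(n, num_convs))
```

If those layer settings hold, taking the ceiling of the scaled lengths (or clamping them to at least 1) keeps them consistent with the conv output.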

hubeibei007 commented 4 years ago

OK, I will prepare and submit a pull request.