NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

Fix mel inputs passed to GST Reference Encoder #94

Open ilya16 opened 3 years ago

ilya16 commented 3 years ago

In the current implementation, the mel inputs passed to the GST module (the targets) have shape [B, n_mel_channels, T_out], and the GST Reference Encoder reshapes them to [B, 1, T_out, n_mel_channels]. As a result, the Reference Encoder does not operate on the original spectrograms.

This PR fixes the shape of the GST inputs and adds an input shape check to ReferenceEncoder.
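
To illustrate the problem, here is a minimal sketch in NumPy, not the code from this PR. It assumes the reshape is a plain reshape with no transpose (the PyTorch analogue would be `view` versus `transpose`). The toy sizes and the `check_gst_inputs` helper are hypothetical and only stand in for the kind of shape check described above.

```python
import numpy as np

# Toy sizes chosen for illustration; T_out differs from n_mel_channels on purpose.
B, n_mel_channels, T_out = 2, 4, 6

# Fake mel batch laid out as [B, n_mel_channels, T_out].
mels = np.arange(B * n_mel_channels * T_out, dtype=np.float32)
mels = mels.reshape(B, n_mel_channels, T_out)

# Plain reshape: the shape is right, but the memory order is unchanged,
# so each "frame" row mixes values from different time steps.
reshaped = mels.reshape(B, 1, T_out, n_mel_channels)

# Transpose to [B, T_out, n_mel_channels] first, then add the channel axis:
# each row now holds all mel channels of a single frame.
transposed = mels.transpose(0, 2, 1)[:, np.newaxis, :, :]


def check_gst_inputs(x, n_mel_channels):
    """Hypothetical check: expects inputs shaped [B, T_out, n_mel_channels]."""
    if x.ndim != 3 or x.shape[-1] != n_mel_channels:
        raise ValueError(
            f"expected [B, T_out, {n_mel_channels}], got {list(x.shape)}"
        )
    return x


print(reshaped.shape == transposed.shape)       # same shape
print(np.array_equal(reshaped, transposed))     # different contents
```

Both arrays have the same shape, so the mismatch raises no error on its own. An explicit check on the last dimension catches inputs that still arrive as [B, n_mel_channels, T_out], provided T_out differs from n_mel_channels.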