NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

synthesized speaker quality changed #46

Open kannadaraj opened 4 years ago

kannadaraj commented 4 years ago

Thanks for sharing the repo. I trained the model on LJ Speech using this repo, and I am performing inference using only GST. During inference I use an out-of-dataset file as the style reference. The synthesized speaker quality changes a lot: the output is decent, but it doesn't sound like the original LJ Speech speaker. How can I fix this? Any help is appreciated. Thanks.

rafaelvalle commented 4 years ago

Are you selecting one style token or using a sound file to sample the style tokens?
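For context, the two conditioning modes this question distinguishes can be sketched roughly as follows. This is an illustrative NumPy-only toy, not the Mellotron API: the names (`style_tokens`, `style_from_token`, `style_from_reference`) and dimensions are assumptions for the sake of the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

# Assumed toy dimensions; the real model learns these tokens end to end.
NUM_TOKENS, TOKEN_DIM = 10, 256
rng = np.random.default_rng(0)
style_tokens = rng.standard_normal((NUM_TOKENS, TOKEN_DIM))

def style_from_token(idx, scale=0.3):
    # Mode 1: select a single learned style token directly,
    # i.e. a one-hot (scaled) weighting over the token bank.
    weights = np.zeros(NUM_TOKENS)
    weights[idx] = scale
    return weights @ style_tokens

def style_from_reference(ref_embedding):
    # Mode 2: sample the style from a reference sound file by
    # attending over all tokens with a query derived from the
    # reference encoder output (here just a random stand-in vector).
    scores = style_tokens @ ref_embedding / np.sqrt(TOKEN_DIM)
    weights = softmax(scores)
    return weights @ style_tokens
```

The distinction matters for the issue above: mode 2 inherits whatever the reference encoder extracts from the style file, so an out-of-dataset reference can shift perceived speaker identity, while mode 1 stays within the learned token space.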

kannadaraj commented 4 years ago

@rafaelvalle: Sorry for the late reply. I am training with a single-speaker database, and I am using a sample file from the same dataset.

rafaelvalle commented 4 years ago

Do the attention maps look correct?

kannadaraj commented 4 years ago

Yes, the attention maps look good: a clear diagonal line.
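As a rough sanity check, "a good diagonal" can be quantified instead of just eyeballed. A minimal sketch, assuming the alignment matrix has shape (decoder steps, encoder steps); `attention_diagonality` is a hypothetical helper, not part of this repo:

```python
import numpy as np

def attention_diagonality(attn):
    # attn: (decoder_steps, encoder_steps) alignment matrix.
    # Returns the mean normalized distance of each decoder step's
    # attention peak from the ideal diagonal (0.0 = perfectly diagonal).
    T, N = attn.shape
    peaks = attn.argmax(axis=1) / max(N - 1, 1)
    ideal = np.arange(T) / max(T - 1, 1)
    return float(np.abs(peaks - ideal).mean())
```

A value near 0 indicates monotonic, diagonal attention; larger values suggest skipped or repeated encoder steps.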

rafaelvalle commented 4 years ago

Can you share mel-spectrograms, audio files and attention plots?