NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License
853 stars 187 forks

Inference without rhythm and pitch #103

Open kngan43 opened 2 years ago

kngan43 commented 2 years ago

Hi,

I'm new to speech synthesis. I've trained a model on the EmoV-DB dataset and want to run inference using the GST part of Mellotron. I want to input arbitrary text and have it output speech with a chosen emotion.

I noticed in issue #20 that someone mentioned the rhythm and pitch create a 1:1 alignment. Can someone explain in more detail how to do inference without rhythm and pitch?