Hi,
I'm new to speech synthesis. I've trained my model on the EmoV-DB dataset and want to run inference using the GST part of Mellotron. I want to input arbitrary text and have it output speech with a chosen emotion.
I noticed in issue #20 that someone mentioned that the rhythm and pitch create a 1:1 alignment. Can someone explain in more detail how to do inference without rhythm and pitch?