NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

Help needed w.r.t. inference #90

Open nirmeshshah opened 3 years ago

nirmeshshah commented 3 years ago

Hi, I have a few doubts:

1) Is example1.wav the reference audio file whose style is to be captured when synthesizing samples in inference.ipynb? Do I need to have the text and its corresponding wav file ready in advance for inference in Mellotron? Usually I have text that I want synthesized and a reference audio of a completely different utterance whose style is to be captured. I am unable to map this onto the existing inference.ipynb. Can anyone please give some more clarity on this?

2) How do I run this model as a standalone TTS?

3) If I have trained my model on single-speaker data, how can I update the Define Speaker Set section in inference.ipynb? It seems to be written for the multispeaker case only.
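For what it's worth, here is a minimal sketch of the data flow I understand from the GST setup, with stand-in stub functions (these are NOT Mellotron's actual API; `encode_style` and `synthesize` are hypothetical placeholders). The point is that the text to synthesize and the reference audio are independent inputs: the reference mel is compressed into a fixed style embedding, and the reference transcript is only needed when pitch/rhythm are also transferred via alignment. For a single-speaker checkpoint, the only valid speaker embedding id would presumably be 0.

```python
# Hypothetical sketch, not the real Mellotron code.

def encode_style(reference_mel):
    # Stand-in for a GST reference encoder: returns a "style embedding"
    # (here just the mean of the frames, purely for illustration).
    return sum(reference_mel) / len(reference_mel)

def synthesize(text, style_embedding, speaker_id=0):
    # Stand-in for the decoder, conditioned on text, style, and speaker id.
    # With a single-speaker model, the only embedding id is assumed to be 0.
    return f"audio(text={text!r}, style={style_embedding:.3f}, speaker={speaker_id})"

ref_mel = [0.1, 0.2, 0.3]      # pretend mel frames from example1.wav
style = encode_style(ref_mel)  # style comes from the reference audio only
out = synthesize("A completely different sentence.", style, speaker_id=0)
print(out)
```

If this mental model is wrong for inference.ipynb (e.g. if the notebook always requires the reference utterance's transcript for the attention/pitch path), a clarification from the maintainers would help.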