NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License
853 stars 184 forks

Generic Inference #4

Closed CookiePPP closed 4 years ago

CookiePPP commented 4 years ago

I'm trying to perform inference without a reference file to copy style from, i.e. using this repo like a multi-speaker Tacotron 2 without GSTs. What code do I need for that?

In Tacotron 2 you only need the text input to call tacotron.inference, but I'm not sure how to do inference here where I just input text and a speaker ID.

Your notebook has two examples of "tacotron.inference_noattention" and how to build the inputs for them, but no examples of plain "tacotron.inference", and I'm having trouble working it out from the source code.

rafaelvalle commented 4 years ago

See https://github.com/NVIDIA/mellotron/blob/master/model.py#L611. It takes text, style_input (a mel or an int), speaker_ids, and f0s.
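For anyone landing here later, a minimal sketch of what that call might look like, loosely following the loading pattern from the repo's inference notebook. The checkpoint path, the style token index, and the exact f0s shape are assumptions (and text_to_sequence is called here without the optional ARPAbet dictionary the notebook passes), not a definitive recipe:

```python
import torch

from hparams import create_hparams
from model import Tacotron2
from text import text_to_sequence

hparams = create_hparams()
model = Tacotron2(hparams).cuda().eval()
# hypothetical checkpoint path; substitute your own trained or pretrained weights
model.load_state_dict(torch.load("models/mellotron_libritts.pt")["state_dict"])

# Encode the input text as in the repo's inference notebook
text = "Hello world."
sequence = torch.LongTensor(text_to_sequence(text, hparams.text_cleaners))[None, :].cuda()

speaker_id = torch.LongTensor([0]).cuda()  # an ID the model saw during training
style_input = 0                            # an int selects a GST token; a reference mel also works
f0s = torch.zeros(1, 1, 400).cuda()        # all-zero pitch contour; see the f0 discussion below

with torch.no_grad():
    mel_outputs, mel_outputs_postnet, gate_outputs, alignments = model.inference(
        (sequence, style_input, speaker_id, f0s))
```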

CookiePPP commented 4 years ago

I noticed. What should f0s be when I'm trying to generate new audio without a reference file?

blisc commented 4 years ago

You can simply pass in a vector of zeros for f0 if you want the model to predict it instead of specifying it. You'd probably have to play around with the shape of the vector, but its size should match your batch size.
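Following that advice, building the zero f0 tensor might look like the sketch below. The (batch, 1, frames) layout and the frame count are assumptions to experiment with, per the note above:

```python
import torch

batch_size = 1
n_frames = 400  # rough upper bound on decoder steps; adjust as needed

# All zeros: the model is not conditioned on any external pitch contour
f0s = torch.zeros(batch_size, 1, n_frames).cuda()
```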

CookiePPP commented 4 years ago

Alright, I'll give that a shot. Thanks