NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

Singing Voice from Music Score #53

Closed Sangkikim-77 closed 4 years ago

Sangkikim-77 commented 4 years ago

When I synthesize from a music score (MusicXML), I have to pass a "mel" as an input parameter. However, in the provided inference.ipynb, that mel input comes from the dataloader, so it has nothing to do with Haendel's Hallelujah.

Can I use any mel?

rafaelvalle commented 4 years ago

The mel is used for the Global Style Tokens.

rafaelvalle commented 4 years ago

Closing due to inactivity.

moonnee commented 4 years ago

Hi, I found that when synthesizing from Hallelujah musicxml, changing the mel input seems

> The mel is used for the Global Style Tokens.

Hi, thanks a lot for your work! I am also curious why GST is needed when synthesizing from MusicXML. We do not actually have a mel when synthesizing from MusicXML. If the task is only to synthesize from MusicXML, we do not need the GST part during training, right?

rafaelvalle commented 4 years ago

The Global Style Tokens can inject a style (screaming, whispering, etc.) into the mel-spectrogram while synthesizing speech. If you remove GST during training, you won't be able to take advantage of this at inference time.
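
To make the role of the mel concrete, here is a minimal sketch of the style-token idea in plain Python. It is not Mellotron's code: the function names, the mean-pooling "reference encoder", the single-head dot-product attention, and the toy dimensions are all assumptions chosen for illustration (the real model uses learned layers). The sketch shows that the reference mel is only summarized into a style vector through attention over a bank of tokens, so a mel of any length or content yields a valid style embedding.

```python
import math
import random


def reference_embedding(mel):
    """Hypothetical stand-in for a reference encoder: average frames over time."""
    n_frames = len(mel)
    n_mels = len(mel[0])
    return [sum(frame[i] for frame in mel) / n_frames for i in range(n_mels)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def style_embedding(mel, tokens):
    """Attend over the style tokens, using the reference embedding as the query."""
    query = reference_embedding(mel)
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, token)) / scale for token in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    style = [sum(w * token[i] for w, token in zip(weights, tokens)) for i in range(dim)]
    return style, weights


# Toy data: random values stand in for learned tokens and real mel-spectrograms.
rng = random.Random(0)
N_MELS, N_TOKENS = 8, 4
tokens = [[rng.uniform(-1, 1) for _ in range(N_MELS)] for _ in range(N_TOKENS)]
mel_a = [[rng.uniform(-1, 1) for _ in range(N_MELS)] for _ in range(50)]
mel_b = [[rng.uniform(-1, 1) for _ in range(N_MELS)] for _ in range(120)]

# Two unrelated reference mels of different lengths both give a fixed-size style vector.
style_a, weights_a = style_embedding(mel_a, tokens)
style_b, weights_b = style_embedding(mel_b, tokens)
```

Under these assumptions, swapping the reference mel changes only the token weights and the resulting style vector; the vector's size stays the same regardless of the mel's length.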