NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

Style tokens as guide rather than 1:1 transfer #20

Closed · richardburleigh closed this issue 4 years ago

richardburleigh commented 4 years ago

Thanks again for the great work on Mellotron.

The usual implementations of Global Style Tokens allow style transfer without locking the output to the reference audio's rhythm 1:1.

For example, using a 1-minute reference audio with Mellotron appears to generate a 1-minute output regardless of the text input. Other GST implementations transfer the style without locking in the rhythm/duration of the reference 1:1, e.g. synthesizing a 5-second sentence from a 1-minute reference while still keeping the 'style' of the reference audio.

Is there any change to the model to enable such a scenario?

blisc commented 4 years ago

In reference to the GST part of mellotron, there is no 1:1 lock. You can use GST the same way as in other repos.

If you want to do inference with the Mellotron model, however, we additionally extract two things from the reference audio: the rhythm and the pitch. It's the rhythm that creates the 1:1 correspondence, but the automatically extracted pitch might not make sense if you do not also condition on the rhythm.

If you don't want rhythm conditioning (which you can disable by using model.inference()) or pitch conditioning (which you can disable by sending zeros as the pitch), you essentially get Tacotron 2 with GST and speaker IDs.
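For anyone looking for the concrete call, here is a minimal sketch of that GST-only inference mode. It assumes a trained Tacotron2 model from this repo whose inference() takes a (text, style_input, speaker_ids, f0s) tuple, as in the repo's inference notebook; the tensor names and shapes below are illustrative assumptions, not the authoritative API.

```python
import torch

# Assumed setup (illustrative names): `mellotron` is a trained Tacotron2
# model from this repo in eval mode, `text_encoded` is a LongTensor of
# encoded text with shape (1, T_text), `style_mel` is a reference
# mel-spectrogram of shape (1, n_mel_channels, T_mel) used only as GST
# input, and `speaker_id` is a LongTensor of shape (1,).

with torch.no_grad():
    # An all-zero pitch contour carries no pitch information, which
    # disables pitch conditioning; calling inference() (rather than the
    # rhythm-conditioned path) drops the 1:1 rhythm lock. The f0 length
    # here is an assumption: it only needs to roughly cover the decoder
    # steps, and its values are all zero anyway.
    zero_f0 = torch.zeros(1, 1, style_mel.size(2))
    outputs = mellotron.inference(
        (text_encoded, style_mel, speaker_id, zero_f0))
    mel_outputs_postnet = outputs[1]  # assumed output ordering, as in Tacotron 2
```

With both conditioning signals removed, the style token embedding is the only thing transferred from the reference, so the output duration is governed by the input text and the gate prediction rather than the reference length.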

richardburleigh commented 4 years ago

Thank you @blisc for the quick reply - much appreciated!

kannadaraj commented 4 years ago

thanks @blisc

kannadaraj commented 4 years ago

@blisc I have a question along similar lines. I have trained the model on LJ Speech using this repo. During inference I use an out-of-dataset file as the style file, and the synthesized voice changes very much: the quality is decent, but it doesn't sound like the original LJ Speech speaker. How do I fix that? Can you please help? Thanks.