NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

Singing voice from audio signal #38

Closed: tebin closed this issue 4 years ago

tebin commented 4 years ago

The inference notebook has examples of rhythm/pitch transfer and of singing voice from a music score, but there is no explanation of how to synthesize singing voice from an audio signal, which is demonstrated on https://nv-adlr.github.io/Mellotron. Is this feature not supported?

rafaelvalle commented 4 years ago

The feature is supported: singing voice from audio signal is similar to rhythm and pitch transfer.
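
Roughly, that means reusing the rhythm/pitch transfer path with the singing clip as the source audio. Below is a minimal sketch, assuming the module layout and helper names of inference.ipynb (load_model, TextMelLoader, TextMelCollate, inference_noattention) and a (text, mel, speaker_id, f0) item ordering in TextMelLoader; exact imports and indices may need adjusting against the current repo:

```python
import torch
from hparams import create_hparams
from model import load_model            # assumed import path, as in inference.ipynb
from data_utils import TextMelLoader, TextMelCollate
from text import cmudict, text_to_sequence

hparams = create_hparams()
mellotron = load_model(hparams).cuda().eval()
mellotron.load_state_dict(torch.load('models/mellotron_libritts.pt')['state_dict'])

# The singing clip is listed like any other source audio, e.g. a filelist line
# "data/singing_clip.wav|lyrics of the clip|0" (the speaker id column is arbitrary here).
arpabet_dict = cmudict.CMUDict('data/cmu_dictionary')
dataloader = TextMelLoader('data/examples_filelist.txt', hparams)
datacollate = TextMelCollate(1)

file_idx = 0
audio_path, text, sid = dataloader.audiopaths_and_text[file_idx]

# Source text, style mel and pitch contour come straight from the singing clip.
text_encoded = torch.LongTensor(
    text_to_sequence(text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()
mel = dataloader[file_idx][1][None].cuda()            # assumed index 1 = mel
pitch_contour = dataloader[file_idx][3][None].cuda()  # assumed index 3 = f0

# Run the source clip through the model once to extract its rhythm (attention map).
x, _ = mellotron.parse_batch(datacollate([dataloader[file_idx]]))
with torch.no_grad():
    _, _, _, rhythm = mellotron.forward(x)
    rhythm = rhythm.permute(1, 0, 2)

# Synthesize with a target speaker id that exists in the training set.
speaker_id = torch.LongTensor([40]).cuda()  # placeholder target speaker
with torch.no_grad():
    mel_outputs, mel_outputs_postnet, gate_outputs, _ = mellotron.inference_noattention(
        (text_encoded, mel, speaker_id, pitch_contour, rhythm))
```

The resulting mel_outputs_postnet can then be vocoded with WaveGlow, as in the notebook.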

tebin commented 4 years ago

Thanks for the reply. Now that I know I have to expand upon the rhythm/pitch transfer example, I have some follow-up questions.

  1. The paper states that "the target speaker, St, would always be found in the training set, while the source text, pitch and rhythm (Ts, Ps, Rs) could be from outside the training set," so I presume there is no need for speaker ids for source audio; after all, it doesn't make sense for some arbitrary input audio outside the training set to have a valid speaker id. However, examples_filelist.txt has a column for speaker ids. What is the significance of this column?

  2. The demo website has an example of synthesizing an Indian song using Mellotron LJS and Mellotron Sally. How is it possible to feed Indian text into the models that were trained on English datasets?

rafaelvalle commented 4 years ago

  1. The model expects a speaker id, so we give it a random speaker id.

  2. The model is also trained on phoneme representations (ARPAbet). During inference, one can choose ARPAbet symbols that match the phonemes in the source audio (a sketch of both points follows below).
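
A rough sketch of both points, assuming the repo's audio_path|text|speaker_id filelist format and the text_to_sequence/cmudict helpers; the ARPAbet spelling shown is invented purely for illustration:

```python
import random
import torch
from text import cmudict, text_to_sequence

# 1. The speaker id column for an out-of-training-set source audio is just a placeholder;
#    the output voice is controlled by the target speaker id chosen at inference time.
#    Example filelist line (assumed format): data/source_song.wav|{...ARPAbet text...}|0
source_speaker_id = random.randint(0, 9)  # arbitrary value, irrelevant for the output voice

# 2. For text in another language, write the source text directly as ARPAbet phonemes
#    (curly braces switch text_to_sequence into phoneme mode), chosen by ear to match
#    the phonemes heard in the source audio.
arpabet_dict = cmudict.CMUDict('data/cmu_dictionary')
lyrics = "{N AH0 M AH0 S T EY1} {M IY1 R AA0}"  # made-up spelling, illustration only
text_encoded = torch.LongTensor(
    text_to_sequence(lyrics, ['english_cleaners'], arpabet_dict))[None, :]
```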
