NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

How to choose a target speaker for generating voice #74

Closed deepuvikraman closed 4 years ago

deepuvikraman commented 4 years ago

Hi, I am using a LibriTTS pre-trained model to generate speech from custom text at inference time. As a second input, I gave a reference audio file (wav) and its corresponding transcript:

with torch.no_grad():
    mel_outputs, mel_outputs_postnet, gate_outputs, _ = mellotron.inference(
        (text_encoded, mel, speaker_id, pitch_contour))

I am able to generate a voice, and the style from the reference audio is applied. However, I always get the voice of a female speaker. How can I choose a specific speaker ID so that the generated speech uses that speaker's voice?

Thanks, Deepu

rafaelvalle commented 4 years ago

Yes, you can certainly choose a specific speaker_id.

deepuvikraman commented 4 years ago

Thanks for the reply, @rafaelvalle. But how do I choose a speaker_id? I tried giving the speaker ID in "examples_filelist.txt" along with the reference style audio and its transcript, but the generated voice is always the same regardless of the speaker ID. For example: "data/example1.wav|exploring the expanses of space to keep our planet safe|100"

rafaelvalle commented 4 years ago

To select the speaker during inference, you need to change speaker_id in the notebook. The speaker ID in examples_filelist.txt only determines which speaker will be used to extract the token durations.
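
As a rough illustration of what changing speaker_id might involve, here is a minimal sketch. It assumes the model indexes speakers by a contiguous integer derived from the sorted raw speaker IDs in the training filelist; that mapping is an assumption, not something confirmed in this thread. The helper names build_speaker_lookup and select_speaker, and the sample filelist lines, are hypothetical.

```python
def build_speaker_lookup(filelist_lines):
    """Map raw speaker IDs from 'path|text|speaker' lines to contiguous indices.

    Assumption: indices follow the sorted order of the unique raw IDs.
    """
    raw_ids = sorted(
        {int(line.strip().split("|")[2]) for line in filelist_lines if line.strip()}
    )
    return {raw: idx for idx, raw in enumerate(raw_ids)}


def select_speaker(lookup, raw_speaker_id):
    """Return the model-side index for a raw speaker ID, or fail loudly."""
    if raw_speaker_id not in lookup:
        raise KeyError(f"speaker {raw_speaker_id} is not in the training filelist")
    return lookup[raw_speaker_id]


# Hypothetical training filelist entries (path|transcript|raw speaker ID).
train_lines = [
    "wavs/a.wav|some text|40",
    "wavs/b.wav|more text|100",
    "wavs/c.wav|other text|19",
]

lookup = build_speaker_lookup(train_lines)
model_speaker_index = select_speaker(lookup, 100)

# In the notebook, this index would then replace the existing speaker_id
# before the inference call, for example:
#   speaker_id = torch.LongTensor([model_speaker_index]).cuda()
```

The explicit KeyError is there so that a speaker the model never saw during training is rejected instead of silently falling back to some other voice.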

deepuvikraman commented 4 years ago

Thanks, @rafaelvalle. Let me try this out.

GauriDhande commented 4 years ago

@deepuvikraman, were you able to find out which line to change in the notebook to select a different speaker?