NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

Synthesized voice does not correspond to the speaker id #54

Closed paarthneekhara closed 4 years ago

paarthneekhara commented 4 years ago

I was using the inference notebook with the pre-trained models and noticed that the synthesized audio does not always correspond to the speaker id: for many male speakers, the output still sounds like a female speaker. I tried both the inference and inference_noattention functions. Has anyone else run into this?
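For context, a minimal sketch of the notebook-style call and where the speaker id enters it (variable names such as text_encoded, mel, and pitch_contour are assumptions standing in for the notebook's setup, not verbatim repo code):

```python
import torch

# text_encoded, mel, and pitch_contour are prepared earlier in the notebook;
# speaker_id is an index into the model's speaker embedding table.
speaker_id = torch.LongTensor([0]).cuda()

with torch.no_grad():
    # inputs: (text, reference mel, speaker id, f0 contour)
    mel_outputs, mel_outputs_postnet, gate_outputs, alignments = mellotron.inference(
        (text_encoded, mel, speaker_id, pitch_contour))
```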

Sangkikim-77 commented 4 years ago

Hi,

Training from a pre-trained model can lead to faster convergence. By default, the speaker embedding layer is ignored when warm-starting.
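A sketch of what "ignored" means here, in the style of the Tacotron2/Mellotron warm-start logic (the exact key name in ignore_layers, e.g. 'speaker_embedding.weight', is an assumption):

```python
import torch

def warm_start_model(checkpoint_path, model, ignore_layers):
    """Load a checkpoint but drop the listed layers so they re-initialize."""
    checkpoint_dict = torch.load(checkpoint_path, map_location='cpu')
    state_dict = checkpoint_dict['state_dict']
    if ignore_layers:
        # e.g. ignore_layers = ['speaker_embedding.weight'] (assumed key name)
        state_dict = {k: v for k, v in state_dict.items()
                      if k not in ignore_layers}
        model_dict = model.state_dict()
        model_dict.update(state_dict)
        state_dict = model_dict
    model.load_state_dict(state_dict)
    return model
```

So after a warm start, the speaker embedding weights are freshly initialized rather than copied from the checkpoint, which is why the ids may not line up with the pre-trained voices.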

mr-muyu commented 4 years ago

The pre-trained model does have a speaker embedding; you can load the model and see that layer. But the output does seem to be quite pitch/rhythm dependent. You can try extracting pitch and rhythm from a different wav to test this.
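For the pitch part, a hedged sketch of extracting an f0 contour from a reference wav. Mellotron ships its own YIN implementation (yin.py); this uses librosa's pYIN instead for illustration, and the fmin/fmax bounds are assumptions. Rhythm, by contrast, comes from the attention alignment obtained by running the model teacher-forced on the reference, so it is not extracted from the wav directly:

```python
import librosa
import numpy as np

# Load the reference audio at Mellotron's expected sampling rate.
audio, sr = librosa.load('reference.wav', sr=22050)

# Frame-level f0 estimation with pYIN; bounds are illustrative choices.
f0, voiced_flag, voiced_probs = librosa.pyin(
    audio,
    fmin=librosa.note_to_hz('C2'),   # ~65 Hz floor
    fmax=librosa.note_to_hz('C6'))   # ~1047 Hz ceiling

# pYIN returns NaN for unvoiced frames; zero them out, matching the
# convention of a zero f0 in unvoiced regions.
f0 = np.nan_to_num(f0)
```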

paarthneekhara commented 4 years ago

Never mind, there was a bug on my end in how the speaker dictionary was loaded for inference. That said, for some speakers the synthesized voice still does not quite match the data, possibly because those speakers had fewer samples during training.
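For anyone hitting the same thing, a sketch of the speaker-lookup idea (modeled on create_speaker_lookup_table in Mellotron's data_utils.py; the details and the filelist path here are assumptions). If inference builds this table from a different filelist than training, the same raw speaker id silently maps to a different embedding row, and you get the wrong voice:

```python
def create_speaker_lookup_table(audiopaths_and_text):
    # each filelist row is: audio_path|transcript|speaker_id
    speaker_ids = sorted({row[2] for row in audiopaths_and_text})
    # map each raw speaker id to a dense embedding index
    return {sid: i for i, sid in enumerate(speaker_ids)}

# 'filelists/train.txt' is a hypothetical path; use the training filelist.
with open('filelists/train.txt') as f:
    rows = [line.strip().split('|') for line in f]

speaker_lookup = create_speaker_lookup_table(rows)
```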

deepuvikraman commented 4 years ago

@paarthneekhara - I am facing a similar issue. I am using the LibriTTS pre-trained model and trying to generate speech for custom text using a reference style wav file. Although I specify a different speaker_id in example_filelists.txt (along with the style wav and its transcript), the generated voice is always the same female voice. Do you know how to synthesize a different speaker that is present in the pre-trained model?
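One thing worth checking, based on the speaker-dictionary bug mentioned above (this is a guess, not a confirmed fix): the speaker id written in the filelist (e.g. a LibriTTS speaker number) is not itself the embedding index; it has to be mapped through the same lookup table the model was trained with, e.g. the speaker_lookup sketch from the earlier comment:

```python
import torch

# raw_id is a hypothetical LibriTTS speaker number from the filelist;
# speaker_lookup is the (assumed) training-time lookup table shown above.
raw_id = '40'
speaker_id = torch.LongTensor([speaker_lookup[raw_id]]).cuda()
```

If the raw number is passed straight into the embedding layer instead, many ids will collapse onto the same few rows, which could explain always hearing the same female voice.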