NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

Adding another speaker #61

Closed JakubReha closed 4 years ago

JakubReha commented 4 years ago

I am trying to fine-tune the pre-trained LibriTTS model with one additional speaker. I added around 15 minutes of audio from this speaker to the train-clean-100 dataset, added the transcriptions to the filelists with an 85:15 train:validation split, and increased the number of speakers to 124 in hparams.py. All the audio files were also resampled to 22,050 Hz, 16 bit. But when I run inference on the checkpoints I get only noise for all the speakers.
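
For reference, a minimal preprocessing sketch along these lines, using librosa and soundfile with placeholder paths (these are assumptions for illustration, not code from the Mellotron repo):

```python
# Hypothetical preprocessing sketch: resample new-speaker audio to match
# the LibriTTS training data (22,050 Hz, 16-bit PCM). Paths are placeholders.
import os
import librosa
import soundfile as sf

SRC_DIR = "new_speaker_raw"      # assumed input directory
DST_DIR = "train-clean-100/999"  # assumed output directory (new speaker id)
TARGET_SR = 22050

os.makedirs(DST_DIR, exist_ok=True)
for name in os.listdir(SRC_DIR):
    if not name.endswith(".wav"):
        continue
    audio, _ = librosa.load(os.path.join(SRC_DIR, name), sr=TARGET_SR, mono=True)
    # PCM_16 forces 16-bit output regardless of the source bit depth
    sf.write(os.path.join(DST_DIR, name), audio, TARGET_SR, subtype="PCM_16")
```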

[Screenshots: inference output, 2020-05-08 12:47 and 12:48]
CookiePPP commented 4 years ago

Check that the files are definitely 16 bit and have a similar volume to the other speakers. The source mel should have more detail, like the attached example.
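
A quick sanity-check sketch for that, assuming soundfile and numpy are available (the file paths are hypothetical):

```python
# Sanity-check sketch (not from the repo): print bit depth (subtype) and
# RMS level of a new-speaker file next to an existing LibriTTS file.
import numpy as np
import soundfile as sf

def describe(path):
    audio, sr = sf.read(path)
    info = sf.info(path)
    rms = float(np.sqrt(np.mean(np.square(audio))))
    print(f"{path}: sr={sr}, subtype={info.subtype}, rms={rms:.4f}")

describe("train-clean-100/999/new_speaker_sample.wav")  # hypothetical new speaker
describe("train-clean-100/103/existing_sample.wav")     # hypothetical existing speaker
```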

JakubReha commented 4 years ago

They are 16 bit and the volume of the extra speaker is slightly higher, but the source mel stays the same regardless of which speaker I select.

CookiePPP commented 4 years ago

@JakubReha The audio file path used for the source mel is printed here. Would you be able to upload and/or check that audio file?
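
As a hedged sketch of how to inspect that file, roughly following the mel loading in the inference notebook (the path is hypothetical, and the exact helper names in the notebook may differ):

```python
# Sketch: compute and plot the mel-spectrogram of a single audio file with
# Mellotron's TacotronSTFT. The source mel depends only on the audio file,
# not on the speaker id chosen at inference time.
import torch
import librosa
import matplotlib.pyplot as plt
from hparams import create_hparams
from layers import TacotronSTFT

hparams = create_hparams()
stft = TacotronSTFT(
    hparams.filter_length, hparams.hop_length, hparams.win_length,
    hparams.n_mel_channels, hparams.sampling_rate,
    hparams.mel_fmin, hparams.mel_fmax)

path = "train-clean-100/999/new_speaker_sample.wav"  # hypothetical path
audio, sr = librosa.load(path, sr=hparams.sampling_rate)
# clamp to [-1, 1] since mel_spectrogram asserts normalized input
audio = torch.clamp(torch.FloatTensor(audio), -1.0, 1.0).unsqueeze(0)
mel = stft.mel_spectrogram(audio)

plt.imshow(mel.squeeze(0).numpy(), aspect="auto", origin="lower")
plt.title("Source mel")
plt.show()
```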

rafaelvalle commented 4 years ago

@JakubReha The mel-spectrogram looks suspicious. Can you share an audio file?

rafaelvalle commented 4 years ago

Closing due to inactivity.