
NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data

BSD 3-Clause "New" or "Revised" License

855 stars, 183 forks

Adaptation #39

Open karkirowle opened 4 years ago

karkirowle commented 4 years ago

I've been trying to run some adaptation experiments with Mellotron, i.e. using a small amount of data (less than an hour) to shift the acoustics of an existing speaker towards a different speaker. The idea is that even without a large amount of data from a given male or female singer, it should be possible to move the acoustics by retraining with a similar speaker's id.

My experiments haven't been successful so far. Interestingly, I found that even the other speakers are affected during adaptation, and the output quickly becomes unintelligible.

Have you tried something like that? What layers should be ignored for adaptation?

rafaelvalle commented 4 years ago

I assume you're training on a single speaker. Instead, train on all LibriTTS + your own data.
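Combining two datasets usually means merging their filelists without letting speaker ids collide. The following is a minimal sketch of that step, assuming each filelist line has the form `path|text|speaker_id` with an integer speaker id; that layout and the helper name `merge_filelists` are assumptions for illustration, not something confirmed in this thread.

```python
def merge_filelists(base_lines, new_lines):
    """Merge two filelists, moving the new speaker ids past the base ones.

    Assumes each line looks like "path|text|speaker_id" (hypothetical layout).
    """
    def parse(line):
        # Split from both ends so a "|" inside the text does not break parsing.
        path, rest = line.rstrip("\n").split("|", 1)
        text, speaker = rest.rsplit("|", 1)
        return path, text, int(speaker)

    base = [parse(line) for line in base_lines if line.strip()]
    new = [parse(line) for line in new_lines if line.strip()]

    # First free id after the largest id used by the base dataset.
    offset = max((speaker for _, _, speaker in base), default=-1) + 1

    # Give each new speaker a fresh id, in order of first appearance.
    remap = {}
    for _, _, speaker in new:
        remap.setdefault(speaker, offset + len(remap))

    merged = base + [(path, text, remap[speaker]) for path, text, speaker in new]
    return ["{}|{}|{}".format(path, text, speaker) for path, text, speaker in merged]
```

The base dataset keeps its ids unchanged, so only the smaller, newly added dataset is renumbered.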

aishweta commented 3 years ago

I have a multi-speaker dataset (both male and female speakers):

Total speakers = 21
Total size of data = 13.36 hours
Total audio clips (each between 3 and 10 seconds long) = 14453

Each speaker has recorded around 40 minutes of audio.

I would like to train a multi-speaker model on this data using this repo.

Should I fine-tune from the existing LibriTTS pretrained weights, or train from scratch on all of LibriTTS plus my data?

Any comments, @rafaelvalle?

aishweta commented 3 years ago

Hi @rafaelvalle, could you please answer these questions?

karkirowle commented 3 years ago

@shwetagargade216 I would train with both if you are training with LibriTTS. You are going to train a multi-speaker setup anyway, so more data should only help in that case. If your data were in a different language, I would maybe try only your data. But these are things you usually cannot know in advance. If you train with both, make sure the format is the same, e.g. the sampling frequency.
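A quick way to check that the format matches is to scan every file's sampling rate before training. This is a rough sketch using Python's standard `wave` module, which only reads uncompressed PCM WAV files. The helper name `find_mismatched_sample_rates` is made up for illustration, and the expected rate is whatever your training configuration uses.

```python
import wave


def find_mismatched_sample_rates(paths, expected_rate):
    """Return (path, rate) pairs for WAV files whose rate differs from expected_rate."""
    mismatched = []
    for path in paths:
        # wave.open reads only the header here; no audio frames are decoded.
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
        if rate != expected_rate:
            mismatched.append((path, rate))
    return mismatched
```

Any file reported here would need resampling before it is mixed with the rest of the training data.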

aishweta commented 3 years ago

@karkirowle Training from scratch might be time-consuming and costly, so I would like to try transfer learning from LibriTTS first.

And my dataset is in English.
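For transfer learning with a different set of speakers, one common approach is to load the pretrained weights but skip the layers whose shape depends on the number of speakers, so those start from a fresh initialization. Below is a minimal sketch of the filtering step on a plain name-to-weight mapping. The helper `filter_state_dict` and the layer names in the usage note are illustrative assumptions, not taken from this thread.

```python
def filter_state_dict(state_dict, ignore_layers):
    """Return a copy of state_dict without entries matching any ignored prefix.

    state_dict: mapping of parameter name -> weights (e.g. from a checkpoint).
    ignore_layers: parameter-name prefixes to drop (hypothetical names).
    """
    return {
        name: value
        for name, value in state_dict.items()
        if not any(name.startswith(prefix) for prefix in ignore_layers)
    }
```

With PyTorch, the filtered mapping can then be passed to `model.load_state_dict(filtered, strict=False)`, which tolerates the missing keys. For example, ignoring a prefix such as `speaker_embedding` would leave the speaker table to be learned for the new speakers while the rest of the network starts from the pretrained weights.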