NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

The distinction between different speakers with a Mandarin dataset is not obvious. #31

Open chynphh opened 4 years ago

chynphh commented 4 years ago

When using multi-speaker data, the model cannot distinguish between male and female voices, and there are only slight differences between different speakers. The current training step is 32k. Is this normal? The language is Mandarin.
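One way to probe whether the model has actually learned distinct speakers is to compare its speaker embeddings directly. Below is a minimal sketch, assuming a Mellotron-style checkpoint whose model stores per-speaker vectors in an `nn.Embedding` named `speaker_embedding` (the checkpoint path and layer name are assumptions based on Tacotron 2 GST-style models, not confirmed in this thread):

```python
import torch
import torch.nn.functional as F

# Load a trained checkpoint (the path is hypothetical).
ckpt = torch.load("checkpoint_32000.pt", map_location="cpu")

# Assumption: per-speaker vectors live in an nn.Embedding called
# `speaker_embedding`, as in Tacotron 2 GST-style models.
emb = ckpt["state_dict"]["speaker_embedding.weight"]  # (n_speakers, dim)

# Pairwise cosine similarity between speaker vectors; values near 1.0
# for every pair suggest the embeddings have not separated yet.
sim = F.cosine_similarity(emb.unsqueeze(1), emb.unsqueeze(0), dim=-1)
print(sim)
```

If all off-diagonal similarities stay close to 1.0 at 32k steps, the symptom described above (speakers sounding alike) is consistent with the embeddings not yet having diverged.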

rafaelvalle commented 4 years ago

Is the loss still going down, and is the model not overfitting, i.e., is the generalization error not increasing?
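For concreteness, here is a minimal sketch of the check being suggested: flag overfitting when validation loss trends up while training loss keeps trending down. The loss lists are stand-ins for whatever your own training logs record:

```python
def is_overfitting(train_losses, val_losses, window=5):
    """Return True if validation loss rises over the last `window`
    checkpoints while training loss keeps falling."""
    if len(val_losses) < 2 * window or len(train_losses) < 2 * window:
        return False
    recent_val = sum(val_losses[-window:]) / window
    earlier_val = sum(val_losses[-2 * window:-window]) / window
    recent_train = sum(train_losses[-window:]) / window
    earlier_train = sum(train_losses[-2 * window:-window]) / window
    return recent_val > earlier_val and recent_train < earlier_train
```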

gongchenghhu commented 3 years ago

I also use multi-speaker data: half of the Biaobei corpus, a private male dataset (5,000 sentences), and four other small private datasets, 6 speakers in total. When I train Mellotron on this data, I can't get a good alignment; it looks like the figure below.

[attention alignment figure attached in the original issue]

But when I train on only the single-speaker Biaobei dataset, the alignment is good. So I'd like to know: what does your Mandarin multi-speaker data look like, and how many speakers does it have? Are 6 speakers enough? @chynphh
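For readers hitting the same problem, one way to inspect alignments like the one described is to plot the attention matrix produced at inference time. A hedged sketch, assuming the model's inference pass returns attention alignments shaped (batch, decoder_steps, encoder_steps), as in the Tacotron 2 family (variable names and the output path are illustrative):

```python
import matplotlib.pyplot as plt

# `alignments` is assumed to come from the model's inference pass, e.g.
# _, _, _, alignments = model.inference(inputs)
def plot_alignment(alignments, index=0, path="alignment.png"):
    # Transpose so encoder steps run along the y-axis.
    attn = alignments[index].detach().cpu().numpy().T
    plt.figure(figsize=(6, 4))
    plt.imshow(attn, aspect="auto", origin="lower", interpolation="none")
    plt.xlabel("Decoder timestep")
    plt.ylabel("Encoder timestep")
    plt.title("Attention alignment (a clean diagonal indicates success)")
    plt.savefig(path)
    plt.close()
```

A healthy alignment shows a roughly monotonic diagonal band; diffuse or scattered attention mass, as reported here for the 6-speaker run, usually means the attention has not converged.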