TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

About multi-speaker datasets and tacotron2... #644

Closed samuel-lunii closed 2 years ago

samuel-lunii commented 3 years ago

Hi! I would like to ask a few questions about multi-speaker datasets. This thread gives good insight into what is needed for transfer learning with a short duration for each speaker in a multi-speaker dataset using FastSpeech.

My questions are about Tacotron2:

- What is the minimal total duration needed to train a good multi-speaker model from scratch?
- What is the minimal duration needed for each speaker in the dataset?
- Would it be good practice to use large durations (e.g. 1 or 2 speakers with more than 20 hours each) in combination with shorter ones (e.g. several speakers with about 1 hour each) to train a single multi-speaker model? Or is it better to perform transfer learning with short durations from a model trained on large durations?

Thanks!

dathudeptrai commented 3 years ago

@ZDisket do you have any ideas?

ZDisket commented 3 years ago

@dathudeptrai @samuel-lunii I haven't had much success training multi-speaker Tacotron2: the attention mechanism becomes unable to learn alignment in the multi-speaker setting, even on datasets that are large, clean, and train successfully on FastSpeech2, so I would suggest using FastSpeech2 for multi-speaker instead. Maybe we can fix the problem of fixed output durations by replacing the regular duration predictor in FastSpeech2 with the stochastic duration predictor from VITS, which produces variation even for the same input; see the sketch at the end of this comment. I'll try to answer every question.

What is the minimal total duration needed to train a good multi-speaker model from scratch?

I would say about 15 to 20 hours in total.

What is the minimal duration needed for each speaker in the dataset?

In my limited experiments, the speakers that perform well have at least 30 minutes each.

Would it be good practice to use large durations (e.g. 1 or 2 speakers with more than 20 hours each) in combination with shorter ones (e.g. several speakers with about 1 hour each) to train a single multi-speaker model? Or is it better to perform transfer learning with short durations from a model trained on large durations?

You should include the speakers with little data in the big model from the start. If you fine-tune with a different number of speakers, the speaker embedding layer has to be dropped because of the shape mismatch, and that will mess up the new model.
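
To illustrate the stochastic-duration idea mentioned above, here is a toy sketch. It is not the flow-based predictor from VITS, just a Gaussian-sampling simplification with made-up layer sizes, to show how the same input can yield different durations on each call:

```python
import tensorflow as tf


class ToyStochasticDurationPredictor(tf.keras.layers.Layer):
    """Toy duration predictor: predicts a mean and log-variance per token and
    samples a log-duration, so repeated calls give different durations for the
    same input. (The real VITS predictor uses normalizing flows; this is only
    an illustration of the idea.)"""

    def __init__(self, hidden_dim=256, **kwargs):
        super().__init__(**kwargs)
        self.conv = tf.keras.layers.Conv1D(hidden_dim, 3, padding="same", activation="relu")
        self.mean = tf.keras.layers.Dense(1)
        self.log_var = tf.keras.layers.Dense(1)

    def call(self, encoder_hidden):
        # encoder_hidden: [batch, text_len, hidden]
        h = self.conv(encoder_hidden)
        mu = self.mean(h)                      # mean log-duration per token
        log_var = self.log_var(h)
        eps = tf.random.normal(tf.shape(mu))   # the sampling step is what makes it stochastic
        log_dur = mu + tf.exp(0.5 * log_var) * eps
        # round to integer frame counts, at least one frame per token
        durations = tf.maximum(tf.cast(tf.round(tf.exp(log_dur)), tf.int32), 1)
        return tf.squeeze(durations, -1)       # [batch, text_len]
```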

samuel-lunii commented 3 years ago

@ZDisket Thanks for your answers.

I haven't had much success training multi-speaker Tacotron2: the attention mechanism becomes unable to learn alignment in the multi-speaker setting, even on datasets that are large, clean, and train successfully on FastSpeech2, so I would suggest using FastSpeech2 for multi-speaker instead.

So I guess your answers are about FastSpeech2? I actually want to use Tacotron2 for expressive speech synthesis.

As mentioned in #628, I think it could be worth adapting the Tacotron2 model in a way similar to the prosody paper (also used in the GST paper), where the speaker embedding is broadcast to match the input text sequence length and then concatenated to the encoder output. I got decent results doing this with keithito's repo. I will let you know if I have any success doing it with this one :)
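
For reference, the broadcast-and-concatenate scheme I mean looks roughly like this (a sketch with assumed tensor names and shapes, not code from either repo):

```python
import tensorflow as tf


def concat_speaker_embedding(encoder_outputs, speaker_embedding):
    """Broadcast a per-utterance speaker embedding along the text axis and
    concatenate it to the Tacotron2 encoder outputs (GST-paper style).

    encoder_outputs:   [batch, text_len, enc_dim]
    speaker_embedding: [batch, spk_dim]
    returns:           [batch, text_len, enc_dim + spk_dim]
    """
    text_len = tf.shape(encoder_outputs)[1]
    # [batch, spk_dim] -> [batch, 1, spk_dim] -> [batch, text_len, spk_dim]
    tiled = tf.tile(tf.expand_dims(speaker_embedding, 1), [1, text_len, 1])
    return tf.concat([encoder_outputs, tiled], axis=-1)
```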

By the way, @dathudeptrai @ZDisket
I see that the speaker embeddings are used both before and after the encoder. Is there a particular reason for this? Have you tried using only one or the other?

dathudeptrai commented 3 years ago

@samuel-lunii

I see that the speaker embeddings are used both before and after the encoder. Is there a particular reason for this? Have you tried using only one or the other?

Add before ->>> speaker-dependent encoder
Add after ->>> speaker-dependent decoder

You can think of it like a ResNet. You should use the speaker embeddings in both the encoder and decoder phases :D
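
Roughly, something like this (just a sketch; `encoder`, `enc_proj`, `dec_proj` and the shapes are assumptions, not the actual TensorFlowTTS Tacotron2 code):

```python
import tensorflow as tf


def condition_encoder_and_decoder(char_embeddings, encoder, speaker_embedding,
                                  enc_proj, dec_proj):
    """Add a projected speaker embedding both before the encoder (speaker-
    dependent encoder) and after it (speaker-dependent decoder input),
    a bit like a residual connection around the encoder.

    char_embeddings:    [batch, text_len, emb_dim]
    speaker_embedding:  [batch, spk_dim]
    encoder:            any layer mapping [batch, text_len, emb_dim] -> [batch, text_len, enc_dim]
    enc_proj, dec_proj: Dense layers mapping spk_dim -> emb_dim and spk_dim -> enc_dim
    """
    spk = tf.expand_dims(speaker_embedding, 1)        # [batch, 1, spk_dim]
    encoder_inputs = char_embeddings + enc_proj(spk)  # speaker-dependent encoder
    encoder_outputs = encoder(encoder_inputs)
    decoder_inputs = encoder_outputs + dec_proj(spk)  # speaker-dependent decoder
    return decoder_inputs
```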

ZDisket commented 3 years ago

@samuel-lunii

I actually want to use Tacotron2 for expressive speech synthesis.

Define "expressive". FastSpeech2 is good at expressiveness; it's just the deterministic duration predictions that are the problem. I've had success adding an emotion embedding (a copy of the speaker embedding) when training on a multi-speaker, multi-emotion dataset labelled by speaker and emotion.
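
The emotion-embedding setup I mean is roughly this (a sketch with made-up names; it assumes the embedding size matches the encoder output size):

```python
import tensorflow as tf


class SpeakerEmotionConditioning(tf.keras.layers.Layer):
    """Look up a speaker embedding and an emotion embedding (same size, used
    the same way as the speaker one) and add both to the encoder outputs."""

    def __init__(self, num_speakers, num_emotions, embed_dim=256, **kwargs):
        super().__init__(**kwargs)
        self.speaker_embed = tf.keras.layers.Embedding(num_speakers, embed_dim)
        self.emotion_embed = tf.keras.layers.Embedding(num_emotions, embed_dim)

    def call(self, encoder_outputs, speaker_ids, emotion_ids):
        # encoder_outputs: [batch, text_len, embed_dim]
        # speaker_ids, emotion_ids: [batch]
        spk = tf.expand_dims(self.speaker_embed(speaker_ids), 1)  # [batch, 1, embed_dim]
        emo = tf.expand_dims(self.emotion_embed(emotion_ids), 1)  # [batch, 1, embed_dim]
        return encoder_outputs + spk + emo
```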

samuel-lunii commented 3 years ago

@ZDisket Ok, good to know, thanks :) I will definitely have a go at multi-speaker FastSpeech2. However, I would like to use a GST-like architecture in order to avoid emotion labelling...

ZDisket commented 3 years ago

@samuel-lunii

to avoid emotion labelling...

Someone I know used a model that automatically detects sentiment from text and used its output as the emotion embedding.
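
Something along these lines (just a sketch; the sentiment model is a placeholder for whatever text classifier you use):

```python
import tensorflow as tf


def emotion_embedding_from_sentiment(sentiment_logits, projection):
    """Turn a text sentiment classifier's output into an 'emotion embedding'
    that can stand in for a learned emotion lookup table.

    sentiment_logits: [batch, num_classes] from any text sentiment model
    projection:       Dense layer mapping num_classes -> embed_dim
    """
    probs = tf.nn.softmax(sentiment_logits, axis=-1)
    return projection(probs)  # [batch, embed_dim], used like a speaker/emotion embedding
```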

samuel-lunii commented 3 years ago

@ZDisket Yes, I was planning on doing something similar too :) I will follow your advice and try all this with FS2.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.