Tomiinek / Multilingual_Text_to_Speech

An implementation of Tacotron 2 that supports multilingual experiments with parameter-sharing, code-switching, and voice cloning.
MIT License

[Question] Model Capacity #29

Closed sooftware closed 3 years ago

sooftware commented 3 years ago

Wouldn't model performance be better if we increased the number of model parameters?
Have you ever done an experiment like this?

Tomiinek commented 3 years ago

Hello!

Yes, I have done some preliminary experiments, but I only remember general impressions and do not have any measurements.

One language: I have experimented with changing the sizes of the encoder and the decoder when training just on LJ Speech (because of training time and limited resources). First, enlarging the decoder or the encoder did not improve anything. Reducing the sizes of the modules (both, or separately, to half or less) had negative effects. My conclusion is that the default Tacotron parameters are well tuned and okay. Note that the decoder is larger than the encoder (I do not remember the ratio, but it could be something like 4-5 times).
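
For illustration, here is a rough sketch of why the decoder holds most of the parameters. The module names are made up and the widths are just the usual Tacotron 2 defaults (512-dim encoder, 1024-dim decoder LSTMs), not the actual code of this repository:

```python
import torch.nn as nn

# Rough sketch with made-up module names and typical Tacotron 2 widths;
# only used here to compare parameter counts of the two modules.
class TinyEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.embedding = nn.Embedding(100, dim)                  # character embedding
        self.convs = nn.Sequential(*[nn.Conv1d(dim, dim, kernel_size=5, padding=2) for _ in range(3)])
        self.blstm = nn.LSTM(dim, dim // 2, bidirectional=True)  # 256 units per direction

class TinyDecoder(nn.Module):
    def __init__(self, enc_dim=512, dim=1024, n_mels=80):
        super().__init__()
        self.prenet = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU(),
                                    nn.Linear(256, 256), nn.ReLU())
        self.attention_rnn = nn.LSTMCell(256 + enc_dim, dim)
        self.decoder_rnn = nn.LSTMCell(dim + enc_dim, dim)
        self.frame_proj = nn.Linear(dim + enc_dim, n_mels)       # predicts one mel frame

def n_params(module):
    return sum(p.numel() for p in module.parameters())

enc, dec = TinyEncoder(), TinyDecoder()
print(n_params(enc), n_params(dec))  # the decoder ends up several times larger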

More languages: As written in the paper, I compared three models. When setting up the generated model, I chose the default Tacotron parameters because of the experiments with LJ Speech. I suppose that the output of the text encoder is somehow language-independent (like phonemes, but more complex), so IMHO, the decoder does not have to be scaled up. The same applies to the separate model, which has even more parameters than the generated one. The bad results of the separate model are probably caused by training issues (the encoders and the decoder are imbalanced, and having two separate optimizers does not help). The situation is more complicated in the case of the shared model. I have experimented with adding language embeddings even to the character inputs, but it totally ruined the voice conversion abilities. Thus, currently, the encoder of the shared model is language-agnostic and the language-dependent processing happens in the decoder, which is given per-token language embeddings. So I think that scaling up the encoder is not very helpful (but I haven't tried it), because the dictionary does not grow with the number of languages -- for example, if we have the word "coronavirus" in Spanish and "coronavirus" in French, the encoder will process both in the same way. However, enlarging the decoder could probably be helpful (but I have also not tried that).
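
For illustration, a minimal sketch (assumed names and shapes, not the actual code of this repository) of the conditioning idea described above: the encoder output stays language-agnostic, and per-token language embeddings are concatenated to it so that the language-dependent processing happens on the decoder side (which also allows code-switching within one utterance):

```python
import torch
import torch.nn as nn

# Minimal sketch with assumed names and shapes.
class LanguageConditioning(nn.Module):
    def __init__(self, num_languages, lang_dim=4):
        super().__init__()
        self.lang_embedding = nn.Embedding(num_languages, lang_dim)

    def forward(self, encoder_outputs, language_ids):
        # encoder_outputs: (batch, time, enc_dim) -- language-agnostic text encoding
        # language_ids:    (batch, time) -- one language id per input token
        lang = self.lang_embedding(language_ids)            # (batch, time, lang_dim)
        return torch.cat([encoder_outputs, lang], dim=-1)   # consumed by the decoder's attention

enc_out = torch.randn(2, 7, 512)
lang_ids = torch.randint(0, 10, (2, 7))
print(LanguageConditioning(num_languages=10)(enc_out, lang_ids).shape)  # (2, 7, 516)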

Does it make sense? :slightly_smiling_face: These are my thoughts more than experiments; it would be interesting to try it out ...

sooftware commented 3 years ago

Thank you for your insight. I'll try enlarging the decoder's dimension.
I successfully trained on 10 languages [CSS10 (except Greek and Hungarian) + KSS (Korean) + LJ Speech (English)].

I'll also try those 10 languages plus 4 more: Greek, Hungarian, Italian, and Jejueo (JSS).
I think it would be very interesting.

Tomiinek commented 3 years ago

Wow, this is super interesting :slightly_smiling_face:

Can you please share your results afterwards? :eyes: Having all those languages including English would be super useful, but I do not currently have time to do it myself :slightly_frowning_face: