Closed: trfnhle closed this issue 3 years ago
Cross-language synthesis is an interesting direction, but I think it is a harder problem than single-speaker, single-language synthesis.
If we have a large dataset for a single speaker covering multiple languages, that is fine; we just need mixed text preprocessing (for example, characters for the target language plus the CMU dictionary for English). But how can we solve this problem if we don't have enough data for a single speaker across multiple languages? In practice, I think we will mostly have multi-speaker datasets that each cover only a single language.
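A minimal sketch of that mixed preprocessing idea: Latin-script words go through an English lexicon (ARPAbet phonemes), everything else is emitted character by character. The two-entry lexicon below is a hypothetical stand-in; a real system would load the full cmudict.

```python
import re

# Hypothetical mini CMU-style lexicon; a real system would load cmudict.
CMU_LEXICON = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}

def mixed_tokenize(text):
    """Split a mixed-script sentence into model tokens: Latin-alphabet
    words are looked up in the English lexicon (falling back to letters
    if missing), all other symbols (e.g. Chinese characters) are kept
    as single-character tokens."""
    tokens = []
    for chunk in re.findall(r"[A-Za-z]+|\S", text):
        if chunk.isascii() and chunk.isalpha():
            tokens.extend(CMU_LEXICON.get(chunk.lower(), list(chunk)))
        else:
            tokens.append(chunk)
    return tokens

print(mixed_tokenize("你好 hello world"))
# → ['你', '好', 'HH', 'AH0', 'L', 'OW1', 'W', 'ER1', 'L', 'D']
```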
@dathudeptrai @l4zyf9x Maybe we can use speech synthesis to create a dataset by mapping all the multi-speaker data to one speaker (style transfer), and then train the model :thinking:
@manmay-nakhashi You mean training a multi-speaker model with different languages, then generating a soft dataset and distilling it into a different model? In my opinion a soft dataset will not have good sound quality, so distilling on it will not give good results. I wonder whether there is a way to distill language information into an embedding vector, much like a speaker embedding?
I think making a single speaker speak many languages fluently is something worth trying. Assume that we have 1) three languages, possibly close like Portuguese & Spanish or very different like English & Japanese, and 2) a good-quality single-speaker dataset for each language (regardless of speaker gender; phoneme representations are obtainable).
What can we expect from model training under the following settings?
Setting 1:
Setting 2:
Any other recommended settings?
@tekinek In setting 1 without step 5, I think it will run into the problem that the generated audio interweaves multiple speakers' voices together, because no speaker embedding is used. This is not what we want.
I prefer setting 2, because languages often share some phoneme pronunciations, so leveraging this information helps the model focus on learning only the phonemes that differ between the two languages. But I would like to use both a language embedding and a speaker embedding.
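As a tiny illustration of the shared-phoneme idea: build one symbol table over the union of both languages' inventories, so a shared phoneme maps to a single ID and both languages train the same embedding entry. The inventories below are made up for illustration, not real phoneme sets.

```python
# Illustrative (not real) phoneme inventories for two languages.
EN_PHONEMES = ["a", "t", "s", "sh", "ng"]
FR_PHONEMES = ["a", "t", "s", "y", "r_uvular"]

# One symbol table over the union: shared phonemes get a single ID,
# language-specific phonemes keep their own entries.
symbols = sorted(set(EN_PHONEMES) | set(FR_PHONEMES))
symbol_to_id = {s: i for i, s in enumerate(symbols)}

# "a" maps to the same embedding index whether it came from
# English or French text, so what the model learns for it is reused.
print(symbols)
```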
The important thing, as you mention, is how we transfer language information. So we need to find a way to separate speaker information from language information. I think if we isolate those features as a speaker embedding, a character embedding, and a language embedding for each character, it becomes easy to combine them and produce a speaker that speaks multiple languages.
Assume we have two good-quality target datasets (EN, FR). We could combine them with an ASR dataset that has many speakers (maybe low quality, with little audio per speaker, but easy to collect) and then train the model. Because we mix many speakers for both languages, the language embedding will tend to hold the general features of the language rather than those of any particular speaker, and we isolate exactly the feature we need. Maybe this will work :thinking:
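The three-embedding combination could be sketched roughly like this (NumPy, toy sizes, additive conditioning as one possible design choice; all names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, N_LANG, N_SPK, DIM = 40, 2, 10, 8  # toy sizes

char_table = rng.normal(size=(VOCAB, DIM))
lang_table = rng.normal(size=(N_LANG, DIM))
spk_table = rng.normal(size=(N_SPK, DIM))

def encode(char_ids, lang_id, spk_id):
    """Per-token character embeddings plus language and speaker
    embeddings broadcast over the time axis. Because the three
    factors are separate tables, any (language, speaker) pair can
    be mixed at inference time, e.g. a French speaker's voice with
    English language features."""
    x = char_table[char_ids]         # (T, DIM)
    x = x + lang_table[lang_id]      # broadcast over time
    x = x + spk_table[spk_id]        # broadcast over time
    return x

seq = encode(np.array([3, 7, 1]), lang_id=0, spk_id=4)
print(seq.shape)  # → (3, 8)
```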
@l4zyf9x Thanks for your insightful thoughts.
> we mix a lot of speaker for both languages, so language embedding will try to hold the general feature of language not specify for any speaker
So you mean that in multilingual training, what is important (and possibly difficult) is to learn a good language embedding that helps the model decide how to pronounce a shared phoneme in each enrolled language? Why don't you think one good dataset per language is enough to learn this embedding? Introducing an ASR dataset may confuse the model, since the goal of TTS is to learn how to speak clearly.
@tekinek
> Why don't you think just one good dataset for each language is enough to learn this embedding?
I just think one good dataset per language is too small and not diverse enough to learn general lingual features. You're right that an ASR dataset could introduce noise, but we could first train on the mixed dataset and then fine-tune only on the target dataset. Hopefully the model will still retain the lingual features learned from the mixed dataset.
An ASR dataset doesn't work. I have already tried an ASR dataset (English/Mandarin) with Tacotron 2, and the outputs are full of glitches.
To prove that multi-lingual training is feasible, I'm experimenting with Tacotron 2 on a dataset that was entirely generated by Google Speech Synthesis (speaker: cmn-TW-Wavenet-A) in Mandarin (90%) + English (10%). I used this dataset to simulate the condition: single speaker + large data size (100K wav files) + multi-lingual. And I think using such a dataset to train a base model for transfer learning should preserve richer lingual features than an ASR dataset.
The performance so far: for an input like 以下是您搜尋的型號ABC一百D ("The following is the model number you searched for: ABC one hundred D"), the output is audible (waveform generated with a pretrained EN Multi-band MelGAN). However, the stop token loss becomes worse, so the waveform in the ending interval can contain a lot of noise. I'm not sure whether I need to adjust the parameters to increase the number of neurons; if so, can someone suggest which parameter I should increase first?
@mychiux413 You mentioned Google Speech Synthesis; is it this one, google-tts? I just synthesized some random English sentences with the speaker cmn-TW-Wavenet-A from that page. The Chinese-speaking-English audio is audible, but its quality is not as good as Chinese-speaking-Chinese, and I think that may be the reason. Besides the dataset problem, did you try adding a language embedding to each token? I think it would help the model distinguish English text from Chinese text.
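A per-token language ID, as suggested here, could be assigned with a simple rule so that a language embedding is added per token rather than per sentence. The ASCII heuristic below is just an illustration; a real tokenizer would decide this during text normalization.

```python
def token_language_ids(tokens, lang_for):
    """Assign each token a language ID so a language embedding can be
    looked up per token; useful for code-switched input such as
    Mandarin text with embedded English words."""
    return [lang_for(tok) for tok in tokens]

# Toy rule: ASCII letters -> English (1), everything else -> Mandarin (0).
is_english = lambda tok: 1 if tok.isascii() and tok.isalpha() else 0

tokens = ["以", "下", "是", "ABC"]
print(token_language_ids(tokens, is_english))  # → [0, 0, 0, 1]
```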
@l4zyf9x Yes, it comes from google-tts. Thanks for your suggestion; I will try it.
@mychiux413 Excuse me, could you please tell me where I can get a Chinese-English mixed dataset? I want to use Tacotron 2 to try a Chinese-English mixed TTS model.
@leijue222 I don't have a Chinese-English mixed dataset, so I just collected some Chinese-English text online with a crawler and from some open data (I don't remember where it came from), then used google-tts to generate the speech and packaged it all as a dataset.
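Packaging such generated audio as a dataset can be as simple as pairing each wav's ID with its source text in an LJSpeech-style "id|text" metadata file, which many TTS preprocessing scripts accept. The IDs and sentences below are illustrative, and `io.StringIO` stands in for writing the real `metadata.csv`.

```python
import io

# Illustrative (utterance_id, text) pairs; in practice these come from
# the crawled text and the wav files generated for them.
pairs = [
    ("utt_0001", "以下是您搜尋的型號ABC"),
    ("utt_0002", "hello 世界"),
]

buf = io.StringIO()  # stand-in for open("metadata.csv", "w", encoding="utf-8")
for utt_id, text in pairs:
    buf.write(f"{utt_id}|{text}\n")

print(buf.getvalue())
```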
Linking a relevant paper: Building Multilingual TTS using Cross Lingual Voice Conversion
The repo at https://github.com/Tomiinek/Multilingual_Text_to_Speech uses both a speaker embedding and a language embedding. Why not implement both embeddings?
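For illustration, one common way to feed both embeddings to the decoder (a sketch of the general idea, not necessarily how that repo implements it) is to tile them over the time axis and concatenate them with the encoder outputs:

```python
import numpy as np

def condition_encoder_outputs(enc_out, spk_emb, lang_emb):
    """Tile fixed-size speaker and language embeddings over the time
    axis and concatenate them to the per-token encoder outputs, so
    every decoder attention step sees both conditions."""
    T = enc_out.shape[0]
    spk = np.tile(spk_emb, (T, 1))    # (T, spk_dim)
    lang = np.tile(lang_emb, (T, 1))  # (T, lang_dim)
    return np.concatenate([enc_out, spk, lang], axis=-1)

# Toy example: 5 encoder frames of width 16, plus 4-dim speaker
# and 2-dim language embeddings -> width 22 conditioned outputs.
enc = np.zeros((5, 16))
spk = np.ones(4)
lang = np.ones(2)
out = condition_encoder_outputs(enc, spk, lang)
print(out.shape)  # → (5, 22)
```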
Hey guys, FastSpeech and Tacotron achieve good results on a single language, but what about voice synthesis of one speaker for many languages (>=2)? The obvious approach would be to create a single-speaker dataset covering multiple languages, but building such a dataset is not easy for many reasons (finding a speaker with good pronunciation in many languages, different phonemes between languages, ...). So what do you think about this problem :question: :smile: