TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Cross-Lingual speech synthesis discussion. #143

Closed trfnhle closed 3 years ago

trfnhle commented 4 years ago

Hey guys, FastSpeech and Tacotron achieve good results on a single language, but what about synthesizing one speaker's voice in multiple languages (>=2)? Obviously we could create a single-speaker dataset covering multiple languages, but building such a dataset is not easy for many reasons (finding a speaker with good pronunciation in many languages, different phonemes between languages, ...). So what do you think about this problem :question: :smile:

candlewill commented 4 years ago

Cross-lingual synthesis is an interesting direction, but I think it's a harder problem than single-speaker, single-language synthesis.

dathudeptrai commented 4 years ago

If we have a large single-speaker dataset covering multiple languages, that is fine; we just need mixed text preprocessing (for example, characters for the target language + CMU phonemes for English). But how can we solve this problem if we don't have enough single-speaker multi-language data? I think in most cases we will have multi-speaker datasets, but almost always in a single language :v.
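
For example, a toy sketch of what I mean by mixed preprocessing; the symbol lists and the `text_to_ids` helper are made up for illustration, not the real TensorFlowTTS processor:

```python
# Toy mixed-symbol preprocessing: raw characters for the target language plus
# '@'-prefixed CMU (ARPAbet) phonemes for English, in one shared ID space.
CMU_PHONEMES = ["AH", "AO", "B", "CH", "D", "HH", "IY", "L", "OW", "S"]  # truncated
TARGET_CHARS = list("abcdefghijklmnopqrstuvwxyz ")  # placeholder target-language characters

SYMBOLS = ["pad"] + TARGET_CHARS + ["@" + p for p in CMU_PHONEMES] + ["eos"]
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}

def text_to_ids(tokens):
    """tokens: target-language characters or '@'-prefixed CMU phonemes."""
    return [SYMBOL_TO_ID[t] for t in tokens] + [SYMBOL_TO_ID["eos"]]

# A code-switched sentence: target-language characters plus an English word
# already converted to CMU phonemes by a G2P tool.
ids = text_to_ids(list("xin chao ") + ["@HH", "@AH", "@L", "@OW"])
```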

manmay-nakhashi commented 4 years ago

@dathudeptrai @l4zyf9x maybe we can create a dataset using speech synthesis to convert all the multi-speaker data to one speaker (style transfer) and then train the model :thinking:

trfnhle commented 4 years ago

@manmay-nakhashi you mean training a multi-speaker model on different languages, generating a synthetic ("soft") dataset, and then distilling it into a different model? In my opinion the synthetic dataset will not have good sound quality, so distilling on it will not give a good result. I wonder whether there is a way to distill language information into an embedding vector, similar to a speaker embedding?

tekinek commented 4 years ago

I think making a single speaker speak many languages fluently is something worth trying. Assume that we have 1) three languages, possibly close like Portuguese & Spanish or very different like English & Japanese, and 2) a good-quality single-speaker dataset for each language (regardless of speaker gender; a phoneme representation is obtainable).

What can we expect from training the model under the following settings?

Setting 1:

  1. Map each language's phoneme set onto arbitrary characters so that there is no phoneme overlap between languages (a rough sketch follows this list).
  2. Update each dataset accordingly with the artificial phoneme map from step 1.
  3. Mix the three datasets without marking any language or speaker information.
  4. Train a single-speaker model on this mixed dataset.
  5. Employ style transfer to make the model speak every language in the same style (how?).
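
Roughly what I mean by steps 1 and 2, as a toy sketch (the inventories, IDs, and `encode` helper are made up for illustration):

```python
# Toy version of steps 1-2: remap each language's phoneme inventory onto
# disjoint symbol IDs so the mixed dataset has no phoneme overlap at all.
INVENTORIES = {
    "en": ["AH", "B", "D", "IY", "S"],   # placeholder inventories
    "pt": ["a", "b", "d", "i", "s"],
    "ja": ["a", "k", "i", "s", "u"],
}

ID_MAP, next_id = {}, 1  # 0 reserved for padding
for lang, phonemes in INVENTORIES.items():
    for p in phonemes:
        ID_MAP[(lang, p)] = next_id
        next_id += 1

def encode(lang, phonemes):
    # "a" in Portuguese and "a" in Japanese deliberately get different IDs.
    return [ID_MAP[(lang, p)] for p in phonemes]

print(encode("pt", ["a", "b"]))  # [6, 7]
print(encode("ja", ["a", "k"]))  # [11, 12]
```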

Setting 2:

  1. Use the same phoneme set for all three languages.
  2. Use a language embedding instead of a speaker embedding (see the sketch after this list).
  3. Employ style transfer to make the model speak every language in the same style (how?).
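
And a toy sketch of Setting 2, assuming a shared phoneme vocabulary and a per-utterance language ID; the table sizes and the `encoder_inputs` helper are hypothetical, not TensorFlowTTS code:

```python
import tensorflow as tf

# Toy version of Setting 2: one shared phoneme vocabulary plus a per-utterance
# language ID that selects a learned language embedding (sizes are arbitrary).
N_SHARED_PHONEMES = 60   # e.g. an IPA-like shared inventory
N_LANGUAGES = 3

phoneme_table = tf.keras.layers.Embedding(N_SHARED_PHONEMES, 256)
language_table = tf.keras.layers.Embedding(N_LANGUAGES, 64)

def encoder_inputs(phoneme_ids, language_id):
    """phoneme_ids: [batch, time] int32, language_id: [batch] int32."""
    x = phoneme_table(phoneme_ids)                                    # [batch, time, 256]
    lang = language_table(language_id)                                # [batch, 64]
    lang = tf.tile(tf.expand_dims(lang, 1), [1, tf.shape(x)[1], 1])   # repeat over time
    return tf.concat([x, lang], axis=-1)                              # [batch, time, 320]
```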

Any other recommended settings?

trfnhle commented 4 years ago

@tekinek In Setting 1 without step 5, I think we will run into the problem that the generated audio interweaves the different speakers' voices, because we don't mark any speaker or language information. This is not what we want.

I prefer Setting 2 because languages often share some phoneme pronunciations, so leveraging this information helps the model focus on learning only the phonemes that differ between languages. But I would use both a language embedding and a speaker embedding.

The important thing, as you mention, is how we transfer language information, so we need to find a way to separate speaker information from language information. I think once we isolate those features into a speaker embedding, a character embedding, and a language embedding for each character, it is easy to combine them and produce a speaker that speaks multiple languages.

Assume we have two good-quality target datasets (EN, FR). We could combine them with an ASR dataset containing a lot of speakers (maybe low quality and only a short time per speaker, but easy to collect) and then train the model. Because we mix a lot of speakers for both languages, the language embedding will try to hold the general features of the language rather than features specific to any speaker, and so we isolate the feature we need. Maybe this will work :thinking:
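
A toy sketch of how such a mixed corpus could be organized; the paths, dataset names, and metadata format below are just placeholders, not anything TensorFlowTTS actually expects:

```python
import csv

# Toy metadata for the mixing idea above: two good-quality target corpora
# (EN, FR) plus a large, noisier multi-speaker ASR corpus per language.
# Every utterance keeps a speaker ID and a language ID, so the model can try
# to factor "who is speaking" apart from "which language is spoken".
rows = [
    # (wav_path, text, speaker_id, language_id) -- all placeholder values
    ("target_en/wav/0001.wav",  "example english text", "target_en",   "en"),
    ("target_fr/wav/0001.wav",  "example french text",  "target_fr",   "fr"),
    ("asr_en/spk0123/0001.wav", "noisy but plentiful",  "asr_en_0123", "en"),
    ("asr_fr/spk0456/0001.wav", "short and noisy",      "asr_fr_0456", "fr"),
]

with open("train_metadata.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="|")
    writer.writerow(["wav_path", "text", "speaker_id", "language_id"])
    writer.writerows(rows)
```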

tekinek commented 4 years ago

@l4zyf9x Thanks for your insightful thoughts.

Because we mix a lot of speakers for both languages, the language embedding will try to hold the general features of the language rather than features specific to any speaker

So you mean that, in the case of multilingual training, what is important (and possibly difficult) is to learn a good language embedding which helps the model decide how to pronounce a shared phoneme in each enrolled language? Why don't you think just one good dataset for each language is enough to learn this embedding? Introducing an ASR dataset may confuse the model, since the goal of TTS is to learn how to speak clearly.

trfnhle commented 4 years ago

@tekinek

Why don't you think just one good dataset for each language is enough to learn this embedding?

I just think one good dataset for each language is too small and not diverse enough to learn general lingual features. You're right that an ASR dataset could introduce noise, but we could first train on the mixed dataset and then finetune on the target dataset only. Hopefully the model would still retain the lingual features learned from the mixed dataset.
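
Something like this two-stage schedule, sketched at a pseudocode level (a generic Keras-style model and two tf.data pipelines are assumed; this is not the actual TensorFlowTTS training loop):

```python
import tensorflow as tf

def pretrain_then_finetune(model, mixed_ds, target_ds):
    """model: a Keras-style TTS model; mixed_ds / target_ds: tf.data pipelines."""
    # Stage 1: learn general lingual features from the big, noisy mixed corpus.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mae")
    model.fit(mixed_ds, epochs=50)

    # Stage 2: finetune on the clean target dataset only, with a lower learning
    # rate, hoping the lingual features from stage 1 are not forgotten.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mae")
    model.fit(target_ds, epochs=20)
    return model
```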

mychiux413 commented 4 years ago

An ASR dataset doesn't work; I have already tried one (English/Mandarin) with tacotron2, and the outputs are full of glitches.

To prove that multi-lingual training is feasible, I'm experimenting with tacotron2 on a dataset generated entirely by Google Speech Synthesis (speaker: cmn-TW-Wavenet-A), with languages Mandarin (90%) + English (10%). I used this dataset to simulate the condition single speaker + large data size (100K wav files) + multi-lingual, and I think using such a dataset to train a base model for transfer learning should preserve richer lingual features than an ASR dataset.
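
For reference, the dataset generation was roughly along these lines. This is only a sketch using the google-cloud-texttospeech Python client; exact type and argument names can differ between client-library versions, and the sample rate is just my choice:

```python
from google.cloud import texttospeech  # pip install google-cloud-texttospeech

client = texttospeech.TextToSpeechClient()

def synthesize(text, out_path, language_code="cmn-TW", voice_name="cmn-TW-Wavenet-A"):
    """Generate one wav with Google Cloud TTS (22.05 kHz is my own choice)."""
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(
            language_code=language_code, name=voice_name),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16,
            sample_rate_hertz=22050),
    )
    with open(out_path, "wb") as f:
        f.write(response.audio_content)

# e.g. loop over the ~100K collected Mandarin/English sentences:
# for i, sentence in enumerate(sentences):
#     synthesize(sentence, f"wavs/{i:06d}.wav")
```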

The following is the performance so far:

I'm not sure if I need to adjust the parameters to increase the number of neurons. If so, can someone suggest which parameter I should increase first?

trfnhle commented 4 years ago

@mychiux413 You mention Google Speech Synthesis; is it this one: google-tts? I just synthesized some random English sentences with the speaker cmn-TW-Wavenet-A from that page. The Chinese-speaker-speaking-English audio is audible, but its quality is not as good as Chinese-speaking-Chinese, and I think that may be the reason. Besides the dataset problem, have you tried adding a language embedding to each token? I think it would help the model distinguish English sentences from Chinese sentences.

mychiux413 commented 4 years ago

@l4zyf9x Yes, it comes from google-tts, and thanks for your suggestion, I will try it.

leijue222 commented 4 years ago

@mychiux413 Excuse me, could you please tell me where I can get the Chinese-English mixed dataset? I want to use tacotron2 to try a Chinese-English mixed TTS model.

mychiux413 commented 4 years ago

@leijue222 I don't have a Chinese-English mixed dataset; I just collected some Chinese-English text online with a crawler and from some open data (I don't remember where it came from), then used google-tts to generate the speech and packaged it as a dataset.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

monatis commented 3 years ago

Linking a relevant paper: Building Multilingual TTS using Cross Lingual Voice Conversion

michael-conrad commented 3 years ago

The repo at https://github.com/Tomiinek/Multilingual_Text_to_Speech uses both a speaker embedding and a language embedding. Why not implement both embeddings?