Tomiinek / Multilingual_Text_to_Speech

An implementation of Tacotron 2 that supports multilingual experiments with parameter-sharing, code-switching, and voice cloning.
MIT License
830 stars · 157 forks

Discuss Chinese-English mixed TTS. #27

Closed leijue222 closed 3 years ago

leijue222 commented 4 years ago

Hi, @Tomiinek. This is nice work, and it's an honor to see this project. I have some questions about train.txt; I hope you can clear up my confusion.

  1. How to get the original train.txt file? After cloning this project, under /data/css10/ we can see the original train.txt. After running prepare_css_spectrograms.py, we get the spectrograms and linear spectrograms, and the structure of train.txt changes:
(original) 000285|chinese|chinese|chinese/call_to_arms/call_to_arms_0285.wav|||húixiāngdòu de húizì, zěnyáng xiě de?|
(processed) 000285|chinese|chinese|chinese/call_to_arms/call_to_arms_0285.wav|spectrograms\000285.npy|linear_spectrograms\000285.npy|húixiāngdòu de húizì, zěnyáng xiě de?|

So, how do I get the original train.txt file? I want to create one for another dataset.

  2. The structure of train.txt

I have questions about the meaning of these variables: idx, s, ph https://github.com/Tomiinek/Multilingual_Text_to_Speech/blob/ca00959d231bcfe0dc845d8c1938e869daf3feae/data/prepare_css_spectrograms.py#L57

  1. idx: As long as we make sure that idx is unique and points to a specific audio file, we can define it any way we want, right? Usually the id is defined by the file name, but I noticed that you did not do this. Can I define this variable with the filename of the audio?

  2. s: speaker? I'm puzzled. If the dataset has only one speaker, is it defined as the language name? Otherwise, is it a serial number for the different speakers (0, 1, 2, 3, 4, ...)?

  3. ph: I don't understand the meaning of this variable. Does it mean "\n"?

Tomiinek commented 4 years ago

Hello, thank you!

  1. At first, you have to write your own loader that goes through the transcripts and audio files of your custom dataset. It is enough to implement it in the file loaders.py; I have already implemented some loaders for popular datasets such as LJSpeech, VCTK and M-AILABS, so you can start from that code and modify it for your own dataset (a rough sketch of such a loader follows below this list). Second, you will have to run TextToSpeechDataset.create_meta_file. Please read through its description and adjust the arguments to your needs. It will generate the meta-file and spectrograms you are looking for :slightly_smiling_face: Note that the dataset_name argument should match the name of your loader function, i.e., for generating the meta-file of M-AILABS use dataset_name = mailabs.

  2. The meta-file format is also described in the function mentioned above. idx - yes, exactly as you say; s - it is just a speaker id, so you can use whatever you want; I used the language name and possibly a suffix if there were more speakers of a particular language (e.g., zh-00, zh-01, ...); ph - this field can contain a phonemicized variant of the transcript; again, it can be generated automatically by the function above (but it takes a really long time).
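A rough sketch of what such a loader could look like (illustrative only -- copy the exact signature and return format from an existing loader such as ljspeech in loaders.py; the biaobei name and the metadata.csv layout here are assumptions):

```python
import os

# Hypothetical loader sketch; the real functions in loaders.py define the authoritative format.
def biaobei(dataset_root):
    """Return (utterance_id, speaker, language, audio_path, raw_text) items for a
    Biaobei-like layout with a metadata.csv of `filename|transcript` lines."""
    items = []
    with open(os.path.join(dataset_root, "metadata.csv"), encoding="utf-8") as f:
        for line in f:
            filename, transcript = line.strip().split("|", 1)
            audio_path = os.path.join(dataset_root, "wavs", filename + ".wav")
            # Single speaker, so the speaker id can simply follow the language name.
            items.append((filename, "zh-biaobei", "zh", audio_path, transcript))
    return items
```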

Enjoy! :partying_face:

leijue222 commented 4 years ago

Thank you for your detailed reply. Your code is well organized and comfortable to read. I noticed the loaders.py file, used the ljspeech function to load that dataset, and, following the format of the existing functions, added a biaobei function to load the other one. So I have processed two datasets now:

  1. LJSpeech: 13,100 English audio clips | 24 hours | one woman speaker
  2. Biaobei: 10,000 Chinese audio clips | 12 hours | one woman speaker

Question 1: What values should the parameters "generator_bottleneck_dim" and "generator_dim" be set to for training? Question 2: Maybe I don't have to run TextToSpeechDataset.create_meta_file? I have read its description; it is used to create spectrograms, linear spectrograms, and phonemized text.

  1. For "phonemized_text" The "phonemized_text" in the prepare_css_spectrograms.py is ph and ph="", I don't know why. So I asked you what it means. Can I keep the ph="" right? I just use Chinese and English dataset. And in your example, I see the ph="" too.

  2. For "spectrograms" and "linear spectrograms" As I said yesterday, I modified the prepare_css_spectrograms.py and got the spectrogram and linear spectrogram by prepare_css_spectrograms.pyand changed the structure of train.txt to add spectrogram | linear spectrogram.

https://github.com/Tomiinek/Multilingual_Text_to_Speech/blob/7d96891d22e586660c6da654f7af702749eafc68/dataset/dataset.py#L250-L254 https://github.com/Tomiinek/Multilingual_Text_to_Speech/blob/7d96891d22e586660c6da654f7af702749eafc68/data/prepare_css_spectrograms.py#L71-L77

This part of the processing in both files is for generating the spectrogram | linear spectrogram columns. Since I already generated the spectrograms with prepare_css_spectrograms.py, maybe I don't have to go through TextToSpeechDataset.create_meta_file anymore (a sketch of this spectrogram extraction follows below).

(screenshot: Snipaste_2020-10-31_11-04-44)
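For reference, a minimal sketch of the kind of spectrogram extraction described above, using librosa (the repository's own audio code and the STFT/mel parameters in params.py are authoritative; the values here are only placeholders):

```python
import numpy as np
import librosa

def extract_spectrograms(wav_path, mel_path, linear_path,
                         sr=22050, n_fft=1100, hop_length=275, n_mels=80):
    # Load and peak-normalize the waveform.
    y, _ = librosa.load(wav_path, sr=sr)
    y = y / max(1e-8, float(np.max(np.abs(y))))

    # Linear (STFT magnitude) spectrogram.
    linear = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))

    # Log-mel spectrogram derived from the same STFT.
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel = np.log(np.clip(mel_basis @ linear, 1e-5, None))

    np.save(mel_path, mel.astype(np.float32))
    np.save(linear_path, np.log(np.clip(linear, 1e-5, None)).astype(np.float32))
```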

Tomiinek commented 4 years ago

Congratulations :slightly_smiling_face:

  1. You use just two languages, so I would choose very low numbers such as generator_bottleneck_dim=1 and generator_dim=2. Note that the performance gain is more visible if you use more languages for training, even if you do not use them later during inference. Also, you need more speakers in the data for successful voice cloning and code-switching (see this issue #7).

  2. You are absolutely right. If you have already merged the datasets and run prepare_css_spectrograms.py, you do not need to run TextToSpeechDataset.create_meta_file anymore. Yes, just keep ph=""; it means there is no phonemicized variant, so it cannot be used later via the parameter use_phonemes = True.

leijue222 commented 4 years ago

Congratulations 🙂

  1. You use just two languages, so I would choose very low numbers such as generator_bottleneck_dim=1 and generator_dim=2. Note that the performance gain is more visible if you use more languages for training, even if you do not use them later during inference. Also, you need more speakers in the data for successful voice cloning and code-switching (see this issue #7).
  2. You are absolutely right. If you have already merged the datasets and run prepare_css_spectrograms.py, you do not need to run TextToSpeechDataset.create_meta_file anymore. Yes, just keep ph=""; it means there is no phonemicized variant, so it cannot be used later via the parameter use_phonemes = True.

Thanks!

  1. I want to start by using these two datasets to get through the whole pipeline. Then, I'll try to tune it by adding more speakers of these two languages.
  2. I got it! Choosing one of prepare_css_spectrograms.py or TextToSpeechDataset.create_meta_file to get the spectrograms and linear spectrograms is fine. And Chinese and English do not need the phonemicized variant, so I'll just keep ph="".

Thanks again for your reply, best regards to you.

leijue222 commented 4 years ago

There is an error when I run train.py: AttributeError: type object 'Params' has no attribute 'mel_normalize_mean'. The error comes from line 107: https://github.com/Tomiinek/Multilingual_Text_to_Speech/blob/83b1164e9d385742558b8b0171568816e7f51e44/utils/audio.py#L105-L108 Then I checked params.py and found that there is indeed no definition of mel_normalize_mean. How can I get the values of these two parameters?

By the way, for my case of training two languages, which default parameters must be changed? I want to achieve the effect of "Chinese w/ Russian" from your example website, but with Russian replaced by English.

Currently, referring to generated_switching.json, I modified these parameters:

dataset = "zh_en"                    # one of: css10, ljspeech, vctk, my_blizzard, my_common_voice, mailabs, must have implementation in loaders.py
languages = ['chinese', 'english']   # list of languages which will be loaded from the dataset, codes should correspond to
multi_speaker = True                 # if True, multi-speaker model is used, speaker embeddings are concatenated to encoder outputs
multi_language = True                # if True, multi-lingual model is used, language embeddings are concatenated to encoder outputs
speaker_embedding_dimension = 2      # used if multi_speaker is True, size of the speaker embedding
language_embedding_dimension = 2     # used if multi_language is True, size of the language embedding
input_language_embedding = 2         # used if encoder_type is 'shared', language embedding of this size is concatenated to input char. embeddings
generator_dim = 2                    # used if encoder_type is 'generated', size of the 'language embedding' which is used by layers to generate weights
generator_bottleneck_dim = 1         # used if encoder_type is 'generated', size of fully-connected layers which generate parameters for encoder layers
encoder_dimension = 256              # output dimension of the encoder
balanced_sampling = True             # enables balanced sampling per languages (not speakers), multi_language must be True
perfect_sampling = True              # used just if balanced_sampling is True, should be used together with encoder_type 'convolutional' or 'generated'
                                     # if True, each language has the same number of samples and these samples are grouped, batch_size must be divisible
                                     # if False, samples are taken from the multinomial distr. with replacement
reversal_classifier = True           # if True, adversarial classifier for predicting speakers from encoder outputs is used

I am not a researcher in this field; there are a lot of parameters and many of them are unfamiliar to me.

leijue222 commented 4 years ago

Oh, I see! https://github.com/Tomiinek/Multilingual_Text_to_Speech/blob/83b1164e9d385742558b8b0171568816e7f51e44/train.py#L248 The mel_normalize_mean parameter is successfully calculated on line 248, but audio.py doesn't see it; let me figure out how to solve it.
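The pattern I would expect (a rough sketch with illustrative names, not the repository's exact code) is to compute the per-bin statistics once over the training mels and attach them to the shared Params object, so that any module importing it, such as utils/audio.py, sees the same values:

```python
import numpy as np
from params.params import Params as hp  # assumed import path

def attach_normalization_stats(mel_paths):
    mels = [np.load(p) for p in mel_paths]            # each array is (n_mels, frames)
    stacked = np.concatenate(mels, axis=1)            # concatenate along the time axis
    hp.mel_normalize_mean = stacked.mean(axis=1)      # per-bin mean (length 80)
    hp.mel_normalize_variance = stacked.std(axis=1)   # per-bin spread (length 80)
```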

leijue222 commented 4 years ago

Environment: Windows 10, two 1080 Ti GPUs, other packages consistent with requirements.txt. I made the following changes to get train.py running:

  1. Write the computed parameter values into params.py.

  2. change this line https://github.com/Tomiinek/Multilingual_Text_to_Speech/blob/83b1164e9d385742558b8b0171568816e7f51e44/modules/encoder.py#L41 to:

    x = torch.nn.utils.rnn.pack_padded_sequence(x, x_lenghts, batch_first=True, enforce_sorted=False)

    Otherwise, it will get the following error:

    RuntimeError: `lengths` array must be sorted in decreasing order when `enforce_sorted` is True. You can pass `enforce_sorted=False` to pack_padded_sequence and/or pack_sequence to sidestep this requirement if you do not need ONNX exportability.
  3. Finally, I can run train.py! During the process it sometimes prints the following warnings: UserWarning: PyTorch is not compiled with NCCL support and warning: audio amplitude out of range, auto clipped. After about 1 hour of running, the log is: logs. After about 3.5 hours, the log is: logs2.

Therefore, I have some questions: the result for these two parameters is a two-dimensional array with 80 rows. Is writing them directly into params.py the correct approach? And is there anything wrong with my operations above? @Tomiinek

Tomiinek commented 4 years ago

Hello,

Tomiinek commented 4 years ago

Logs look ok ... the total loss and MCD are falling, great news! :slightly_smiling_face:

leijue222 commented 4 years ago

I don't know if the way I save these two parameters is correct. My current way is to manually add mel_normalize_mean and mel_normalize_variance to params.py: (screenshot: parpams) Is this the right way to add parameters manually? My way feels a bit silly: train.py calculates the values successfully, but audio.py can't get them, so I get the error type object 'Params' has no attribute 'mel_normalize_mean' and have to add the two parameters manually.

Do you have a better idea for how audio.py can get the mel_normalize_mean value that train.py calculates?

———————————— Dividing line ——————————————

The loss trend is good, so it seems that adding the enforce_sorted=False parameter in encoder.py does not have much effect.

I stopped and set multi_speaker=False, reversal_classifier=False to retrain. Before, I ran 3K steps in 6 hours without saving any checkpoint, so I set checkpoint_each_epochs=1. It has been training for 2 hours now: logs3. It then trained for 9 hours and was interrupted by CUDA error: unspecified launch failure: logs4.

Now I load the saved checkpoint (dict) to continue training.

By the way, how long until it gives good results, i.e., how many steps should be trained, 50K? It is estimated that it will take 4 days to train 50K steps, which is too long. I hope I can get good results from the intermediate checkpoints.

leijue222 commented 4 years ago

Model result of 1.0_loss-17-0.123 example 1: (english-ljs.wav)

Different models expect different lines, some have to specify speaker, language.

example 2: (chinese-biaobei.wav)

jìsuànjī dàxué zhǔyào xuékē shì kēxué hé jìzhúbù, xuéshēng kěyǐ huòqǔ jìsuànjīkēxué hé jìzhú de běnkē xuéwèi

So far, training has exceeded 10 hours, but in these two results the English is not English and the Chinese is not Chinese. I must have made a wrong setting; maybe I shouldn't have set ph="". Perhaps I should have phonemized_text and trained on it rather than on pinyin or English words. wav.zip

Tomiinek commented 4 years ago

Hello again,

Is this the right way to add parameters manually? My way feels a bit silly: train.py calculates the values successfully, but audio.py can't get them, so I get the error type object 'Params' has no attribute 'mel_normalize_mean' and have to add the two parameters manually.

It should work this way, but I think you should figure out what is happening and why the original code does not work. Can you please post the whole stack trace here? The function from audio.py has to be called from train.py ...

So far, training has exceeded 10 hours, but in these two results the English is not English and the Chinese is not Chinese. I must have made a wrong setting; maybe I shouldn't have set ph="". Perhaps I should have phonemized_text and trained on it rather than on pinyin or English words. wav.zip

The training should take more than 10 hours. I do not remember exactly, but let's say 25k training steps with properly set guided attention decay should be enough.

The samples are a good start, aren't they? I can recognize some of the English words ... How about the attention plot? Once a sharp "diagonal" is established, victory is yours.
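Regarding the guided attention mentioned above, here is a hedged sketch of the usual diagonal penalty (in the spirit of Tachibana et al.; the repository's exact formulation and decay schedule may differ):

```python
import torch

def guided_attention_loss(alignments, input_lengths, output_lengths, g=0.2):
    """alignments: (batch, decoder_steps, encoder_steps) attention weights."""
    batch_size = alignments.size(0)
    loss = 0.0
    for b in range(batch_size):
        t_len, n_len = int(output_lengths[b]), int(input_lengths[b])
        t = torch.arange(t_len, dtype=torch.float32).unsqueeze(1) / t_len
        n = torch.arange(n_len, dtype=torch.float32).unsqueeze(0) / n_len
        # Penalize attention mass that lies far away from the diagonal.
        w = 1.0 - torch.exp(-((n - t) ** 2) / (2.0 * g * g))
        loss = loss + (alignments[b, :t_len, :n_len] * w).mean()
    return loss / batch_size
```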

leijue222 commented 4 years ago

Thanks for your kind reply.

It should work this way, but I think you should figure out what is happening and why the original code does not work. Can you please post the whole stack trace here?

I withdrew the whole stack trace image because there is a simpler solution: training on Ubuntu is the wise choice. On Win10 there are two compatibility problems:

  1. audio.py can't get the mel_normalize_mean parameter.
  2. The compatibility problem caused by espeak when using TextToSpeechDataset.create_meta_file to generate phonemes under Win10 cannot be solved.

You're right, the samples are a good start. For English, I can recognize some of the English words. But pinyin is essentially a syllable combination plus a tone, so the data should be sparse. Therefore, using phonemes may be better.

leijue222 commented 4 years ago

Goal: a TTS model that can mix Chinese and English
LJSpeech: 13,100 English audio clips | 24 hours | one woman speaker
Biaobei: 10,000 Chinese audio clips | 12 hours | one woman speaker

Note about my training steps:

  1. Use sox to normalize the audio files and resample everything to 22050 Hz.

  2. Write a function to load my dataset and implement it in the file loaders.py. Then organize train.txt and val.txt with TextToSpeechDataset.create_meta_file (note: set use_phonemes=True and use_preemphasis=True before doing so; train.txt : val.txt = 9 : 1).

  3. Params.py settings:

******************* DATASET SPECIFICATION *******************
dataset = "zh_en"
languages = ['zh', 'en']
balanced_sampling = True
perfect_sampling = True 
*************************** TEXT ****************************
use_phonemes = True
******************** PARAMETERS OF AUDIO ********************
use_preemphasis = True
******************** PARAMETERS OF MODEL ********************
encoder_type = "generated"    # In your website samples, I thought generated type is the best!
*generator_dim = 2                
*generator_bottleneck_dim = 1

multi_speaker = False       # Two languages, each language has only one speaker
*speaker_embedding_dimension = 32 #If each language has 10 speakers, what should this value be?

multi_language = True
*language_embedding_dimension = 2

The training time is too long, and I am worried that parameter errors will lead to bad training results. @Tomiinek Could you please help me recheck my parameter settings? Especially the model parameters marked with *; I can't determine their sizes from the code comments.

Tomiinek commented 4 years ago

Hello,

I withdrew the whole stack trace image because there is a simpler solution: training on Ubuntu is the wise choice. On Win10 there are two compatibility problems:

  1. audio.py can't get the mel_normalize_mean parameter.
  2. The compatibility problem caused by espeak when using TextToSpeechDataset.create_meta_file to generate phonemes under Win10 cannot be solved.

Ah, I am sorry for this inconvenience. I have unfortunately no experience with running on Win10. Good point though! Thank you.

Write a function to load my dataset and implement it in the file loaders.py. Then organize train.txt and val.txt with TextToSpeechDataset.create_meta_file (note: set use_phonemes=True and use_preemphasis=True before doing so; train.txt : val.txt = 9 : 1).

Just a note: do the phonemes reflect the tones like pinyin does?

Could you please help me recheck my parameter settings? Especially the model parameters marked with *; I can't determine their sizes from the code comments.

Sure. I think that the generated model does not use the language_embedding_dimension parameter at all, so do not worry about that one; it is replaced by generator_dim, and 2 or 3 is reasonable as you have just 2 languages. generator_bottleneck_dim should be less than generator_dim, so I would choose 1. You would have 20 speakers in total, so something like 16 should hopefully be fine for speaker_embedding_dimension. Other mentioned parameters are ok.

leijue222 commented 4 years ago

Ah, I am sorry for this inconvenience. I have unfortunately no experience with running on Win10. Good point though! Thank you.

I chose Win10 due to the lack of hard drive capacity on Ubuntu. Generally, people do not train under Win10; after all, many compatibility problems cannot be avoided.

Just a note: do the phonemes reflect the tones like pinyin does?

Yes. Referring to your code, I use the jieba and pinyin packages to convert Chinese into pinyin and then generate phonemes.
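A minimal sketch of that conversion, assuming the jieba and pypinyin packages (the exact packages and options used above may differ):

```python
import jieba
from pypinyin import lazy_pinyin, Style

def to_pinyin(text):
    # Segment into words first, then emit tone-numbered pinyin
    # (Style.TONE3 appends the tone digit to each syllable, e.g. "hui4").
    return " ".join(
        "-".join(lazy_pinyin(word, style=Style.TONE3)) for word in jieba.cut(text)
    )

print(to_pinyin("邓小平与撒切尔会晤"))  # roughly "deng4-xiao3-ping2 yu3 sa1-qie4-er3 hui4-wu4"
```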

Other mentioned parameters are ok.

Thank you very much!
And I remember you gave a nice reference for voice cloning and code-switching before:

Add more speakers for each language, e.g., 20 speakers per language; and each speaker doesn't have to have many examples: 50 transcript-recording pairs per speaker can be ok.

But my current data is rather far from your suggestion:

COMPARE

|                 | Number of speakers per language | Number of audio files per speaker |
|-----------------|---------------------------------|-----------------------------------|
| My situation    | 1                               | zh: 10,000; en: 13,100            |
| Your suggestion | 10 or 20                        | 50 is ok                          |

Training 50K steps is a long process. I hope I can get a good result when code-switching between en and zh. If the result is not good, then I will have to rearrange the data: increase the number of speakers and reduce the number of audio files per speaker.

------------------------------------ Training record -----------------------------------
Time: 15.5 hours | Epoch: 37 | Steps: 13.5K | Test: 1.0_loss-37-0.106 (Some words in English are pronounced a bit like English, but Chinese has not yet progressed) sendpix5
Time: 32.25 hours | Epoch: 78 | Steps: 27.68K | Test: 1.0_loss-78-0.097 (Both Chinese and English pronunciation are relatively correct!) sendpix6
Time: 2.5 days | Epoch: 144 | Steps: 50.5K | sendpix9 — the loss can't go down anymore, and Chinese has the problem of inaccurate tones

leijue222 commented 4 years ago
language: en and zh 
speaker: `en-ljs` and `zh-biaobei`

New progress! Example of only English: en-ljs-79.wav and only Chinese: zh-biaobei-79.wav

Text: As a result of the infusion of liquids through the cutdowns, the cardiac massage, and the airway
Order: {Text}|en-ljs|en
Text: huángyù línghǎn yǎ le hóulóng, juéwàng dì tānzuò zài dìshàng。
Order: {Text}|zh-biaobei|zh

These two pronunciations are relatively correct. They might be even better if processed with WaveRNN.

Example of cloning the speaker's voice: clone_voice-ljs2biaobei.wav

Text: As a result of the infusion of liquids through the cutdowns, the cardiac massage, and the airway,
Order: {Text}|zh-biaobei|en

Example of code-switching: zh-en-mix.wav

Text: huángyù línghǎn yǎ le hóulóng, As a result of the infusion of liquids through the cutdowns,
Order: {Text}|zh-biaobei|zh-10,en-10,zh

As you said, both code-switching and voice cloning failed. wavs_test.zip I have an idea: maybe I can train directly on the Common Voice dataset you provided, deleting one of the languages and adding English to the Common Voice data. @Tomiinek By the way, did you use phonemes and preemphasis when training the code-switching models on the cleaned Common Voice?

Tomiinek commented 4 years ago

If I were you, I would definitely use as many languages and speakers as possible. It will definitely improve both English and Chinese, and you can just throw the extra ones away during inference. So incorporating English into the provided datasets is IMHO a great idea. You can use the TUNDRA dataset; it has more languages with just a little data each, it also includes English, and it is segmented similarly to CSS10.

No I did not use phonemes, but I used preemphasis. It slightly improves the synthesis when using Griffin-Lim, but if you use WaveRNN or something like that, it does not seem to be important.

leijue222 commented 4 years ago

Hi, @Tomiinek, thanks very much. There is good progress: I succeeded in code-switching, but the Chinese pitch is too weird. (screenshot) The ground-truth mel obviously has a slope, but what I generate is flat, so the pitch is different.

Does Params.py have any parameters that affect the pitch?

Oh, I see. The inaccuracy is not absolute: the same word is pronounced accurately in some sentences and incorrectly in others.

Tomiinek commented 4 years ago

Great plot :+1:

Well, you can try adding tone labels explicitly as an input (and use for example stresses in English encoder). But the code does not include this extension, so you would have to add it yourself. We have already discussed this in #11

leijue222 commented 4 years ago

Great plot

Well, you can try adding tone labels explicitly as an input (and use for example stresses in English encoder). But the code does not include this extension, so you would have to add it yourself. We have already discussed this in #11

Tone labels are a good idea. What excites me is your example: input pinyin, and the output audio is correct (even my failed tone case is correct in your pre-trained model). The result of my training can't reach the same pitch as yours. You told me tones are not supported, but your model gets the Chinese tones right. I can't figure it out; it's too weird.

Tomiinek commented 4 years ago

You told me tones are not supported, but your model gets the Chinese tones right.

I used grapheme inputs directly, and Chinese was a low-resource language. Pinyin has explicit tone marks above characters, so the model probably understands them. Some of them are shared with French, such as é and è, but it does not seem to be a problem.

I cannot distinguish tones very well (even though I completed a few Chinese lessons on Duolingo :smile: ), so I am happy to hear that it works in the demo :smile:

You are using phonemes, right? Can you show a few examples of your phonemicized Chinese sentences (I mean the IPA representation)?

leijue222 commented 4 years ago

You are using phonemes, right? Can you show a few examples of your phonemicized Chinese sentences (I mean the IPA representation)?

Of course! Here:

000004|zh-biaobei|zh|zh/biaobei/wavs/000004.wav|spectrograms/013103.npy|linear_spectrograms/013103.npy|dèngxiǎopíng yǔ sāqiēěr hùiwù。|tə5ŋɕjɑu2phiɜŋ ʲy2 sɑ5tɕhiɛ5ər2 xu5i1wu5。

LJ001-0001|en-ljs|en|en/ljspeech/wavs/LJ001-0001.wav|spectrograms/000000.npy|linear_spectrograms/000000.npy|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition|pɹɪntɪŋ, ɪn ðə əʊnli sɛns wɪð wɪtʃ wiː ɑː at pɹɛzənt kənsɜːnd, dɪfəz fɹɒm məʊst ɪf nɒt fɹɒm ɔːl ðə ɑːts and kɹafts ɹɛpɹɪzɛntɪd ɪn ðə ɛksɪbɪʃən

all.txt This is all the supervised data generated by create_meta_file. I forgot to change the naming code for the spectrograms, so the spectrogram names are not idx, but rest assured, I verified that the wav and npy files correspond correctly.

Some of them are shared with French, such as é and è, but it does not seem to be a problem.

I am using your code to train pure Chinese (using phonemes) to determine whether Chinese is feasible or not. I changed use_phonemes and kept the other params.py defaults; after 24K steps it still couldn't speak any Chinese, so I stopped it. Some parameters must need to be changed. It also seems that my approach is completely wrong; the key is to train jointly with German to achieve sharing, not to train only Chinese.


So far, I have tried the following:

  1. use phonemes | succeed in code-switching, | problem is Chinese tones are often inaccurate | 50K steps
  2. use zh-pinyin and en-words | succeed in code-switching | problem is Chinese tones are not inaccurate and whole quality is not as good as 1 | 50K steps
  3. use CommonVoice, add one zh speaker(100+ audio files), add 9 en speaker(100+ audio files) | problem is English pronunciation is wrong | 300 epochs

2 is still training; 3 I did not look into further. I have a hunch that the result of 2 may be similar to or even worse than 1.

I have asked many questions and disturbed you quite a bit; I hope you don't mind. But your work is really good, so I really want to reproduce your results in both Chinese and English.

To be honest, I am a little powerless now. Maybe I should make a final attempt: use your default code-switching parameters, no phonemes, take the five languages from CSS10 and Common Voice, delete one of them (keeping German), and add English. There can be a few more English speakers, and for Chinese I can add the Biaobei speaker with 10,000 audio files (I don't know if it's ok to have so many audio files for one speaker; after all, the other speakers only have about 100 audio files each).


I got another idea of changing the input: replace pinyin with consonant + tone. I will share with you if I make progress.

Tomiinek commented 4 years ago

Hello again :slightly_smiling_face:

I am using your code to train pure Chinese (using phonemes) ...

Hmm, look at the phonemicized variants; I can't read tones from them. For example phiɜŋ can correspond both to 俜 or 瓶 (I don't know what they mean, I just used a pinyin converter), so it might be challenging for the model to guess the right tone just from the sentence context, isn't it?

use phonemes | succeed in code-switching, | problem is Chinese tones are often inaccurate | 50K steps

So this is not surprising for me.

use zh-pinyin and en-words | succeed in code-switching | problem is Chinese tones are not inaccurate and whole quality is not as good as 1 | 50K steps

But this is surprising :smile: ... What data did you use?

use CommonVoice, add one zh speaker(100+ audio files), add 9 en speaker(100+ audio files) | problem is English pronunciation is wrong | 300 epochs

Common Voice itself is really small; you should also include CSS10.

To be honest, I am a little powerless now.

Don't give up! :+1:

Maybe I should make a final attempt: use your default code-switching parameters, no phonemes, take the five languages from CSS10 and Common Voice, delete one of them (keeping German), and add English. There can be a few more English speakers, and for Chinese I can add the Biaobei speaker with 10,000 audio files (I don't know if it's ok to have so many audio files for one speaker; after all, the other speakers only have about 100 audio files each).

This definitely makes sense. We have already discussed it here: #29

I got another idea of changing the input: replace pinyin with consonant + tone. I will share with you if I make progress.

Could you please describe the idea in other words? Tones are coupled with vowels, so I do not understand ...

leijue222 commented 4 years ago

For example phiɜŋ can correspond both to 俜 or 瓶

It turns out that some phonemes produced by the phonemizer package have no tones...

But this is surprising 😄 ... What data did you use?

As I said before, just LJSpeech and Biaobei. The losses are here: (screenshot 2020-11-13 13:55:46) By the way, when is it best to stop training? I always train to 50K following your experimental settings, but loss_total drops very little after 20K.

Could you please describe the idea in other words? Tones are coupled with vowels, so I do not understand ...

Let me give an example; it's a bit difficult to describe. The annotation of the Chinese Biaobei dataset looks like ka2 er2 pu3, and some people process ka2 er2 pu3 into k a2 er2 pu3. In FastSpeech2, some people do this via MFA, and it can also be done with a code conversion (the baker.py file of TensorFlowTTS has this conversion).
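A small sketch of that kind of split (not the MFA or TensorFlowTTS code, just an illustration of turning tone-numbered pinyin into initial + final + tone tokens):

```python
# Standard pinyin initials, longest ("zh", "ch", "sh") first so they match before "z", "c", "s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_syllable(syllable):
    tone = syllable[-1] if syllable[-1].isdigit() else ""
    body = syllable[:-1] if tone else syllable
    for ini in INITIALS:
        if body.startswith(ini) and len(body) > len(ini):
            return f"{ini} {body[len(ini):]} {tone}".strip()
    return f"{body} {tone}".strip()   # no initial, e.g. "er2"

print(" - ".join(split_syllable(s) for s in "ka2 er2 pu3".split()))
# -> "k a 2 - er 2 - p u 3"
```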

Don't give up!

Thank you, brother, I will keep trying!

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

leijue222 commented 3 years ago

Hi, @Tomiinek I figured out why the pitch was inaccurate: the characters parameter string lacked digits! I changed the characters https://github.com/Tomiinek/Multilingual_Text_to_Speech/blob/83b1164e9d385742558b8b0171568816e7f51e44/params/singles/zh.json#L4 to "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890", and then the tones are right!
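An illustrative sketch of why the digits matter (not the repository's actual text front-end): characters missing from the characters parameter are typically filtered out before embedding, so without digits the tone marks never reach the model.

```python
def filter_to_vocab(text, characters):
    # Keep only characters that are part of the model's symbol set.
    return "".join(c for c in text if c in characters)

no_digits = "abcdefghijklmnopqrstuvwxyz -"
with_digits = no_digits + "1234567890"
print(filter_to_vocab("k a 2 - er 2 - p u 3", no_digits))    # tone digits silently dropped
print(filter_to_vocab("k a 2 - er 2 - p u 3", with_digits))  # "k a 2 - er 2 - p u 3"
```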

Here are samples: zh_sample.zip. I use the symbol - to represent the end of a syllable, and / to represent word segmentation.

s.wav
k a 2 - er 2 - p u 3 / p ei 2 - w ai 4 - s un 1 / w an 2 - h ua 2 - t i 1 / 。

s1.wav
b ao 2 - m a 3 / p ei 4 - g ua 4 / b o 3 - l uo 2 - an 1 / , d iao 1 - ch an 2 / y uan 4 - zh en 3 / d ong 3 - w eng 1 - t a 4 / 。

s2.wav
z ai 4 / y u 4 - zh ong 1 / , zh ang 1 - m ing 2 - b ao 2 / h ui 3 - h en 4 / j iao 1 - j ia 1 / , x ie 3 - l e / y i 2 - f en 4 / ch an 4 - h ui 3 - sh u 1 / 。

s1.wav and s2.wav are all right! But for s.wav, I don't understand why p u 3 is pronounced twice.

I have one more question: why don't you restrict the range of the mel to [-4, 4], i.e., normalize the mel to that range? It can speed up convergence and improve the results. I have not yet made this normalization change to your project.

Tomiinek commented 3 years ago

Congratulations! :partying_face:

Why don't you restrict the range of the mel to [-4, 4], i.e., normalize the mel to that range? It can speed up convergence and improve the results. I have not yet made this normalization change to your project.

Good question :slightly_smiling_face: I just decided to normalize it in this way without experimenting.

I do know that some other implementations use normalization to [-4, 4] and argue that it improves results over normalization to [-1, 1] or [0, 1]. I really dislike the magic constant 4 and that way of normalizing, and I haven't found any paper describing it or proving that it really does what I want. So I use the somewhat standard normalization used for images ...
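For comparison, a sketch of the two schemes being discussed (illustrative only; whichever is used, the inverse transform has to be applied again before Griffin-Lim or a vocoder):

```python
import numpy as np

def standardize(mel, mean, std):
    # Per-bin mean/variance normalization, roughly the "image-style" scheme described above.
    return (mel - mean[:, None]) / std[:, None]

def scale_to_range(mel, mel_min, mel_max, lo=-4.0, hi=4.0):
    # Min/max scaling into [-4, 4], as used by some other Tacotron implementations.
    return (mel - mel_min) / (mel_max - mel_min) * (hi - lo) + lo
```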

I am convinced that normalization is important and that:

What do you think about that?

leijue222 commented 3 years ago

It can be understood this way. Because predicting the mel is a regression problem, the MSE loss is used over the 80 dimensions. The prediction of each dimension is independent, and the total loss is effectively a weighted sum in which each dimension's weight is positively related to the magnitude of its own values. This means dimensions with larger values have a larger impact on the loss, while small-valued dimensions may never get learned. For example, if two predicted values [1000, 0.01] are each off by 10%, their squared errors are about 100² = 10,000 and 0.001² = 1e-6, so the small value is effectively ignored. If the 1000-valued dimension is always learned poorly, the loss will keep fluctuating on that dimension and never pay attention to the 0.01 dimension. Normalization makes the effective weight of each dimension roughly 1, so the model pays equal attention to each output. Because the predictions are all log-mels, the values differ a lot: the absolute values of the low-frequency dimensions may be smaller and the high-frequency dimensions much larger, so normalization is required.
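A tiny numeric check of this argument (illustrative only): with MSE, a 10% error on a bin whose values are around 1000 dwarfs a 10% error on a bin around 0.01, unless each dimension is normalized by its own statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two "mel bins" with very different scales across a toy dataset.
targets = np.stack([rng.normal(1000.0, 100.0, 1000),
                    rng.normal(0.01, 0.001, 1000)], axis=1)   # shape (1000, 2)
predictions = targets * 1.1                                   # every value off by 10%

print(((predictions - targets) ** 2).mean(axis=0))            # loss dominated by bin 0

# Per-bin standardization puts both bins on a comparable footing.
std = targets.std(axis=0)
print((((predictions - targets) / std) ** 2).mean(axis=0))    # similar contributions
```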

And here's a trick about pauses: the quality of open-source datasets is not good enough, and the pauses in some datasets are short, which makes it difficult for the model to learn natural pauses from punctuation. Normalization also makes it more convenient to add silent mel frames to control pauses at punctuation. For example, to synthesize 2 minutes of audio, split the long text into short sentences based on punctuation, and add silent mel frames when splicing them together. (If the mel is normalized to [-4, 4], the value of the silent mel is -4, which is more controllable.)

leijue222 commented 3 years ago

Thank you for your constructive comments.
d eng 4 - x iao 3 - p ing 2 / y u 3 / s a 4 - q ie 4 - er 3 - are good friends do you know?|en-ljs|english-41,chinese-62,english
https://user-images.githubusercontent.com/30276789/102761078-60f6ee00-43b1-11eb-8c69-1a4de83182f5.mp4

Mixed zh-en TTS is a success! A drawback is that voice cloning failed. I have a question: why does a small number of speakers cause voice cloning to fail, so that there are multiple voices within one sentence?

Tomiinek commented 3 years ago

Hello!

Unfortunately, I cannot play the audio from the previous post.

Well, if the model encounters just one or a few speakers for each language, it cannot learn the difference between language and voice characteristics and is not able to disentangle them (if you provide more speakers, the model can no longer explain the variation of voices by the language embedding, or better said, by language-specific encoder outputs, and figures out that the speaker embeddings are the thing it should rely on). Thus a change of language also changes the voice. The components of the model try to fight this issue, but you still need several training speakers for each language. Another approach is to use phoneme inputs and remove all language-specific components entirely.
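For readers wondering what those components look like, a hedged sketch of an adversarial speaker classifier with a gradient-reversal layer (the idea behind the reversal_classifier option; names and shapes here are illustrative, not the repository's exact modules):

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient so the encoder learns to *hide* speaker identity.
        return -ctx.scale * grad_output, None

class SpeakerAdversary(nn.Module):
    def __init__(self, encoder_dim, num_speakers, scale=1.0):
        super().__init__()
        self.scale = scale
        self.classifier = nn.Sequential(
            nn.Linear(encoder_dim, 256), nn.ReLU(), nn.Linear(256, num_speakers)
        )

    def forward(self, encoder_outputs):            # (batch, time, encoder_dim)
        reversed_feats = GradReverse.apply(encoder_outputs, self.scale)
        return self.classifier(reversed_feats)     # per-frame speaker logits
```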

michael-conrad commented 3 years ago

It can be understood this way. Because predicting the mel is a regression problem, the MSE loss is used over the 80 dimensions. The prediction of each dimension is independent, and the total loss is effectively a weighted sum in which each dimension's weight is positively related to the magnitude of its own values. This means dimensions with larger values have a larger impact on the loss, while small-valued dimensions may never get learned. For example, if two predicted values [1000, 0.01] are each off by 10%, their squared errors are about 100² = 10,000 and 0.001² = 1e-6, so the small value is effectively ignored. If the 1000-valued dimension is always learned poorly, the loss will keep fluctuating on that dimension and never pay attention to the 0.01 dimension. Normalization makes the effective weight of each dimension roughly 1, so the model pays equal attention to each output. Because the predictions are all log-mels, the values differ a lot: the absolute values of the low-frequency dimensions may be smaller and the high-frequency dimensions much larger, so normalization is required.

And here's a trick about pauses: the quality of open-source datasets is not good enough, and the pauses in some datasets are short, which makes it difficult for the model to learn natural pauses from punctuation. Normalization also makes it more convenient to add silent mel frames to control pauses at punctuation. For example, to synthesize 2 minutes of audio, split the long text into short sentences based on punctuation, and add silent mel frames when splicing them together. (If the mel is normalized to [-4, 4], the value of the silent mel is -4, which is more controllable.)

I'm working on getting the Cherokee language working in a fork of this repository.

The amount of data is low, and I'm training with 4 other languages; however, I've run into the issue where the TTS keeps rambling on after the end of the pronounced text. I'm guessing this is stop-token failure?

Any suggestions?

https://github.com/CherokeeLanguage/Cherokee-TTS