Tomiinek / Multilingual_Text_to_Speech

An implementation of Tacotron 2 that supports multilingual experiments with parameter-sharing, code-switching, and voice cloning.
MIT License
826 stars · 157 forks

The problem of voice quality and voice conversion #41

Closed jayzhu02 closed 3 years ago

jayzhu02 commented 3 years ago

Hello, this project is really nice, thank you for sharing it! I've trained an English and Chinese model with a total of hundreds of speakers per language, using LibriTTS, THCHS-30 (a Chinese dataset), and a private dataset. All data were resampled to 22 kHz and denoised. This time I used phonemes (phonemize) and, in particular, added tones for Chinese. It has now trained for 25k steps, the loss drops well, the result is OK, and the pronunciation is correct. But it still has problems:

  1. The inference audio always contains noise.
  2. When I try voice conversion (my main task), i.e. making a speaker who never speaks Chinese speak it well, the output audio is not his voice at all. That really confuses me.

So I'm wondering if you can give me some advice on improving it. Thanks! Here are my params:

"balanced_sampling": True, "batch_size": 80, "case_sensitive": False, "checkpoint_each_epochs": 20, "encoder_dimension": 256, "encoder_type": "generated", "epochs": 1000, "generator_bottleneck_dim": 1, "generator_dim": 2, "languages": ["zh", "en"], "language_embedding_dimension": 0, "learning_rate": 0.001, "learning_rate_decay_each": 10000, "learning_rate_decay_start": 10000, "use_phonemes":True, "multi_language": True, "multi_speaker": True, "perfect_sampling": True, "predict_linear": False, "reversal_classifier": True, "reversal_classifier_dim": 256, "reversal_classifier_w": 0.125, "reversal_gradient_clipping": 0.25, "speaker_embedding_dimension": 256,

Tomiinek commented 3 years ago

Hello 🙂

The parameters seem good to me.

I have a few questions:

jayzhu02 commented 3 years ago

Hello, thx for your reply!

[image attachment]

Tomiinek commented 3 years ago

Hello, I am sorry for the late response.

jayzhu02 commented 3 years ago

Hello, thx for your reply!

Tomiinek commented 3 years ago

Hello again 🙂

> I have another question: could having too many speakers influence the other speakers' voices? In fact, I only want to use 4 or 5 speakers, but each of them has only about 200 samples. So I added other speakers to make sure the target speakers can speak well. The result is that the model speaks well, but the voice doesn't sound like the target speaker himself/herself.

Well, 200 samples per speaker is not much. I was also not able to get a stable and accurate voice for speakers with few examples (I used those low-resource speakers just to help the model disentangle language and voice).
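
For intuition, this disentanglement is encouraged by the adversarial speaker classifier, which the reversal_classifier options in your parameters control. A minimal PyTorch sketch of the gradient-reversal idea behind it, just an illustration and not the exact code from this repository:

```python
# Minimal sketch of a gradient reversal layer (GRL), the mechanism behind an
# adversarial speaker classifier; illustrative only, not the repo's exact code.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight):
        ctx.weight = weight
        return x.view_as(x)  # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Flip and scale the gradient flowing back into the encoder, so the
        # encoder learns to *remove* speaker information while the classifier
        # head still learns to predict the speaker. (The repo's
        # reversal_gradient_clipping option additionally clips this gradient.)
        return -ctx.weight * grad_output, None

def speaker_adversarial_logits(encoder_outputs, classifier, weight=0.125):
    # weight corresponds to reversal_classifier_w in the parameters above
    reversed_states = GradReverse.apply(encoder_outputs, weight)
    return classifier(reversed_states)

# Toy usage with made-up sizes:
enc = torch.randn(4, 50, 256, requires_grad=True)  # (batch, time, encoder_dimension)
clf = torch.nn.Linear(256, 500)                    # 500 = example speaker count
logits = speaker_adversarial_logits(enc, clf)      # (4, 50, 500)
```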

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

krigeta commented 3 years ago

Hello @zj19980122, I also want to train it in three languages. Could you please tell me the workflow for setting this up?

jayzhu02 commented 3 years ago

Hello.

Follow the process in README.md. For the first two steps you may need to write your own code. You can find most usage details in each .py file, which shows how to pass the parameters.

krigeta commented 3 years ago

Hello, so far I have been able to set up the directories, and for now I am also using the CSS10 dataset so that I can get an understanding of how it works.

Now, before running the prepare_css_spectrograms.py file, I have some questions in mind:

  • I set up the comvoi.zip data, which is in five languages. Why are we setting that up to create the spectrograms? I think these are the languages we later use to get the output; please correct me on this.
  • I am using 3 datasets (Japanese, German, and Chinese) from the CSS10 dataset, and this data is used to train the model, right?
  • How does the workflow work? For example, if I have a Japanese speaker with a Japanese accent, how is that speaker able to speak Chinese, and which accent will it use? Sorry for asking so much; I am new to this, but this project is so awesome that I want to learn. Please help.

jayzhu02 commented 3 years ago


Sorry, I can't quite understand your first question.

If you want to use CSS10 as the dataset, you should create the corresponding txt files and the spectrograms, since the model needs these for training.

For the third question: a recording contains both the speaker's voice and the pronunciation of the words, and the MTS model can separate them by training both a speaker embedding and a language embedding. So if your dataset has multiple speakers and languages, you can easily do the voice cloning your question mentions. You can find an example in notebooks/code_switching_demo.ipynb.
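
To make that concrete, here is a toy sketch (not the repo's actual code; the sizes are made up, except that 256 matches speaker_embedding_dimension and encoder_dimension from the parameters earlier in this thread) of why voice and language can be recombined freely: the speaker identity enters the decoder as a separate embedding, independent of the text/language side.

```python
# Toy sketch of the conditioning idea, not the repo's actual code: the speaker
# identity is a separate embedding concatenated onto the encoder states, so a
# speaker seen only in Japanese can be combined with Chinese text/pronunciation.
import torch

num_speakers = 500                                   # made-up count
speaker_emb = torch.nn.Embedding(num_speakers, 256)  # speaker_embedding_dimension
encoder_out = torch.randn(1, 50, 256)                # Chinese text, (batch, time, encoder_dimension)

spk = speaker_emb(torch.tensor([3]))                 # voice of a Japanese-only speaker
spk = spk.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
decoder_in = torch.cat([encoder_out, spk], dim=-1)   # voice + pronunciation, recombined
print(decoder_in.shape)                              # torch.Size([1, 50, 512])
```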

krigeta commented 3 years ago

I am confused about how to set up those files, sir. If possible, may I contact you on other channels, like Discord or email, so I can explain every bit of it? Or should I share it here?

jayzhu02 commented 3 years ago


If you're asking about train.txt, here is an example line:

css10_ja-css10_ja-meian_1015|css10_ja|ja|/data/css10_ja/meian/meian_1015.wav|../spectrograms/css10_ja-css10_ja-meian_1015.npy|../linear_spectrograms/css10_ja-css10_ja-meian_1015.npy|小林は覗き込むように見て云った。僕もそっちへ行くよ。彼らの行く方角には|ko̞bäjäɕi hä no̞zo̞ki ko̞mɯᵝ jo̞ɯᵝni mite̞ iʔ tä 。 bo̞kɯᵝ mo̞so̞ttɕihe̞ ikɯᵝ jo̞。 käɽe̞ɽä no̞ jɯᵝkɯᵝe̞ käkɯᵝ nihä

The format is: id|speaker|language|wav_path|spectrogram_path|linear_spectrogram_path|text|phonemes

The files at spectrogram_path and linear_spectrogram_path will be generated automatically by prepare_css_spectrograms.py, and the phonemes column is optional. So you should make sure your metadata follows a similar format.
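
If it helps, generating such lines for your own dataset is just string formatting. Here is a minimal sketch; the transcripts dict and all paths are placeholders for your own data:

```python
# Minimal sketch: write train.txt lines in the format
#   id|speaker|language|wav_path|spectrogram_path|linear_spectrogram_path|text|phonemes
# The transcripts dict and all paths are placeholders for your own dataset.

def make_line(utt_id, speaker, language, wav_path, text, phonemes=""):
    spec = f"../spectrograms/{utt_id}.npy"           # the .npy files themselves are
    linear = f"../linear_spectrograms/{utt_id}.npy"  # created by prepare_css_spectrograms.py
    return "|".join([utt_id, speaker, language, wav_path, spec, linear, text, phonemes])

transcripts = {"meian_1015": "小林は覗き込むように見て云った。"}  # placeholder data
with open("train.txt", "w", encoding="utf-8") as f:
    for name, text in transcripts.items():
        utt_id = f"css10_ja-css10_ja-{name}"
        f.write(make_line(utt_id, "css10_ja", "ja",
                          f"/data/css10_ja/meian/{name}.wav", text) + "\n")
```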

If you feel confused about the parameters in the config JSON, I suggest you contact the author. 🙂

krigeta commented 3 years ago

Thank you so much for the explanation! I sent an email to the author, and he replied to the issues, hurray! I even brought up issues myself but got no response from anyone, so I was going through the existing issues and found that you were able to train the model and make it work, so I thought it would be a great idea to ask you here. And yes, you cleared up things I didn't even know; that's why I am here.

I will try to set things up through the steps you described. If I get stuck on some step, may I ask you for help?

My goal is to make a Japanese speaker able to speak English and Hindi.

jayzhu02 commented 3 years ago


No problem, feel free to ask 🙂

Tomiinek commented 3 years ago

Thank you @zj19980122 for your help. Would you mind continuing the discussion in #48 to keep similar topics together?

jayzhu02 commented 3 years ago


My pleasure.