Tomiinek / Multilingual_Text_to_Speech

An implementation of Tacotron 2 that supports multilingual experiments with parameter-sharing, code-switching, and voice cloning.
MIT License
826 stars · 157 forks

The problem of voice quality and voice conversion #41

Closed jayzhu02 closed 3 years ago

jayzhu02 commented 3 years ago

Hello, this project is really nice, thank you for sharing it! I've trained an English and Chinese model with a total of hundreds of speakers per language, using LibriTTS, THCHS-30 (a Chinese dataset), and a private dataset. All data were resampled to 22 kHz and denoised. This time I used phonemes (phonemize) and, in particular, added tones for Chinese. It has now trained for 25k steps, the loss drops well, the result is OK, and the pronunciation is correct. But it still has problems:

  1. The inference audio always contains noise.
  2. When I try voice conversion (my main task), i.e. making a speaker who never speaks Chinese speak it well, the output audio is not his voice at all. That really confuses me.

So I'm wondering if you can give me some advice on improving it. Thanks! Here are my params:

"balanced_sampling": True, "batch_size": 80, "case_sensitive": False, "checkpoint_each_epochs": 20, "encoder_dimension": 256, "encoder_type": "generated", "epochs": 1000, "generator_bottleneck_dim": 1, "generator_dim": 2, "languages": ["zh", "en"], "language_embedding_dimension": 0, "learning_rate": 0.001, "learning_rate_decay_each": 10000, "learning_rate_decay_start": 10000, "use_phonemes":True, "multi_language": True, "multi_speaker": True, "perfect_sampling": True, "predict_linear": False, "reversal_classifier": True, "reversal_classifier_dim": 256, "reversal_classifier_w": 0.125, "reversal_gradient_clipping": 0.25, "speaker_embedding_dimension": 256,

Tomiinek commented 3 years ago

Hello 🙂

The parameters seem good to me.

I have a few questions:

jayzhu02 commented 3 years ago

Hello, thx for your reply!

[image attachment]

Tomiinek commented 3 years ago

Hello, I am sorry for the late response.

jayzhu02 commented 3 years ago

Hello, thx for your reply!

Tomiinek commented 3 years ago

Hello again 🙂

> I have another question: could having too many speakers influence the other speakers' voices? In fact, I only want to use 4 or 5 speakers, but each of them has only about 200 samples. So I added other speakers to make sure the target speakers can speak well. The result is that the model speaks well, but the voice doesn't sound like the target speaker himself/herself.

Well, 200 samples per speaker is not much. I was also not able to get a stable and accurate voice for speakers with few examples (I used those low-resource speakers just to help the model disentangle language and voice).
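
For intuition, this disentanglement is encouraged by the adversarial speaker classifier, which the reversal_classifier options in your parameters control. A minimal PyTorch sketch of the gradient-reversal idea behind it, just an illustration and not the exact code from this repository:

```python
# Minimal sketch of a gradient reversal layer (GRL), the mechanism behind an
# adversarial speaker classifier; illustrative only, not the repo's exact code.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight):
        ctx.weight = weight
        return x.view_as(x)  # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Flip and scale the gradient flowing back into the encoder, so the
        # encoder learns to *remove* speaker information while the classifier
        # head still learns to predict the speaker. (The repo's
        # reversal_gradient_clipping option additionally clips this gradient.)
        return -ctx.weight * grad_output, None

def speaker_adversarial_logits(encoder_outputs, classifier, weight=0.125):
    # weight corresponds to reversal_classifier_w in the parameters above
    reversed_states = GradReverse.apply(encoder_outputs, weight)
    return classifier(reversed_states)

# Toy usage with made-up sizes:
enc = torch.randn(4, 50, 256, requires_grad=True)  # (batch, time, encoder_dimension)
clf = torch.nn.Linear(256, 500)                    # 500 = example speaker count
logits = speaker_adversarial_logits(enc, clf)      # (4, 50, 500)
```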

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

krigeta commented 3 years ago

Hello @zj19980122, I also want to train it in three languages. Could you please tell me the workflow for setting this up?

jayzhu02 commented 3 years ago

Hello.

Follow the process in README.md. For the first two steps you may need to write your own code. You can find most usage details in each .py file, which shows how to pass the parameters.

krigeta commented 3 years ago

Hello, so far I have been able to set up the directories, and for now I am also using the CSS10 dataset so that I can get an understanding of how it works.

Now, before running the prepare_css_spectrograms.py file, I have some questions in mind:

  • I set up the comvoi.zip data, which is in five languages. Why are we setting that up to create the spectrograms? I think these are the languages we later use to get the output; please correct me on this.
  • I am using 3 datasets (Japanese, German, and Chinese) from the CSS10 dataset, and this data is used to train the model, right?
  • How does the workflow work? For example, if I have a Japanese speaker with a Japanese accent, how is that speaker able to speak Chinese, and which accent will it use? Sorry for asking so much; I am new to this, but this project is so awesome that I want to learn. Please help.

jayzhu02 commented 3 years ago


Sorry, I can't quite understand your first question.

If you want to use CSS10 as the dataset, you should create the corresponding txt files and the spectrograms, since the model needs these for training.

For the third question: a recording contains both the speaker's voice and the pronunciation of the words, and the MTS model can separate them by training both a speaker embedding and a language embedding. So if your dataset has multiple speakers and languages, you can easily do the voice cloning your question mentions. You can find an example in notebooks/code_switching_demo.ipynb.
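
To make that concrete, here is a toy sketch (not the repo's actual code; the sizes are made up, except that 256 matches speaker_embedding_dimension and encoder_dimension from the parameters earlier in this thread) of why voice and language can be recombined freely: the speaker identity enters the decoder as a separate embedding, independent of the text/language side.

```python
# Toy sketch of the conditioning idea, not the repo's actual code: the speaker
# identity is a separate embedding concatenated onto the encoder states, so a
# speaker seen only in Japanese can be combined with Chinese text/pronunciation.
import torch

num_speakers = 500                                   # made-up count
speaker_emb = torch.nn.Embedding(num_speakers, 256)  # speaker_embedding_dimension
encoder_out = torch.randn(1, 50, 256)                # Chinese text, (batch, time, encoder_dimension)

spk = speaker_emb(torch.tensor([3]))                 # voice of a Japanese-only speaker
spk = spk.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
decoder_in = torch.cat([encoder_out, spk], dim=-1)   # voice + pronunciation, recombined
print(decoder_in.shape)                              # torch.Size([1, 50, 512])
```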

krigeta commented 3 years ago

I am confused about how to set up those files, sir. If possible, may I contact you on other channels, like Discord or email, so I can explain every bit of it? Or should I share it here?

jayzhu02 commented 3 years ago


If you're asking about train.txt, here is an example line:

css10_ja-css10_ja-meian_1015|css10_ja|ja|/data/css10_ja/meian/meian_1015.wav|../spectrograms/css10_ja-css10_ja-meian_1015.npy|../linear_spectrograms/css10_ja-css10_ja-meian_1015.npy|小林は覗き込むように見て云った。僕もそっちへ行くよ。彼らの行く方角には|ko̞bäjäɕi hä no̞zo̞ki ko̞mɯᵝ jo̞ɯᵝni mite̞ iʔ tä 。 bo̞kɯᵝ mo̞so̞ttɕihe̞ ikɯᵝ jo̞。 käɽe̞ɽä no̞ jɯᵝkɯᵝe̞ käkɯᵝ nihä

The format is: id|speaker|language|wav_path|spectrogram_path|linear_spectrogram_path|text|phonemes

The files at spectrogram_path and linear_spectrogram_path will be generated automatically by prepare_css_spectrograms.py, and the phonemes column is optional. So you should make sure your metadata follows a similar format.
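
If it helps, generating such lines for your own dataset is just string formatting. Here is a minimal sketch; the transcripts dict and all paths are placeholders for your own data:

```python
# Minimal sketch: write train.txt lines in the format
#   id|speaker|language|wav_path|spectrogram_path|linear_spectrogram_path|text|phonemes
# The transcripts dict and all paths are placeholders for your own dataset.

def make_line(utt_id, speaker, language, wav_path, text, phonemes=""):
    spec = f"../spectrograms/{utt_id}.npy"           # the .npy files themselves are
    linear = f"../linear_spectrograms/{utt_id}.npy"  # created by prepare_css_spectrograms.py
    return "|".join([utt_id, speaker, language, wav_path, spec, linear, text, phonemes])

transcripts = {"meian_1015": "小林は覗き込むように見て云った。"}  # placeholder data
with open("train.txt", "w", encoding="utf-8") as f:
    for name, text in transcripts.items():
        utt_id = f"css10_ja-css10_ja-{name}"
        f.write(make_line(utt_id, "css10_ja", "ja",
                          f"/data/css10_ja/meian/{name}.wav", text) + "\n")
```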

If you feel confused about the parameters in the config JSON, I suggest you contact the author. 🙂

krigeta commented 3 years ago

Thank you so much for the explanation! I sent an email to the author, and he replied to the issues, hurray! I even brought up issues myself but got no response from anyone, so I was going through the existing issues and found that you were able to train the model and make it work, so I thought it would be a great idea to ask you here. And yes, you cleared up things I didn't even know; that's why I am here.

I will try to set things up through the steps you described. If I get stuck on some step, may I ask you for help?

My goal is to make a Japanese speaker able to speak English and Hindi.

jayzhu02 commented 3 years ago


No problem, feel free to ask 🙂

Tomiinek commented 3 years ago

Thank you @zj19980122 for your help. Would you mind continuing the discussion in #48 to keep similar topics together?

jayzhu02 commented 3 years ago


My pleasure.