Closed jayzhu02 closed 3 years ago
Hello :slightly_smiling_face:
The parameters seem good to me.
I have a few questions:
Hello, thanks for your reply!
For the first question, I looked at the audio generated in TensorBoard and it sounds OK. But during inference, all the audio always has noise in the background, no matter which speaker it is.
The speaker distribution is here. Most of the speakers have 50-100 samples. Since I use some public datasets, it may be hard for me to cluster male/female voices. I'll have a try.
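Clustering by pitch is one cheap heuristic for splitting male/female voices. A minimal numpy sketch, assuming a rough autocorrelation F0 estimate; the ~165 Hz threshold and all names here are my own illustration, not part of the repo:

```python
import numpy as np

def estimate_f0(wav, sr=22050, fmin=60.0, fmax=400.0):
    """Rough fundamental-frequency estimate via autocorrelation:
    pick the lag with the strongest self-similarity in the
    plausible pitch range and convert it back to Hz."""
    wav = wav - wav.mean()
    corr = np.correlate(wav, wav, mode="full")[len(wav) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(corr[lo:hi])
    return sr / lag

def split_by_pitch(f0, threshold=165.0):
    """Crude two-way split: typical male F0 sits below ~165 Hz,
    typical female F0 above it."""
    return "low" if f0 < threshold else "high"

# Synthetic check: half a second of a 120 Hz tone vs a 220 Hz tone.
sr = 22050
t = np.arange(sr // 2) / sr
low = np.sin(2 * np.pi * 120 * t)
high = np.sin(2 * np.pi * 220 * t)
print(split_by_pitch(estimate_f0(low, sr)))   # low
print(split_by_pitch(estimate_f0(high, sr)))  # high
```

On real speech you would average the estimate over many frames per speaker; a single clip can easily be octave-confused.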
Hello, I am sorry for the late response.
Hello, thanks for your reply!
Hello again :slightly_smiling_face:
I have another question: could too many speakers influence the other speakers' tone? In fact, I only want to use 4 or 5 speakers, but each of them has only about 200 samples. So I tried adding other speakers to make sure they can speak well. The result shows that the model can speak well, but the voice doesn't sound like the original speaker.
Well, 200 samples per speaker is not much. I was also unable to get a stable and accurate voice for speakers with few examples (I used these low-resource speakers just to help the model disentangle language and voice).
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello @zj19980122, I also want to train it in three languages, but could you please tell me the workflow for setting this up?
Hello.
For the first two steps you may need to write your own code. You can find most usage details in each .py file, which shows how to pass parameters. Follow the process in README.md.
Hello, so far I have been able to set up the directories, and for now I am also using the CSS10 dataset so that I can understand how it works.
Before running the prepare_css_spectrograms.py file I have some questions in mind:
- I set up the comvoi.zip data, which is in five languages. Why are we setting that up to create the spectrograms? I think these are the languages we later use to get the output. Please correct me on this.
- I am using 3 datasets from CSS10 (Japanese, German, and Chinese), and this data is used to train the model, right?
- How does the workflow work? For example, if I have a Japanese speaker with their accent, how is that Japanese speaker able to speak Chinese, and which accent will it use? Sorry for asking a lot, I am new to this, but this project is so awesome that I want to learn. Please help.
Sorry, I can't clearly understand your first question.
If you want to use CSS10 as the dataset, you should create the corresponding txt files and the spectrograms, since the model needs these for training.
For the third question: an audio clip contains both the speaker's accent and the pronunciation of the words, and the MTS model can separate them to train both speaker embeddings and language embeddings. So if your dataset has multiple speakers and languages, you can easily do voice cloning (as your question mentions). You can find an example in notebooks/code_switching_demo.ipynb.
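For intuition about what those spectrogram .npy files contain, here is a minimal numpy-only sketch of a magnitude STFT saved to disk. The window size, hop, and file name are illustrative only; prepare_css_spectrograms.py uses the repo's own audio parameters:

```python
import numpy as np

def stft_magnitude(wav, n_fft=1024, hop=256):
    """Magnitude STFT: slide a Hann window over the signal,
    FFT each frame, keep the magnitudes. Parameters are
    illustrative, not the repo's actual settings."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(wav) - n_fft + 1, hop):
        frame = wav[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.stack(frames, axis=1)  # (n_fft // 2 + 1, n_frames)

# One second of a 440 Hz tone at 22.05 kHz, saved the same way
# the .npy paths in train.txt are consumed during training.
sr = 22050
wav = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
spec = stft_magnitude(wav)
np.save("example_spectrogram.npy", spec)
print(spec.shape)  # (513, n_frames)
```

The real script additionally applies mel filtering and log compression, so treat this purely as a shape-level illustration.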
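To make the disentanglement idea concrete, here is a toy numpy sketch (all names and dimensions are hypothetical, loosely echoing the 256-dim embeddings in the config posted later in this thread). The text/language encoder states stay fixed while only the speaker embedding is swapped, which is what lets one voice speak another language:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: per-character encoder states plus a
# 256-dim embedding per speaker, as in Tacotron-style models.
n_chars, enc_dim, spk_dim = 12, 256, 256
encoder_states = rng.normal(size=(n_chars, enc_dim))  # from text + language
speaker_table = {"css10_ja": rng.normal(size=spk_dim),
                 "css10_zh": rng.normal(size=spk_dim)}

def condition(states, speaker):
    """Broadcast one speaker embedding onto every encoder step,
    then concatenate it to the states the decoder attends over."""
    spk = np.tile(speaker_table[speaker], (states.shape[0], 1))
    return np.concatenate([states, spk], axis=1)

# Same text/language states rendered in two different voices:
# only the speaker half of the conditioned tensor differs.
ja = condition(encoder_states, "css10_ja")
zh = condition(encoder_states, "css10_zh")
print(ja.shape)  # (12, 512)
print(np.allclose(ja[:, :enc_dim], zh[:, :enc_dim]))  # True
```

The real model adds an adversarial speaker classifier (the `reversal_classifier` params below) to push speaker identity out of the text encoding, but the swap-one-embedding mechanism is the core of the code-switching demo.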
I am confused about how to set up those files, sir. If possible, may I contact you on other socials like Discord or email so I can explain every bit of it? Or should I share it here?
If you're asking about train.txt, here is an example:
css10_ja-css10_ja-meian_1015|css10_ja|ja|/data/css10_ja/meian/meian_1015.wav|../spectrograms/css10_ja-css10_ja-meian_1015.npy|../linear_spectrograms/css10_ja-css10_ja-meian_1015.npy|小林は覗き込むように見て云った。僕もそっちへ行くよ。彼らの行く方角には|ko̞bäjäɕi hä no̞zo̞ki ko̞mɯᵝ jo̞ɯᵝni mite̞ iʔ tä 。 bo̞kɯᵝ mo̞so̞ttɕihe̞ ikɯᵝ jo̞。 käɽe̞ɽä no̞ jɯᵝkɯᵝe̞ käkɯᵝ nihä
The form of this is:
id|speaker|language|wav_path|spectrogram_path|linear_spectrogram_path|text|phoneme
spectrogram_path and linear_spectrogram_path will be automatically generated by prepare_css_spectrograms.py, and the phoneme field is optional. So you should make sure your metadata follows this form.
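As a sanity check on your own metadata, a small hypothetical parser for such lines (the dictionary keys are just labels I chose to mirror the field list above):

```python
def parse_meta_line(line):
    """Split one train.txt line into its pipe-separated fields.
    The trailing phoneme field is optional, so 7 or 8 fields
    are accepted."""
    fields = line.rstrip("\n").split("|")
    if len(fields) not in (7, 8):
        raise ValueError(f"expected 7-8 fields, got {len(fields)}")
    keys = ["id", "speaker", "language", "wav_path",
            "spectrogram_path", "linear_spectrogram_path",
            "text", "phoneme"]
    return dict(zip(keys, fields))

# Shortened version of the example line from this thread.
row = parse_meta_line(
    "css10_ja-css10_ja-meian_1015|css10_ja|ja"
    "|/data/css10_ja/meian/meian_1015.wav"
    "|../spectrograms/css10_ja-css10_ja-meian_1015.npy"
    "|../linear_spectrograms/css10_ja-css10_ja-meian_1015.npy"
    "|text|phonemes")
print(row["speaker"], row["language"])  # css10_ja ja
```

Running every line of your metadata through a check like this before training catches stray pipes in transcripts early.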
If you feel confused about the parameters in the config JSON, I suggest contacting the author. 🙂
Thank you so much for the explanation. I sent an email to the author and even opened issues, but got no response from anyone. So I was going through the issues and found that you were also able to train the model and make it work, so I thought it would be a great idea to ask you here. And yes, you cleared up things I didn't even know, which is why I am here.
I will try to set things up to the step you described; if I get stuck on some step, may I ask you again?
My goal is to make a Japanese speaker able to speak English and Hindi.
No problem. Feel free to ask🙂.
Thank you @zj19980122 for your help. Would you mind continuing the discussion in #48 to keep similar topics together?
My pleasure.
Hello, this project is so nice, thank you for sharing it! I've trained an English and Chinese model with a total of hundreds of speakers in each language, using LibriTTS, THCHS-30 (a Chinese dataset), and a private dataset. All data are resampled to 22 kHz and denoised. This time I tried using phonemes (via phonemize) and, in particular, added tones for Chinese. It has now trained for 25k steps and the loss drops well; the result is OK and the pronunciation is correct. But it still has problems:
So I'm wondering if you can give me some advice to optimize it. Thanks! Here are my params:
"balanced_sampling": True, "batch_size": 80, "case_sensitive": False, "checkpoint_each_epochs": 20, "encoder_dimension": 256, "encoder_type": "generated", "epochs": 1000, "generator_bottleneck_dim": 1, "generator_dim": 2, "languages": ["zh", "en"], "language_embedding_dimension": 0, "learning_rate": 0.001, "learning_rate_decay_each": 10000, "learning_rate_decay_start": 10000, "use_phonemes":True, "multi_language": True, "multi_speaker": True, "perfect_sampling": True, "predict_linear": False, "reversal_classifier": True, "reversal_classifier_dim": 256, "reversal_classifier_w": 0.125, "reversal_gradient_clipping": 0.25, "speaker_embedding_dimension": 256,