TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supported languages include English, French, Korean, Chinese, and German; easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Need advice on training Russian model #412

Closed · Nistrian closed this issue 3 years ago

Nistrian commented 3 years ago

Hello! I am very impressed with your work; it gives me many opportunities and I rely on it a lot. Some time ago I asked questions about solving some problems when training a model for Russian. I overcame all of those difficulties and trained the model using MFA (Montreal Forced Aligner). However, the results were disappointing. The model produces speech that sounds similar to Russian, but it cannot be called "good". The speech quality is not at all comparable to that obtained with a pretrained English model, and there is a big problem with word stress. Since I have very little knowledge in this area, I would like advice on what features should be taken into account when training a Russian model. If there are people who have been able to train a good Russian model, I would be very grateful for their help.

dathudeptrai commented 3 years ago

> Since I have very little knowledge in this area, I would like advice on what features should be taken into account when training a Russian model. If there are people who have been able to train a good Russian model, I would be very grateful for their help.

Hi, can you provide some information about your dataset (sample rate in Hz, number of samples, single- or multi-speaker, ...) and what you did?

Nistrian commented 3 years ago

@dathudeptrai My dataset consists of 25 hours of recordings from a single speaker. Sample lengths range from 3 to 10 seconds, at 22050 Hz. I followed the scripts described in examples/mfa_extraction.
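
For anyone reproducing this setup, it may help to verify the corpus against those numbers before running the MFA extraction. Below is a minimal sketch, assuming the clips are WAV files in a `wavs/` directory; the directory name and the use of the `soundfile` package are assumptions, not part of the repo scripts:

```python
# Sanity-check the corpus described above: 22050 Hz, clips of 3-10 seconds.
# The wavs/ path is a placeholder; point it at your actual dataset.
from pathlib import Path

import soundfile as sf  # pip install soundfile

EXPECTED_SR = 22050
MIN_SEC, MAX_SEC = 3.0, 10.0

total_sec = 0.0
for wav_path in sorted(Path("wavs").glob("*.wav")):
    info = sf.info(str(wav_path))
    duration = info.frames / info.samplerate
    total_sec += duration
    if info.samplerate != EXPECTED_SR:
        print(f"{wav_path.name}: unexpected sample rate {info.samplerate}")
    if not MIN_SEC <= duration <= MAX_SEC:
        print(f"{wav_path.name}: duration {duration:.2f}s outside 3-10 s")

print(f"Total audio: {total_sec / 3600:.2f} hours")
```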

Initially, I extracted the durations using the single-speaker configuration (examples/fastspeech2/conf/fastspeech2.v1.yaml) and the dataset type ljspeech, which led to a dimension error. Later, I followed the entire extraction algorithm exactly as described in the README (steps 1, 2, 4, and 5). To do this, I changed the transcription to the form `name_file|text|name_speaker`, with the same speaker name on every line (a conversion sketch is shown after the training curves below). All this allowed me to train a model with the characteristics described earlier. Here are the training curves:

[image: training curves]
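
Since the multi-speaker extraction path expects a three-column transcript, the conversion step described above might look like the following minimal sketch, assuming a two-column `name_file|text` source file; the file names and the speaker label are placeholders, not names used by the repo:

```python
# Convert a two-column transcript (name_file|text) into the three-column
# form name_file|text|name_speaker, using the same speaker name everywhere,
# as described in the comment above. File names here are placeholders.
SPEAKER = "speaker1"  # hypothetical single-speaker label

with open("metadata.csv", encoding="utf-8") as src, \
        open("metadata_with_speaker.csv", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.rstrip("\n")
        if not line:
            continue
        name_file, text = line.split("|", 1)
        dst.write(f"{name_file}|{text}|{SPEAKER}\n")
```

Keeping the speaker column constant reproduces the single-speaker case while still satisfying the multi-speaker transcript format.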

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.