coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

Finetune XTTS for new languages #3992

Open anhnh2002 opened 2 months ago

anhnh2002 commented 2 months ago

Hello everyone, below is my code for fine-tuning XTTS for a new language. It works well in my case with over 100 hours of audio.

https://github.com/nguyenhoanganh2002/XTTSv2-Finetuning-for-New-Languages

jamestech-cmyk commented 2 months ago

> Hello everyone, below is my code for fine-tuning XTTS for a new language. It works well in my case with over 100 hours of audio.
>
> https://github.com/nguyenhoanganh2002/XTTSv2-Finetuning-for-New-Languages

Hello, man. I'm very pleased with your contribution. Can you provide your trained models? I want to check if they are working well.

anhnh2002 commented 2 months ago

> Hello everyone, below is my code for fine-tuning XTTS for a new language. It works well in my case with over 100 hours of audio. https://github.com/nguyenhoanganh2002/XTTSv2-Finetuning-for-New-Languages

> Hello, man. I'm very pleased with your contribution. Can you provide your trained models? I want to check if they are working well.

Due to copyright issues, I am currently unable to share the model's weights with you. I apologize for the inconvenience.

jamestech-cmyk commented 2 months ago

How long did it take you to train 100 hours of audio, and can you tell me your current computer configuration?

anhnh2002 commented 2 months ago

> How long did it take you to train 100 hours of audio, and can you tell me your current computer configuration?

It took just over 8 hours to train on 100 hours of audio with a single A100 40 GB.

mohataher commented 1 month ago

> Due to copyright issues, I am currently unable to share the model's weights with you.

Understandable. However, will you be able to share a snippet audio of what the model has produced?

anhnh2002 commented 1 month ago

> Due to copyright issues, I am currently unable to share the model's weights with you.

> Understandable. However, will you be able to share a snippet audio of what the model has produced?

Please find the relevant file at the following Google Drive link:

View File


developeranalyser commented 1 month ago

> Due to copyright issues, I am currently unable to share the model's weights with you.

> Understandable. However, will you be able to share a snippet audio of what the model has produced?

> Please find the relevant file at the following Google Drive link: View File

Hi man, what loss did you reach, and how many steps did you train for?

Is it possible to train the XTTSv2 model on about 10 hours of audio, and can it work well based only on those 10 hours?

Actually, I trained the model with your code and reached a loss of 0.5, but when I used the model the output was very bad and nothing was audible. I used the google/fleurs dataset for the Farsi language. First I expanded the vocab, then trained the DVAE, and then trained the model for 10,000 steps. What do you think, why am I getting such bad results?

Thank you very much

anhnh2002 commented 1 month ago

> Hi man, what loss did you reach, and how many steps did you train for?
>
> Is it possible to train the XTTSv2 model on about 10 hours of audio, and can it work well based only on those 10 hours?
>
> Actually, I trained the model with your code and reached a loss of 0.5, but when I used the model the output was very bad and nothing was audible. I used the google/fleurs dataset for the Farsi language. First I expanded the vocab, then trained the DVAE, and then trained the model for 10,000 steps. What do you think, why am I getting such bad results?

First, I recommend you do not train the DVAE (because you have a small amount of data). And I think 10 hours is not enough; the model will overfit your data. The losses I got were about 0.8.
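As a quick sanity check on dataset size before training, one can sum the clip durations in a metadata file. A minimal sketch below; the pipe-delimited layout and the `duration` column are illustrative assumptions, not the repo's actual metadata format:

```python
import csv
import io

# Hypothetical metadata with a per-clip duration column (seconds).
# Real XTTS metadata CSVs may not carry durations; adapt as needed.
SAMPLE_CSV = """audio_file|text|duration
wavs/clip_0001.wav|xin chao|3.2
wavs/clip_0002.wav|toi ten la Anh|4.5
wavs/clip_0003.wav|hom nay troi dep|2.8
"""

def total_hours(csv_text: str) -> float:
    """Sum the duration column and convert seconds to hours."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter="|")
    seconds = sum(float(row["duration"]) for row in reader)
    return seconds / 3600.0

hours = total_hours(SAMPLE_CSV)
print(f"{hours:.6f} hours of audio")
if hours < 10:
    print("Warning: likely too little data for a new language")
```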

developeranalyser commented 1 month ago

Thanks for your good work and your reply. I did that, and these are my losses:

avg_loader_time: 0.18475866317749023 (+0.00680994987487793)
avg_loss_text_ce: 0.036836352199316025 (-0.0016442164778709412)
avg_loss_mel_ce: 0.03139156103134155 (-0.001425366848707199)
avg_loss: 0.06822791695594788 (-0.003069579601287842)

But after inference on one of the sentences it was trained on, I get bad audio that is not even in the trained language; the sound produced is not close to the trained language at all.

result.zip

developeranalyser commented 1 month ago

How many epochs and steps are required for training on 100 hours of data? And how many hours did it take, my friend?

kunibald413 commented 1 month ago

Hi, nice work! You might want to try to create a merge request for it into a still maintained fork of coqui-ai: https://github.com/idiap/coqui-ai-TTS

I'm not involved with it, just an idea.

anhnh2002 commented 1 month ago

> How many epochs and steps are required for training on 100 hours of data? And how many hours did it take, my friend?

2 epochs work well for me
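The epoch/step relationship discussed above can be sketched with simple arithmetic; the average clip length and batch size below are illustrative assumptions, not values from the repo:

```python
def steps_per_epoch(hours_of_audio: float, avg_clip_seconds: float,
                    batch_size: int) -> int:
    """Estimate optimizer steps per epoch from total audio duration."""
    clips = int(hours_of_audio * 3600 / avg_clip_seconds)
    return clips // batch_size  # drop the last partial batch

# e.g. 100 h of audio, ~8 s clips on average, batch size 32
steps = steps_per_epoch(100, 8.0, 32)
print(steps)      # -> 1406 steps per epoch
print(steps * 2)  # -> 2812 total steps for 2 epochs
```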

developeranalyser commented 1 month ago

> 2 epochs work well for me

For a new language, after training do we also need to train the vocoder?

And if the loss decreases below 1 but the model still reads the text incorrectly, what is your opinion about this? What do you advise me to do to solve this problem? Maybe my main problem will be solved. Thank you.

developeranalyser commented 1 month ago

I don't want to train the model on the whole language.

I want to train on a limited set of sentences in a new language, for example on 1000 sentences. What is your opinion about this? Is it possible?

anhnh2002 commented 1 month ago

> I don't want to train the model on the whole language.
>
> I want to train on a limited set of sentences in a new language, for example on 1000 sentences. What is your opinion about this? Is it possible?

I think it's impossible to overfit the model with only 1000 sentences, especially for a new language. You'd need to extend the tokenizer and likely train a base model on a larger dataset of that language first.
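Extending the tokenizer before fine-tuning can be sketched as below. This is a simplified, pure-Python illustration of merging new-language tokens into an existing token-to-id mapping; the XTTS tokenizer is BPE-based, so real vocab extension also involves merges, and the toy tokens here are hypothetical:

```python
import json

# Toy subset of an existing vocab: token -> id (the real one is much larger).
vocab = {"[PAD]": 0, "[START]": 1, "hello": 2, "world": 3}

# Hypothetical new tokens for the target language.
new_tokens = ["xin", "chao", "toi", "xin"]  # duplicates are skipped

def extend_vocab(vocab: dict, tokens: list) -> dict:
    """Append unseen tokens with fresh ids after the current max id."""
    next_id = max(vocab.values()) + 1
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = next_id
            next_id += 1
    return vocab

extend_vocab(vocab, new_tokens)
print(json.dumps(vocab, ensure_ascii=False))
```

After extending the vocab, the model's text-embedding and output layers must also be resized to the new vocab size before any training step.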

developeranalyser commented 1 month ago

> I think it's impossible to overfit the model with only 1000 sentences, especially for a new language. You'd need to extend the tokenizer and likely train a base model on a larger dataset of that language first.

Thank you very much. So your opinion is that my problem is the small amount of data, and I cannot get good results from a model trained on only a few sentences; it must be trained on a large amount of data. I expanded the vocab and trained the DVAE. Honestly, I wanted to first test how the model behaves when trained on little data, and then run it on a lot of data.

Another question: what learning rate should I use, so that the model's knowledge of other languages is not lost, yet it still learns the new language well and quickly on a lot of data?

Thank you for paying the zakat of your knowledge :)

developeranalyser commented 1 month ago

> I don't want to train the model on the whole language.
>
> I want to train on a limited set of sentences in a new language, for example on 1000 sentences. What is your opinion about this? Is it possible?

> I think it's impossible to overfit the model with only 1000 sentences, especially for a new language. You'd need to extend the tokenizer and likely train a base model on a larger dataset of that language first.

In short, is it not possible to teach a language with 10 letters using about 100 sentences, just so that the model reads those 100 trained sentences correctly?

NathanTrance commented 1 week ago

Hey, great work!

I have a question: I want to train this model on Vietnamese, but with vi-north and vi-south as separate languages, each with its own metadata CSV. Does the multi-dataset training option support this, shuffling the vi-north and vi-south data together while keeping their language labels separate?

Thank you in advance!

anhnh2002 commented 1 week ago

Yes, you can.
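The two-dataset setup asked about above can be sketched as follows. This uses a self-contained stand-in dataclass rather than the trainer's real dataset-config class, and the field names and sample rows are illustrative:

```python
import csv
import io
import random
from dataclasses import dataclass

@dataclass
class DatasetConfig:
    """Stand-in for a per-dataset training config."""
    meta_csv: str   # metadata CSV contents (a file path in practice)
    language: str   # per-dataset language code

# Toy metadata for the two dialect datasets.
VI_NORTH = "audio|text\nn1.wav|cau mot\nn2.wav|cau hai\n"
VI_SOUTH = "audio|text\ns1.wav|cau ba\ns2.wav|cau bon\n"

def load_samples(cfgs: list) -> list:
    """Read every CSV, tag each row with its language, shuffle together."""
    samples = []
    for cfg in cfgs:
        reader = csv.DictReader(io.StringIO(cfg.meta_csv), delimiter="|")
        for row in reader:
            row["language"] = cfg.language
            samples.append(row)
    random.Random(0).shuffle(samples)  # seeded for reproducibility
    return samples

mixed = load_samples([DatasetConfig(VI_NORTH, "vi-north"),
                      DatasetConfig(VI_SOUTH, "vi-south")])
print([s["language"] for s in mixed])
```

Each sample keeps its own language tag, so batches drawn from the shuffled list mix both dialects while conditioning on the correct language code.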