DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!

Which recipe to use for training from scratch? #182

Closed · AlexSteveChungAlvarez closed this 3 months ago

AlexSteveChungAlvarez commented 4 months ago

@Flux9665 I want to try training the new version of the model. I see there are 4 recipes:

Which should I use?

Flux9665 commented 4 months ago

For multilingual and multispeaker setups, Toucan_Massive_stage1 is probably the best when training from scratch. I recommend finetuning instead of training from scratch, for which there are the finetuning example recipes.

It might be worth waiting a couple more days; I'm finalizing the next release right now, which comes with a much more stable architecture.
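
For orientation, recipes are launched through run_training_pipeline.py with the recipe name as argument, and a finetuning recipe essentially loads the released checkpoint before continuing training on your own data. The sketch below is only a rough illustration of that pattern in plain PyTorch, not the actual Toucan training loop; the checkpoint layout and the assumption that the model returns its own training loss are guesses, so refer to the finetuning example recipes in the repository for the real code.

```python
# Rough illustration of the finetuning pattern (not the actual Toucan training loop).
# Recipes themselves are launched via run_training_pipeline.py; check its --help for the flags.
import torch


def finetune(model, train_loader, pretrained_checkpoint_path, device="cuda", steps=5000):
    """Load released weights, then continue training on a smaller, high-quality dataset."""
    checkpoint = torch.load(pretrained_checkpoint_path, map_location=device)
    model.load_state_dict(checkpoint["model"], strict=False)  # assumes the weights are stored under "model"
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lower LR than training from scratch
    step = 0
    while step < steps:
        for batch in train_loader:
            loss = model(**batch)  # assumes the model computes and returns its training loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= steps:
                return model
    return model
```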

AlexSteveChungAlvarez commented 4 months ago

> For multilingual and multispeaker setups, Toucan_Massive_stage1 is probably the best when training from scratch. I recommend finetuning instead of training from scratch, for which there are the finetuning example recipes.

When doing finetuning, does it improve the base quality? I want to train from scratch with high-quality data; I listened to some of the data you released and it didn't sound that high quality. By the way, the released data seems to only have 4 languages, but in the paper you mentioned several more (including Quechua and Aymara). Are these languages going to be shared too?

> It might be worth waiting a couple more days; I'm finalizing the next release right now, which comes with a much more stable architecture.

What do you mean by "stable"? I will try it for sure!

I started the Toucan_Massive recipe yesterday, so that the preprocessed .pt files of the datasets would be ready by the time you answered me about the recipe, but it has already run for more than 12 hours and it's still on the first dataset (Libritts_R_all_clean), stuck at 98%:

[screenshot: preprocessing progress bar stuck at 98%]

These are the processes nvidia-smi shows:

[screenshots: nvidia-smi output listing the running preprocessing processes]

AlexSteveChungAlvarez commented 4 months ago

@Flux9665 I ran stage 1 and it has already spent 35 and a half hours preprocessing only libritts_R_all_clean, and the cache.pt file hasn't appeared yet. For the first 4 hours it ran smoothly up to 99%, then it went down to 98% and got stuck there. Preprocessing the same dataset with the previous Toucan version took only a few hours to finish. Is this normal?

AlexSteveChungAlvarez commented 4 months ago

Later I ran stage 1 again with no luck either: [screenshot] about 50 and a half hours processing only libritts_r_all_clean, and no cache.pt file has been built. I don't know whether this is normal behaviour or not, @Flux9665. Anyway, I hope this is solved in the coming release!

Flux9665 commented 4 months ago

> When doing finetuning, does it improve the base quality? I want to train from scratch with high-quality data; I listened to some of the data you released and it didn't sound that high quality. By the way, the released data seems to only have 4 languages, but in the paper you mentioned several more (including Quechua and Aymara). Are these languages going to be shared too?

Yes, finetuning improves quality.

The data we released is not from our model; it's from the MMS model by Meta.

We released all of the data; Hugging Face only shows parts of the first 4 languages in the preview because the dataset is too big.

Aymara is not included in this data, we just evaluated our model with the help of Aymara speakers.

> I ran stage 1 and it has already spent 35 and a half hours preprocessing only libritts_R_all_clean, and the cache.pt file hasn't appeared yet.

It sounds like your machine does not have enough RAM and is using swap. Try splitting the large dataset into smaller chunks.
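
A minimal sketch of what "splitting into smaller chunks" can look like, assuming the corpora are represented as {filepath: transcript} dicts as discussed later in this thread; the cache-building call in the usage comment is an assumption modelled on the recipes, not the exact Toucan API:

```python
from itertools import islice


def split_corpus(path_to_transcript: dict, num_chunks: int):
    """Split a {filepath: transcript} dict into roughly equal chunks so that each
    cache is built (and held in RAM) separately instead of all at once."""
    items = list(path_to_transcript.items())
    chunk_size = (len(items) + num_chunks - 1) // num_chunks
    iterator = iter(items)
    return [dict(islice(iterator, chunk_size)) for _ in range(num_chunks)]


# Usage sketch (function names are assumptions, check the recipe files):
# for i, chunk in enumerate(split_corpus(build_path_to_transcript_libritts(), num_chunks=6)):
#     prepare_tts_corpus(chunk, cache_dir=f"Corpora/libritts_part_{i}", lang="eng")
```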

AlexSteveChungAlvarez commented 3 months ago

I just saw that there is a pttd_cache.pt file, but it's in the original data directory, unlike in v2 where it was created in the target data directory under the IMS-Toucan directory. It is only 31 MB, though, so I'll try splitting the libritts-r data the way you do with mls_german.

Flux9665 commented 3 months ago

The pttd_cache.pt is just a cache of the mapping from file paths to transcriptions, not the actual audio. I removed these caches in the newest version; they were kind of unnecessary.
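
To make the size difference concrete: such a cache is essentially just a saved dictionary of strings, which is why it stays in the tens of megabytes even for a large corpus, while the audio feature cache is what grows huge. A hypothetical illustration, not the actual Toucan implementation:

```python
import os

import torch


def load_or_build_pttd(corpus_dir, build_fn, cache_path="pttd_cache.pt"):
    """Return a {filepath: transcript} dict, reusing a small on-disk cache if present."""
    if os.path.exists(cache_path):
        return torch.load(cache_path)
    path_to_transcript = build_fn(corpus_dir)  # e.g. walks the corpus and reads the transcript files
    torch.save(path_to_transcript, cache_path)  # only strings are stored, no audio
    return path_to_transcript
```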

khanifah-gif commented 1 month ago

> For multilingual and multispeaker setups, Toucan_Massive_stage1 is probably the best when training from scratch. I recommend finetuning instead of training from scratch, for which there are the finetuning example recipes.
>
> It might be worth waiting a couple more days; I'm finalizing the next release right now, which comes with a much more stable architecture.

Hello, how about monolingual training from scratch? Which recipe should I use for that?

Flux9665 commented 1 month ago

For monolingual training, you can adapt the Nancy recipe.
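
In practice, "adapting the Nancy recipe" usually comes down to copying the recipe file, swapping in a path-to-transcript builder for your own corpus, and setting your language code, while the training loop itself stays untouched. A hedged sketch, assuming an LJSpeech-style metadata.csv with two pipe-separated columns; the prepare_tts_corpus / train_loop calls in the trailing comment are assumptions about the recipe structure, not the exact API:

```python
import csv
import os


def build_path_to_transcript_my_corpus(root="/data/my_corpus"):
    """Hypothetical corpus builder: returns {absolute_wav_path: transcript}."""
    path_to_transcript = {}
    with open(os.path.join(root, "metadata.csv"), encoding="utf8") as metadata:
        for file_id, transcript in csv.reader(metadata, delimiter="|"):
            path_to_transcript[os.path.join(root, "wavs", file_id + ".wav")] = transcript
    return path_to_transcript


# In the copied recipe, the substantive changes are then roughly:
# train_set = prepare_tts_corpus(build_path_to_transcript_my_corpus(),
#                                cache_dir="Corpora/my_corpus",
#                                lang="eng")  # your ISO language code
# train_loop(net=model, datasets=[train_set], ...)  # the rest stays as in the Nancy recipe
```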

khanifah-gif commented 1 month ago

> For monolingual training, you can adapt the Nancy recipe.

Awesome, thanks a lot! I'll check this out.

khanifah-gif commented 1 month ago

> For monolingual training, you can adapt the Nancy recipe.

Normally, if I train from scratch, is the resulting model also controllable through parameters like energy, pitch, duration scaling, etc.? I have trained from scratch, but when I try to control these parameters, the audio output does not change.
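
For reference, the controllability discussed in this thread is applied at inference time through scaling arguments on the inference interface, not through training flags. A minimal sketch, assuming a ToucanTTSInterface with these argument names; the module path, checkpoint path and keyword names are assumptions and may differ between releases, so check the InferenceInterfaces code for the exact signature:

```python
from InferenceInterfaces.ToucanTTSInterface import ToucanTTSInterface  # module path assumed, check your checkout

tts = ToucanTTSInterface(device="cuda", tts_model_path="Models/ToucanTTS_MyModel/best.pt")  # hypothetical checkpoint
tts.set_language("eng")
tts.read_to_file(text_list=["This sentence should come out slower, lower and flatter."],
                 file_location="controlled_sample.wav",
                 duration_scaling_factor=1.3,   # > 1.0 stretches durations (slower speech)
                 pitch_variance_scale=0.8,      # < 1.0 flattens the pitch contour
                 energy_variance_scale=0.8)     # < 1.0 flattens the energy contour
```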