coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

Errors when trying to train SC-GlowTTS #483

Closed The0nix closed 3 years ago

The0nix commented 3 years ago

Describe the bug

I am trying to train an SC-GlowTTS model. I downloaded the config from the latest release and tried to launch TTS/bin/train_glow_tts.py. However, I hit a series of errors about values missing from the config. First it was stats_path, then use_noise_augment, and now I get AssertionError: 22050 vs 48000, even though the config comments state "wav sample-rate. If different than the original data, it is resampled". What is the proper way to train SC-GlowTTS? :)

To Reproduce

Steps to reproduce the behavior:

  1. Download and unzip SC-GlowTTS config from v0.0.13 release (https://github.com/coqui-ai/TTS/releases/download/v0.0.12/tts_models--en--vctk--sc-glowtts-transformer.zip)
  2. Download and unzip the VCTK dataset, e.g. from here (link from the SC-GlowTTS repo)
  3. Replace the dataset path in the config with your own
  4. Download and install TTS: git clone https://github.com/coqui-ai/TTS && cd TTS && pip install -e .
  5. Execute with your config path from the TTS directory: python TTS/bin/train_glow_tts.py --config_path /path/to/config/

Expected behavior

The model trains without errors.

erogol commented 3 years ago

It is not resampled. Either you need to resample the files in advance or set resample: true in the config file. That comment in the config is outdated.
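
Before resampling, it can help to confirm which files actually differ from the configured rate. The following is a minimal sketch, assuming the dataset consists of uncompressed PCM WAV files readable by Python's standard `wave` module (other formats such as FLAC would need a different reader). The helper name `find_mismatched_wavs` is hypothetical and not part of TTS.

```python
import wave
from pathlib import Path


def find_mismatched_wavs(root, expected_rate=22050):
    """Return (path, rate) pairs for WAV files whose sample rate differs from expected_rate."""
    mismatched = []
    # Walk the dataset directory recursively; sorting keeps the output stable.
    for path in sorted(Path(root).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_rate:
            mismatched.append((path, rate))
    return mismatched
```

Calling `find_mismatched_wavs("/path/to/dataset", expected_rate=22050)` returns an empty list when every file already matches the sample rate in the config. Any file it reports would need to be resampled in advance or handled by the resample option mentioned above.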

The0nix commented 3 years ago

Oh, ok, thank you! I resampled it anyway. How was I supposed to know this? Did I miss something?

And what is the intended way of dealing with missing parameters in the config? I can go through the options that were missing for SC-GlowTTS and add some meaningful default values to the code. However, that would also require adding default values for any future parameters, which makes maintenance harder. Maybe a new type of config with a schema and default values would be better (I saw you discussing new configs in a parallel issue). Or maybe it would be reasonable to store, alongside each config, the commit hash of a version that is known to work?
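
The idea of a config with a schema and defaults can be sketched with a plain Python dataclass. This is only an illustration, not the TTS implementation: the field names are the parameters mentioned in this thread, every default value is an assumption chosen for the example, the fields are kept flat for brevity, and `TrainConfigSketch` and `load_config` are hypothetical names.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TrainConfigSketch:
    # Field names come from this thread; every default value here is an
    # illustrative assumption, not the value TTS actually uses.
    sample_rate: int = 22050
    resample: bool = False
    stats_path: Optional[str] = None
    use_noise_augment: bool = False
    data_dep_init_iter: int = 1


def load_config(raw: dict) -> TrainConfigSketch:
    """Build a config from a dict, filling missing keys with defaults and rejecting unknown keys."""
    known = {f.name for f in fields(TrainConfigSketch)}
    unknown = set(raw) - known
    if unknown:
        # Failing early on misspelled or obsolete keys avoids silent misconfiguration.
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    return TrainConfigSketch(**raw)
```

With this shape, an older config file that lacks a newly added parameter still loads, because the dataclass supplies the default, while a misspelled key fails immediately instead of being ignored.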

I can also add GlowTTS config to the repo and edit comments about resampling in existing configs.

The0nix commented 3 years ago

There was also a data_dep_init_iter parameter, which I assume needs to be set to 1, but I am not sure.

erogol commented 3 years ago

You can check the default values here: https://github.com/coqui-ai/TTS/blob/coqpit-refactor/TTS/tts/configs/glow_tts_config.py

Soon we plan to refactor the config management system of TTS and this is the config from that branch.

loganhart02 commented 3 years ago

I was also trying to train SC-GlowTTS on the LibriTTS clean-100 and clean-360 datasets. I did get it to train, but the output sounds robotic. My config is the same as the one for the released SC-GlowTTS model, except that I use a batch size of 40 (due to GPU memory), set trim silences to True, and use English (US) characters. Do you have any tips for reproducing the results from the paper in terms of naturalness?

erogol commented 3 years ago

Maybe @Edresson can help as the one who trained the models.

My take is that LibriTTS is a harder dataset, so it is more difficult to reach the same quality on it.

Edresson commented 3 years ago

@loganhart420 Hello :) Which speaker encoder are you using? How many training steps? Can you share some samples?

loganhart02 commented 3 years ago

> @loganhart420 Hello :) Which speaker encoder are you using? How many training steps? Can you share some samples?

Hey @Edresson :) I'm using the AngleProto loss encoder (one I trained myself on all the same datasets with basically the same config; I just tweaked the audio params). I have been training for 50-60k steps, and I am noticing only a slight difference in audio quality and similarity between 4k and 60k steps.

The attached audio sample is what it sounded like at 9k steps, and there was no difference in quality or similarity at 60k steps. Depending on the audio params, I was able to get it to sound slightly better in some training runs and much worse in others.

https://drive.google.com/file/d/16Nq8vJilh_8Vtb9E5heeDEL0hDLtotBJ/view?usp=sharing

Oh, also, you will need to open that Google Drive link in the video player. I don't know why it won't play directly, at least on my end.

Edresson commented 3 years ago

@loganhart420 I recommend that you use the same speaker encoder used in the paper and available here (trained for 330k steps).

In SC-GlowTTS the quality of the speaker encoder is fundamental, because the model receives no other information about the speaker.

As your batch size is smaller, you should train for longer. In addition, in the article we trained the model for 150k steps on VCTK, which is much smaller and has only 108 speakers. Since you are training on a larger dataset, you need to train for more steps.
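
One rough way to reason about "train for longer" is to compare the number of training samples seen, which is steps times batch size. The sketch below applies that heuristic. The heuristic itself is an assumption made for illustration, not a rule stated in this thread or in the paper, and the function name `steps_for_equal_samples` is hypothetical. The reference batch size must be supplied by the reader, since it is not given here.

```python
import math


def steps_for_equal_samples(reference_steps, reference_batch_size, batch_size):
    """Steps needed at batch_size to see as many samples as the reference run."""
    if min(reference_steps, reference_batch_size, batch_size) <= 0:
        raise ValueError("all arguments must be positive")
    # samples seen = steps * batch size; solve for steps at the new batch size.
    return math.ceil(reference_steps * reference_batch_size / batch_size)
```

For instance, with a purely hypothetical reference batch size of 64, `steps_for_equal_samples(150_000, 64, 40)` gives 240000. This only equalizes the sample count; it does not account for the larger dataset, which by the reasoning above calls for even more steps.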

loganhart02 commented 3 years ago

Great! Thanks for the advice. I'll download the encoder, train for at least twice as long, and see how it performs. If I end up producing anything good, I'll make sure to share :)

zubairahmed-ai commented 3 years ago

> It is not resampled. Either you need to resample the files in advance or set resample: true in the config file. That comment in the config is outdated.

This helped me too, more here

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

erogol commented 3 years ago

Ping again if this is still an issue.

DesiKeki commented 2 years ago

> Great! Thanks for the advice. I'll download the encoder, train for at least twice as long, and see how it performs. If I end up producing anything good, I'll make sure to share :)

Any luck with this, @loganhart420? I am facing the exact same issue with the same robotic result. Please help if you were able to resolve it.