DigitalPhonetics / IMS-Toucan

Multilingual and Controllable Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart.
Apache License 2.0

Adding a New Language #45

Closed Winchester37 closed 1 year ago

Winchester37 commented 1 year ago

First of all, many thanks for a great repo. I'm kind of new to this stuff, so please forgive me. Can we train a new speaker and language using this repo, for example Turkish? I would be very grateful if you could provide information on what the structure of the dataset should be and how it should be prepared.

Flux9665 commented 1 year ago

Hi! Yes, enabling training models for new languages is one of the core goals of this repo. I tried my best to write detailed instructions in the README on how to train a new model on a new language.

Generally you don't need to train an aligner and a vocoder; those two components can just be downloaded from the release page and used as they are. For the TTS model that goes from text to spectrogram frames, you do need to train. You can either do that from scratch, or you can download the meta-learning checkpoint from the releases and finetune it on new data. The second option is generally recommended, because it's faster and gets by with less data.

The dataset can be in any format you have available; you just need to write a function that returns a dictionary whose keys are the paths to the wav files and whose values are the transcriptions of what is said in those audios. For Turkish you'd also have to add handling in the text processing, because Turkish is not yet supported. (This will change in a few versions, when we add support for a lot more languages.) Simply looking up the ID of Turkish in the espeak phonemizer (https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md) and adding it to the TextFrontend should be enough, like in the following:

https://github.com/DigitalPhonetics/IMS-Toucan/blob/4af4c7c04dc5bffae225dc507727261f702738fb/Preprocessing/TextFrontend.py#L128
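
For illustration, such a path-to-transcript function could look roughly like the sketch below. The function name, the metadata file, and its pipe-separated layout are assumptions about how a Turkish corpus might be stored, not part of the toolkit; adapt them to your data. The TextFrontend change itself is just a new language branch, analogous to the existing ones at the link above, that sets the espeak identifier for Turkish (`tr`).

```python
import os


# Hypothetical helper for a Turkish corpus: keys are absolute paths to wav files,
# values are the transcriptions of what is said in them. The metadata.csv layout
# ("filename|transcript") is only an example of one common format.
def build_path_to_transcript_dict_turkish(root="/data/turkish_corpus"):
    path_to_transcript = dict()
    with open(os.path.join(root, "metadata.csv"), encoding="utf8") as metadata:
        for line in metadata:
            if line.strip() == "":
                continue
            wav_name, transcript = line.strip().split("|", maxsplit=1)
            path_to_transcript[os.path.join(root, "wavs", wav_name)] = transcript
    return path_to_transcript
```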

You will then also need to add a sentence for testing the learning progress in the train loop, since it's a new language, like in the following:

https://github.com/DigitalPhonetics/IMS-Toucan/blob/4af4c7c04dc5bffae225dc507727261f702738fb/TrainingInterfaces/Text_to_Spectrogram/FastSpeech2/fastspeech2_train_loop.py#L53
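
As a rough sketch of what that looks like, assuming the linked code selects the demo sentence with one branch per language code (check the linked lines for the exact structure; the Turkish sentence below is just an example):

```python
lang = "tr"  # language code handed to the train loop

# One branch per language, with the new language added next to the existing ones.
if lang == "en":
    sentence = "This is a test sentence for the progress plots."
elif lang == "tr":
    sentence = "Bu, eğitim ilerlemesini görmek için kullanılan bir deneme cümlesidir."
```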

To train a model with the new data, have a look at the finetuning example:

https://github.com/DigitalPhonetics/IMS-Toucan/blob/4af4c7c04dc5bffae225dc507727261f702738fb/TrainingInterfaces/TrainingPipelines/FastSpeech2_finetuning_example.py#L31

There are a bunch of comments that guide you through the changes you need to make.
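
Sketched out, the central change is pointing the corpus preparation at your own transcript dictionary and language code. The `transcript_dict` argument and `prepare_fastspeech_corpus` appear in the tracebacks later in this thread; the other argument names below are assumptions, so follow the comments in the actual file:

```python
# Hypothetical excerpt of the edits inside FastSpeech2_finetuning_example.py
# (not runnable on its own): swap in your own transcript dictionary and the
# espeak code of the new language.
datasets.append(prepare_fastspeech_corpus(transcript_dict=build_path_to_transcript_dict_turkish(),
                                          corpus_dir=os.path.join("Corpora", "Turkish"),  # assumed argument name
                                          lang="tr"))                                     # assumed argument name
```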

AlexSteveChungAlvarez commented 1 year ago

Hello, I am trying to fine-tune with Quechua, but got this error:

python run_training_pipeline.py fine_qu --gpu_id 0,1 --resume_checkpoint /Models/FastSpeech2_Meta --finetune --model_save_dir /home/luis/Documentos/VCQuechua/IMS-Toucan/Models/Quechua
Preparing
Prepared an Aligner dataset with 0 datapoints in Corpora/Quechua.
Traceback (most recent call last):
  File "run_training_pipeline.py", line 77, in <module>
    pipeline_dict[args.pipeline](gpu_id=args.gpu_id,
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/TrainingInterfaces/TrainingPipelines/FastSpeech2_finetuning_example.py", line 36, in run
    datasets.append(prepare_fastspeech_corpus(transcript_dict=build_path_to_transcript_dict_quechua(),
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/Utility/corpus_preparation.py", line 38, in prepare_fastspeech_corpus
    train_aligner(train_dataset=aligner_datapoints,
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/TrainingInterfaces/Text_to_Spectrogram/AutoAligner/autoaligner_train_loop.py", line 48, in train_loop
    train_loader = DataLoader(batch_size=batch_size,
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/toucan_conda_venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 344, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/toucan_conda_venv/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 107, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0

Any ideas? I made all the changes you mention here. I had trouble with espeak: installing it with apt-get gave an older version which didn't support Quechua, so I had to build it from source. The first archive from the release was incomplete, so I uninstalled the apt-get installation, built the release source code from source, and now I get this error...

Flux9665 commented 1 year ago

It says in your output: Prepared an Aligner dataset with 0 datapoints in Corpora/Quechua. So apparently something went wrong with the creation of the data cache. If this cache is left over from an earlier attempt, you can just delete it and try again. If the same message is printed again, there are a bunch of things to check. If there is no other error message before this output telling you why there are no datapoints, it might mean that the path-to-transcript dictionary you pass to the dataset creation method has a problem. You could print the dict and make sure that it actually has some content, that the paths are absolute, and that the values are not empty.
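
A quick sanity check along those lines might look like this (assuming the Quechua builder from your traceback; import it from wherever you defined it):

```python
import os

# build_path_to_transcript_dict_quechua is the corpus function from the traceback above;
# import it from wherever you defined it.
path_to_transcript = build_path_to_transcript_dict_quechua()

print(f"{len(path_to_transcript)} entries in the dictionary")
for path, transcript in path_to_transcript.items():
    if not os.path.isabs(path) or not os.path.exists(path):
        print(f"problematic path: {path}")
    if transcript.strip() == "":
        print(f"empty transcript for: {path}")
```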

AlexSteveChungAlvarez commented 1 year ago

Hi, just some minutes ago I came across the solution too (I didn't notice you had answered): the cache file was created during the first attempt, but at that moment I had espeak 1.50, which didn't support Quechua, so it didn't work. I deleted it so the real cache could be created. Thank you for answering so soon! The next error I got was because 10 audios were empty, which I hadn't noticed. By the way, for all who will see this issue in the future, the only command needed to finetune on your data is: python run_training_pipeline.py <Name of your pipeline> --gpu_id <the ID(s) of the GPU(s) you are going to use> There's no need to edit the other arguments unless you trained the model from scratch and have the pretrained file in another directory. Remember that the name given to your pipeline is the key you introduce in pipeline_dict, with the run function imported from the finetuning script as its value.
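
For illustration, that registration in run_training_pipeline.py amounts to roughly the following (the import alias and the pipeline name are just example choices):

```python
# The key is the pipeline name you pass on the command line, the value is the
# run function imported from your pipeline script.
from TrainingInterfaces.TrainingPipelines.FastSpeech2_finetuning_example import run as finetune_quechua

pipeline_dict = {
    "fine_qu": finetune_quechua,
    # ... the other pipelines that already exist in the file
}
```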

AlexSteveChungAlvarez commented 1 year ago

After 1 hour of training I got this new error, which I think has something to do with the espeak installation I did:

Epoch: 1945
Spectrogram Loss: 0.20996291935443878
Time elapsed: 74 Minutes
Steps: 3890
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 5.41it/s]
Traceback (most recent call last):
  File "run_training_pipeline.py", line 77, in <module>
    pipeline_dict[args.pipeline](gpu_id=args.gpu_id,
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/TrainingInterfaces/TrainingPipelines/FastSpeech2_finetuning_example.py", line 52, in run
    train_loop(net=model,
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/TrainingInterfaces/Text_to_Spectrogram/FastSpeech2/fastspeech2_train_loop.py", line 271, in train_loop
    path_to_most_recent_plot = plot_progress_spec(net,
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/toucan_conda_venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/TrainingInterfaces/Text_to_Spectrogram/FastSpeech2/fastspeech2_train_loop.py", line 27, in plot_progress_spec
    tf = ArticulatoryCombinedTextFrontend(language=lang)
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/Preprocessing/TextFrontend.py", line 168, in __init__
    self.phonemizer_backend = EspeakBackend(language=self.g2p_lang,
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/toucan_conda_venv/lib/python3.8/site-packages/phonemizer/backend/espeak/espeak.py", line 45, in __init__
    super().__init__(
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/toucan_conda_venv/lib/python3.8/site-packages/phonemizer/backend/espeak/base.py", line 45, in __init__
    self._espeak = EspeakWrapper()
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/toucan_conda_venv/lib/python3.8/site-packages/phonemizer/backend/espeak/wrapper.py", line 60, in __init__
    self._espeak = EspeakAPI(self.library())
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/toucan_conda_venv/lib/python3.8/site-packages/phonemizer/backend/espeak/api.py", line 84, in __init__
    self._library = ctypes.cdll.LoadLibrary(str(espeak_copy))
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/toucan_conda_venv/lib/python3.8/ctypes/__init__.py", line 451, in LoadLibrary
    return self._dlltype(name)
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/toucan_conda_venv/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /tmp/tmpswtj5jb9/libespeak-ng.so.1.1.51: failed to map segment from shared object

Any ideas on this? I had trouble at the beginning with installing this version; I followed the instructions here to build and install it from source...

Flux9665 commented 1 year ago

> By the way, for all who will see this issue in the future, the only command needed to finetune on your data is: python run_training_pipeline.py <Name of your pipeline> --gpu_id <the ID(s) of the GPU(s) you are going to use>

Not fully correct: this is usually the command for training a model from scratch, starting from random initialization. Finetuning means that you don't start from scratch, but instead start from a different model and adapt it to fit the new data. It's just that in FastSpeech2_finetuning_example.py this is already hardcoded, so you don't need to specify the other arguments that you would usually need (--resume_checkpoint <path to pretrained model> --finetune). I think at some point I will write a script that automates this even further, so that finetuning on a new language becomes super easy. I will also try to increase the number of supported languages to many more soon.
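
For reference, the two cases look roughly like this on the command line (placeholders in angle brackets):

```
# Training from scratch (random initialization), or using a pipeline that already hardcodes finetuning:
python run_training_pipeline.py <name of your pipeline> --gpu_id <GPU ID(s)>

# Explicitly finetuning from an existing checkpoint:
python run_training_pipeline.py <name of your pipeline> --gpu_id <GPU ID(s)> --resume_checkpoint <path to pretrained model> --finetune
```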

Regarding your new error: whenever a checkpoint is saved, a spectrogram is generated and saved as an image to visualize the training progress. Generating this spectrogram is treated the same way as normal inference: the text goes in, is phonemized, and is then synthesized. Your error occurs at the point where the text is phonemized during inference. The error has nothing to do with anything the toolkit does, and it doesn't seem to be an error in espeak either; it probably has something to do with your system. I found a few suggestions here: https://stackoverflow.com/questions/13502156/what-are-possible-causes-of-failed-to-map-segment-from-shared-object-operation

As a side note: if you only have enough datapoints to fill two batches per epoch, you have to be careful not to train for too long, because otherwise the model might forget things if it only sees the same few datapoints over and over. In TTS, a lower loss does not necessarily mean a better model. It's best to save checkpoints every few epochs and just try them out to see at which point the model stops improving and gets worse again.

AlexSteveChungAlvarez commented 1 year ago

I just resumed the training later and it worked again; I don't really know why that error occurred. I would like to know how long I should train it. Your paper says you trained on only five minutes of data for new languages with good results. I want to do something similar with Quechua: I found this little dataset on the internet and am trying it first, but then I will build my own with only 5 minutes, as you did in your work... You say in the instructions in the README that 200,000 steps are enough for FastSpeech2, but maybe that's too much in this case... I would also like to know whether the 5-minute datasets you used were just one audio of 5 minutes or many audios that add up to 5 minutes in total...

Flux9665 commented 1 year ago

Yes, the 200,000 steps are meant for training a model from scratch. Training from scratch, however, needs a bit more data than just 5 minutes; I found that 30 minutes can already be enough for that.

For finetuning, you only need a few thousand steps, but it might vary depending on the data itself. The data I used were 50 samples that added up to a duration of 301 seconds, and I stopped noticing improvements after ~4,000 steps. The performance then got worse at around 60,000 steps because the model started overfitting too much.

AlexSteveChungAlvarez commented 1 year ago

OK, so my model has been finetuned for 7,786 steps by now. I tried it with the run_interactive_demo.py script and the results are: 1) The language seems to have been learnt correctly (I can't validate it myself since I don't speak it, but it sounds pretty good). 2) It's conditioned on the speaker of the dataset (and on the language too). When I try to clone my own voice with my model, it reproduces the voice of the speaker of the dataset, and when I try to clone in English it does the same, and it has a Quechua accent! By the way, when I try the default Meta model with my voice, it doesn't clone my voice either.

Flux9665 commented 1 year ago

1. Sounds great!
2. Yes, when you only finetune on the 5 minutes, it forgets its multispeaker and multilingual properties. To keep those, you can jointly train on multispeaker datasets in high-resource languages and mix your 5-minute low-resource corpus in. Because of the way we sample batches, the small dataset will be considered just as much as the large dataset(s). This way it learns the new language but keeps the other properties. And about voice cloning in general: it is not very exact. We're currently working on making it better, and the models in the recent release will be updated soon, but it's still far from perfect. It will always just produce a similar voice, not an exact replication, which is also not our goal, as that would make impersonation a bit too easy for my liking. If you train on data created with your own voice, then it will adapt much better, but zero-shot adaptation is a bit limited.

AlexSteveChungAlvarez commented 1 year ago

So, I am using the train_loop from meta_train_loop to train on LibriTTS + Quechua (Single Speaker) dataset with the finetuning example script, and I got the following error:

python run_training_pipeline.py fine_qu --gpu_id 0,1
Preparing
Prepared a FastSpeech dataset with 33 datapoints in Corpora/Quechua.
Prepared a FastSpeech dataset with 23854 datapoints in Corpora/LibriTTS.
Training model
  0%|                                                                                                             | 0/100000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/TrainingInterfaces/Text_to_Spectrogram/FastSpeech2/meta_train_loop.py", line 189, in train_loop
    batch = next(train_iters[index])
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/toucan_conda_venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/toucan_conda_venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1306, in _next_data
    raise StopIteration
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_training_pipeline.py", line 77, in <module>
    pipeline_dict[args.pipeline](gpu_id=args.gpu_id,
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/TrainingInterfaces/TrainingPipelines/FastSpeech2_finetuning_example.py", line 52, in run
    train_loop(net=model,
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/TrainingInterfaces/Text_to_Spectrogram/FastSpeech2/meta_train_loop.py", line 193, in train_loop
    batch = next(train_iters[index])
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/toucan_conda_venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/luis/Documentos/VCQuechua/IMS-Toucan/toucan_conda_venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1306, in _next_data
    raise StopIteration
StopIteration

I am using batch size >= 34; with anything less than that I get CUDA out of memory. What should I do?