DigitalPhonetics / IMS-Toucan

Multilingual and Controllable Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart.
Apache License 2.0

Memory Overload from Multiple Python Instances During TTS Training: Windows 10 #169

Closed: lpscr closed this issue 1 week ago

lpscr commented 2 weeks ago

Hi, this is so amazing! I have tested many TTS repositories before, but this is the first time I've seen support for so many languages. Thank you very much for all your hard work; this is incredibly impressive.

I tried to train on a new dataset, and when the training started creating the aligner data at the beginning, I ran out of memory after a few minutes despite having 32 GB of RAM. This happened because many Python instances opened and consumed all my memory, leading to a crash. The only solution I found was to go to the Task Manager and manually end some Python instances. However, I don't think this is the correct solution, and I fear I may have broken something in the process.

I am using Windows 10 with an RTX 4090. I'm sending a small video to help you better understand what's going on; I'm not sure why so many Python instances are opening.

I ran another test on a different machine I have, and the same problem occurred again.

test.zip

Thank you very much for your time and any help, I really appreciate it.

Flux9665 commented 1 week ago

Hi! Yes, the memory usage is a bit problematic for personal computers. The toolkit is designed around the server infrastructure we have at the University of Stuttgart, since the toolkit was originally intended for teaching. We have a very slow file server, but each machine has a terabyte of RAM, so I just load everything into RAM and keep it there for quick access. The creation of the dataset takes a ton of memory; after it is complete it will require less memory, but it still needs to fit entirely into RAM. You can try splitting the dataset into smaller chunks, like I did here:

https://github.com/DigitalPhonetics/IMS-Toucan/blob/1525e60054f84ffb94aed085ebad85dbb2357dc6/TrainingPipelines/ToucanTTS_Massive_stage1.py#L56C5-L63C70
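For illustration, the chunking idea looks roughly like this (a simplified sketch, not the linked code; path_to_transcript and make_dataset are placeholders for the toolkit's real objects):

    # Simplified sketch of the chunking idea: build the dataset one slice at a
    # time so that peak RAM during creation stays bounded.
    items = list(path_to_transcript.items())  # placeholder: {wav_path: transcript}
    n_chunks = 10
    chunk_size = max(1, len(items) // n_chunks)
    datasets = []
    for start in range(0, len(items), chunk_size):
        chunk = dict(items[start:start + chunk_size])
        datasets.append(make_dataset(chunk))  # placeholder for the toolkit's dataset constructor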

If your data doesn't fit into RAM even with this trick, then you unfortunately just need to use less data. Making this work using the disk instead of RAM is possible, but it would require a pretty major modification to both the AlignerDataset and the TTSDataset classes. The number of Python instances you see is just due to the multiprocessing during AlignerDataset creation, which makes it faster. You can reduce the number of processes here if you want, but this will not reduce the memory requirements: there will be fewer processes, but each will require more RAM.

https://github.com/DigitalPhonetics/IMS-Toucan/blob/1525e60054f84ffb94aed085ebad85dbb2357dc6/Utility/corpus_preparation.py#L16
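As a generic illustration of what reducing the process count means (my own sketch, not the linked code; extract_features and files are placeholders):

    # Generic multiprocessing sketch: fewer worker processes means fewer Python
    # instances in the Task Manager, but the total data still has to fit in RAM,
    # so each remaining worker ends up holding more of it.
    from multiprocessing import Pool, cpu_count

    num_workers = min(4, cpu_count())  # e.g. cap the workers at 4 instead of using all cores
    with Pool(processes=num_workers) as pool:
        features = pool.map(extract_features, files)  # placeholders for the real work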

lpscr commented 1 week ago

Thank you very much for the quick reply! I'll try it and let you know. I have just a quick question.

I have 30 hours of speech data from 20 speakers (1-2 hours each). Is this enough to train a model? Should I use one speaker with 15 hours (fewer speakers but more data points per speaker), or all 20 speakers with 30 hours (more speakers but fewer data points each)? Which is better? The dataset consists of mono WAV files between 2 and 14 seconds each, with a 24,000 Hz sampling rate. Is this the correct sampling rate?

Another idea to speed up training: should I fine-tune an existing model or train a new one from scratch? Is it better to train with 15 hours from one speaker first and then fine-tune with the multi-speaker data (20 speakers with less data each), or to use all 20 speakers' data from the start?

I saw a demo here. Can I use reference audio to clone a voice and generate speech in a new language with the same accent? For example, if I clone an English voice using a reference WAV file, can the cloned speaker speak another language with the English accent? If you have an example of this, it would be greatly appreciated.

Flux9665 commented 1 week ago

The sampling rate does not matter; it gets resampled automatically. The amounts of data you mention are both more than enough, for single-speaker as well as multi-speaker training. If you want to be able to exchange the speaker later, you should use the data with multiple speakers; if you don't care about that, you can also use the data from just a single speaker. Generally, if the quality of the data is better, the resulting model will sound better. Quality of the data has the biggest impact on the final result.

Finetuning from a pretrained checkpoint is generally much, much faster; your model will be done after just a few thousand steps. You can, however, get better quality if you train from scratch.

And regarding your last question, yes, you can do that. I have a demo for that here: https://huggingface.co/spaces/Flux9665/IMS-Toucan

To change the accent, you can call the selection of the accent:

https://github.com/DigitalPhonetics/IMS-Toucan/blob/1525e60054f84ffb94aed085ebad85dbb2357dc6/InferenceInterfaces/ToucanTTSInterface.py#L122

and the selection of the language of the text:

https://github.com/DigitalPhonetics/IMS-Toucan/blob/1525e60054f84ffb94aed085ebad85dbb2357dc6/InferenceInterfaces/ToucanTTSInterface.py#L119

separately, with different languages as arguments. To change the speaker, you can pass an audio file into this function:

https://github.com/DigitalPhonetics/IMS-Toucan/blob/1525e60054f84ffb94aed085ebad85dbb2357dc6/InferenceInterfaces/ToucanTTSInterface.py#L94
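Putting that together, a rough usage sketch (method names follow the linked interface at that commit; exact signatures, file paths, and language codes here are my assumptions and may differ between versions):

    # Clone an English speaker from a reference WAV, then read German text while
    # keeping the English accent. Paths and language codes are placeholders.
    from InferenceInterfaces.ToucanTTSInterface import ToucanTTSInterface

    tts = ToucanTTSInterface(device="cuda")
    tts.set_utterance_embedding("english_reference.wav")  # speaker cloning from reference audio
    tts.set_phonemizer_language("deu")                    # language of the input text
    tts.set_accent_language("eng")                        # accent to use for pronunciation
    tts.read_to_file(text_list=["Hallo Welt!"], file_location="cloned_output.wav")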

lpscr commented 1 week ago

This is so cool and amazing! I want to try it, but right now I'm training and my PC overheated.

I also want to say that when I used fine-tuning, the training was very fast and the quality was very good. Now I want to train from scratch to compare. I am also impressed with how fast it trains and how quickly it generates speech afterwards. Very nice. And the trained model is very small in the end, only 357 MB, while supporting so many languages, compared to other TTS models which are much larger.

I noticed one problem: the training crashed many times. I didn't keep the error message, but it happened in Utility/utils.py in plot_progress_spec_toucantts:

https://github.com/DigitalPhonetics/IMS-Toucan/blob/6bb437001f6c54672dc77395e6a61d0256e174bc/Utility/utils.py#L64

So I fixed it with an except clause, and now I can train without crashes. When I see the error message again, I'll send it:

    # imports needed for this fallback (at the top of Utility/utils.py)
    import os
    import shutil

    try:
        plot_code_spec(pitch, energy, sentence, durations, mel, os.path.join(save_dir, "visualization"), tf, step)
        return os.path.join(save_dir, "visualization", f"{step}.png")
    except Exception as e:
        # if plotting fails, fall back to a static placeholder image so training can continue
        shutil.copy(r"C:\PythonApps\ims_toucan\error.png", os.path.join(save_dir, "visualization", f"{step}.png"))
        print(f"An error occurred: {e}")
        return r"C:\PythonApps\ims_toucan\error.png"

Also, maybe the error is because I don't use the Matplotlib version you have in the requirements and use the latest one instead. I need to make sure of this first.

I see the branch FlowMatchingDecoderWithVariationalProsody, but I guess it's not ready yet? I look forward to any news and would like to try it when it's available.

I made a version with TensorBoard to easily see how the training is going. Here is how it looks:

(screenshots of the TensorBoard scalars, visualization, and audio tabs)
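For reference, such logging could look roughly like this (my own sketch with torch.utils.tensorboard, not the actual patch; loss_value, spec_image, waveform, and step are placeholders):

    # Log a scalar, the progress spectrogram, and a generated audio sample so
    # they show up in TensorBoard's scalars, images, and audio tabs.
    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter(log_dir="runs/toucan_finetune")
    writer.add_scalar("train/loss", loss_value, global_step=step)
    writer.add_image("train/progress_spec", spec_image, global_step=step)  # CHW image tensor
    writer.add_audio("train/sample", waveform, global_step=step, sample_rate=24000)
    writer.close()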

I forgot: you mentioned the sample rate is handled automatically. What is the correct sample rate for the WAV files? Currently I'm using files with a sample rate of 24,000 Hz. Would using 44,100 Hz result in better outcomes?

lpscr commented 1 week ago

Here is the error, like I said above. After 26,000 steps I get this error:

    An error occurred: PyCapsule_New called with null pointer
    selecting checkpoints...

in Utility/utils.py, in the function plot_progress_spec_toucantts, and the training crashes.

Just letting you know.

But I am not sure if the problem is that I use the latest version of Matplotlib.

To ignore this problem I use an except Exception as e: clause. I hope this doesn't hurt the training afterwards, but as you can see, it keeps creating checkpoints and the resulting files work fine.

Flux9665 commented 1 week ago

For the sampling rate: as long as you use the pretrained vocoder, anything above 16 kHz is fine. The spectrograms are extracted from 16 kHz audio and then, during inference, upsampled to 24 kHz by the vocoder.
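In other words, whatever rate the input files have, something like the following happens before feature extraction (a sketch using torchaudio, not the toolkit's actual code; the file path is a placeholder):

    # Any input sampling rate gets converted to the 16 kHz that spectrogram
    # extraction expects; the vocoder later outputs 24 kHz audio.
    import torchaudio

    wave, sr = torchaudio.load("speaker.wav")  # e.g. 24 kHz or 44.1 kHz input
    wave_16k = torchaudio.functional.resample(wave, orig_freq=sr, new_freq=16000)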

Yes, the FlowMatchingDecoderWithVariationalProsody branch is not ready yet. It won't take much longer though; I am training large models with it now and will make a release once I have tested them. It will probably take one or two more weeks.

Regarding the error: I have never seen it and I don't understand it, but since it happens at a point in the code where files are being written, I guess it has something to do with the way you execute the Python script. Since I can't reproduce it, I can't fix it, but since you handled it with the exception, I think it's fine.

I'm closing this issue, since the original problem was resolved. Good luck with your training; I hope the models turn out well!