erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and wav file maintenance. It can also be used with 3rd party software via JSON calls.
GNU Affero General Public License v3.0

Crazy VRAM Usage | Out of CUDA Memory 4090, regardless of batch size. #189

Closed RenNagasaki closed 4 months ago

RenNagasaki commented 4 months ago

🔴 If you have installed AllTalk in a custom Python environment, I will only be able to provide limited assistance/support. AllTalk draws on a variety of scripts and libraries that are not written or managed by myself, and they may fail, error, or give strange results in custom-built Python environments.

🔴 Please generate a diagnostics report and upload the "diagnostics.log" as this helps me understand the issue.

diagnostics.log

Describe the bug Out of CUDA memory, regardless of the batch size. I've got a 4090.

To Reproduce I start a finetuning process with a batch size of 12, 16, 20, 24 or 32. The moment even just 1 GB of VRAM is already in use by plain Windows, I will at some point get the out of CUDA memory error. To me it sounds like the process simply assumes it has all 24 GB of VRAM to use, but some will always be in use by something else.

Screenshots Bild_2024-04-26_112328443, Bild_2024-04-26_112451024 (attached)

Text/logs Will add a log the next time I get this error.

Desktop (please complete the following information):
AllTalk was updated: 25.04.2024
Custom Python environment: no
Text-generation-webUI was updated: Standalone

Additional context To me it seems like the batch size has no influence at all on the VRAM used. Started via start_finetune.bat.

Kind regards.

erew123 commented 4 months ago

Hi @RenNagasaki

Frustratingly, I don't personally have a GPU with that much VRAM to test in that way. However, yes, Windows will use VRAM for other purposes, so keeping the system/VRAM just for finetuning at that point in time would yield the best results.

That aside, many of the scripts called upon were written by Coqui, and that obviously spreads further out into transformers and so on. Typical peak memory use I have seen with the finetuning process is around 14-16 GB; however, there can be variance based on the number of sample files you are throwing at it.

I assume it is on Step 2 that you are having the issue. I'm happy to take a look at the error if you can provide one and see if there is something I can optimize. I would suggest opening Task Manager in the background while finetuning and setting it to Performance > GPU so that the Dedicated GPU memory usage is populating. That would at least give us something to go on if you can catch it when it errors out, e.g. we would be able to see if it has actually filled your whole 24 GB of VRAM. It would be handy to see the log, to know how many epochs it had reached in the training, and to get a general roundup of the settings you used (including how many audio samples in MB/GB).
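If Task Manager is awkward to keep watching, a rough little watcher like the one below can log the dedicated VRAM usage every few seconds while finetuning runs. This is just an illustration and not part of AllTalk; it assumes the nvidia-ml-py (pynvml) package is installed.

```python
import time

import pynvml  # provided by the nvidia-ml-py package (assumed installed)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0, i.e. the 4090
try:
    while True:  # stop with Ctrl+C
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        used_gb = info.used / 1024**3
        total_gb = info.total / 1024**3
        print(f"dedicated VRAM used: {used_gb:.1f} GB / {total_gb:.1f} GB")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
```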

Thanks

RenNagasaki commented 4 months ago

Hi @erew123

Thanks for responding so fast. I added a screenshot showing the GPU usage and my Step 2 settings. As you can see, it's getting completely utilized at batch size 12. I'm using around 170 audio files as samples, 84 MB altogether. (They are all about 1 to 2 sentences long.)

erew123 commented 4 months ago

Hi @RenNagasaki

Thanks! I see that it's extending well over into your system RAM, which is how I get away with a 12 GB card on a 4-epoch training (needing the 14-16 GB I mentioned). It may well just be the Coqui scripts and specific to how it trains this model.

But if you can throw me an error/crash log at some point from the console dump, it would allow me to try to pinpoint which script/area is failing and see whether there's anything to change, or if this is just a case of having to lower the number of epochs.

Thanks

RenNagasaki commented 4 months ago

Hey @erew123, here is a log of the crash: finetune.log

RenNagasaki commented 4 months ago

Do you know of a way to tell it to only use, say, 22 GB of VRAM? Would it be possible to add such a slider to Step 2? Like: 1GB------Max VRAM. My hardware should be more than powerful enough to do this with a batch size of 12.

erew123 commented 4 months ago

Hi @RenNagasaki

So I've had a hunt through AllTalk's code and also the Coqui main training script here.

I'll start by saying there is no way to simply cap memory use; the scripts, all the way down to transformers, have no functionality for doing this.


RE: "My hardware should be more than powerful enough to do this with a batch size of 12." I can't comment on that; these are Coqui's scripts, they aren't the same as training an LLM, nor can I say how well optimised they are.


I may be able to introduce an option for "automatic mixed precision (AMP) training", which may reduce memory use by up to 50% at best. However, there are possible knock-on effects here, and I'm not currently sure how they could affect the final model... I will have to look into this more.
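To show what I mean by AMP, here is a minimal sketch in plain PyTorch. This is not the Coqui trainer code itself, just the general pattern an AMP option would follow: the forward pass runs largely in float16 inside autocast, while a GradScaler keeps the float16 gradients from underflowing.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(256, 256).to(device)          # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for step in range(10):
    x = torch.randn(8, 256, device=device)      # stand-in for a real batch
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = model(x).pow(2).mean()            # dummy loss for illustration
    scaler.scale(loss).backward()                # scaled backward pass
    scaler.step(optimizer)                       # unscales grads, then steps
    scaler.update()
```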


Another option is to change the gradient accumulation (in the interface). Hoping to simplify the explanation of this, I asked ChatGPT to give a cleaner-sounding answer:

Gradient accumulation is a technique that allows you to simulate a larger batch size without actually increasing the memory consumption. Here's how it works:

  1. Instead of updating the model's parameters after every batch, gradient accumulation enables you to process multiple batches and accumulate their gradients.
  2. By setting the grad_accum_steps to a value greater than 1, you can process multiple batches before performing a single optimization step. For example, if grad_accum_steps is set to 4, the gradients will be accumulated over 4 batches before updating the model's parameters.
  3. This means that you can effectively use a larger batch size while keeping the actual batch size per iteration smaller, thereby reducing the memory footprint.
  4. Increasing the grad_accum_steps allows you to find a balance between memory consumption and computational efficiency. You can process more examples per optimization step without exceeding the available memory.

However, it's important to note that increasing the gradient accumulation steps does have an impact on the training process:

  1. Since the model's parameters are updated less frequently (every grad_accum_steps batches), the training dynamics may be slightly different compared to updating the model after every batch.
  2. You may need to adjust the learning rate and other hyperparameters accordingly to compensate for the less frequent updates. Typically, you can increase the learning rate slightly when using gradient accumulation (will be adding learning rate soon).
  3. The training progress will be slower in terms of the number of optimization steps per epoch, as the model updates occur less frequently. However, the overall training time may still be reduced compared to using a smaller batch size without gradient accumulation, as it allows for better utilization of GPU resources.

To start, you can try setting grad_accum_steps to a value like 4 and see if it resolves the OOM error. If the error persists, you can experiment with higher values until you find a balance that works for your specific setup.

So maybe you would like to try increasing the gradient accumulation steps and see how that affects your memory?
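To make that concrete, here is a toy example in plain PyTorch (not Coqui's actual trainer, just the idea behind grad_accum_steps): with grad_accum_steps set to 4 and a per-iteration batch of 3, the optimiser only steps every 4 batches, so the effective batch size is 12 while the memory footprint stays that of a batch of 3.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 128).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

grad_accum_steps = 4
optimizer.zero_grad()
for step in range(40):
    x = torch.randn(3, 128, device=device)            # small per-iteration batch
    loss = model(x).pow(2).mean() / grad_accum_steps   # average over the accumulation window
    loss.backward()                                    # gradients accumulate in .grad
    if (step + 1) % grad_accum_steps == 0:
        optimizer.step()                               # one update per 4 batches
        optimizer.zero_grad()
```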

Thanks

RenNagasaki commented 4 months ago

Mhhhm. Well, it sounds to me like I'll just accept that it crashes sometimes, because I do not want to reduce the quality of the training. The third try went through smoothly without any other settings changed.

Thanks for your timely help in this!

erew123 commented 4 months ago

No probs. I will be posting an update to the finetuning in a short while (an hour or so if all goes well) which will help improve a few bits. You may want to hang back a short while on training and get the update when I post it out.

Thanks

RenNagasaki commented 4 months ago

@erew123, while I have you here: do you have any experience with the significance of the grammar in the metadata_eval and train files? The text generated inside there is kind of wrong in some places. Should I fix that before finetuning or leave it as is?

For example in German it sometimes reads "Sylphen" as "Söwen"

erew123 commented 4 months ago

If it's autogenerated from Step 1 (Whisper) and you want it 100% right, then yes, it's probably best to fix it. Whisper is a best effort at automating the transcription, but it's never 100% accurate. E.g. in English, when it's picking between there, their, and they're, it will choose the one it thinks is correct and put that into the training data as the example for the sound it's listening to. The model will be trained that way, so you won't get as clean a result when using that text in TTS generation later down the line.

RenNagasaki commented 4 months ago

Ahhh, thought so. Thanks for clarifying. Still getting the hang of it.

erew123 commented 4 months ago

The update is up. You may want to git pull. It gives a few extra options.

Thanks

RenNagasaki commented 4 months ago

Just updated, thanks! @erew123, if I may ask: with your 4070, which settings do you normally use?

erew123 commented 4 months ago

Huh, well, to be honest, since about 3 weeks after I actually started writing code for AllTalk, I never get time to actually use it anymore. My LLM use and other such things have dropped off a cliff.

When I have used it, though, I've just stuck with the defaults and they have been fine for the training I've wanted to do. 10 minutes of audio takes about 20 minutes to train with 10 epochs at the default settings, and the result has been fine.

RenNagasaki commented 4 months ago

Okay, now I'm getting confused. I thought a higher batch size meant faster and more accurate training? I have trained 2 different voices now. One has about 7-8 minutes of samples, which takes over an hour. The other has about 16 minutes and trains for well over 2 hours, with everything on default except the batch size.

erew123 commented 4 months ago

I can only think the difference I'm on about here is that when I say 10 minutes of audio, I mean 10 minutes before it's split down into individual wav files. That could be the difference, perhaps, as my 10 minutes may actually get broken down into 3 minutes of total audio (individual wav files) after Step 1. It's been that long, I haven't actually checked.

Typically I find each epoch stage takes the same amount of time, so whether you are doing 1 or 10 epochs of however much audio you have, it is pretty linear, e.g. 10 epochs will take 10x what 1 epoch would take... But I would have to run a whole training session at some point to narrow down the precise amount of audio in the split-out wav files (in seconds) and then see how long that takes for 1x epoch.

So do you have 7-8 minutes of actual audio when you look in the \alltalk_tts\finetune\tmp-trn\wavs folder? Is that what you mean?

RenNagasaki commented 4 months ago

> I can only think the difference I'm on about here is that when I say 10 minutes of audio, I mean 10 minutes before it's split down into individual wav files. That could be the difference, perhaps, as my 10 minutes may actually get broken down into 3 minutes of total audio (individual wav files) after Step 1. It's been that long, I haven't actually checked.
>
> Typically I find each epoch stage takes the same amount of time, so whether you are doing 1 or 10 epochs of however much audio you have, it is pretty linear, e.g. 10 epochs will take 10x what 1 epoch would take... But I would have to run a whole training session at some point to narrow down the precise amount of audio in the split-out wav files (in seconds) and then see how long that takes for 1x epoch.

Okay, that's weird. I have some epochs that take 5 minutes, others over 30.

> So do you have 7-8 minutes of actual audio when you look in the \alltalk_tts\finetune\tmp-trn\wavs folder? Is that what you mean?

Yeah, actual audio, split into files 1 or 2 sentences long.

RenNagasaki commented 4 months ago

@erew123, is there a maximum amount of audio data you'd recommend for training?

erew123 commented 4 months ago

Hi @RenNagasaki

Hah, well.... Here are the Coqui docs on datasets and training:

https://docs.coqui.ai/en/latest/what_makes_a_good_dataset.html#what-makes-a-good-dataset

https://docs.coqui.ai/en/latest/faq.html

That aside, I've never seen a definitive answer. I know if you are training an entirely new language you need about 4 hours of audio and 1000 epochs. So definitely less than that.

These are the 2x files I use for testing (see the attached screenshot).

The larger file gets broken into about 110 wav files and the smaller into about 75 wav files. And this would be the one I suggested takes about 20 minutes to process.

It's a human voice, in English, and with 10 epochs it trains quite well, i.e. it sounds pretty good after those 10 epochs, though different samples you select when you generate can result in slightly different quality of results. I assume that's because it's similar to other human voices the model can already generate.

If you are training, say, a cartoon character's voice, you may want to use more samples, as it may not quite sound like a human voice, and you may need longer training.

I don't think you need a huge dataset, more a varied one in terms of how a person emotes, pauses during speech etc.

I have 2x other things that I have thought about since our earlier messages:

1) EDIT - I TAKE THIS BACK. This one was only for Whisper and not the actual training portion. I remembered that the actual training SHOULD be on Float16 and not Float32 (as it was originally). So I am going to add Float16 back as the default with a checkbox for Float32. The only reason for Float32 was compatibility with other RTX cards; it was introduced in this pull request https://github.com/erew123/alltalk_tts/pull/114 and I didn't consider the potential memory impact at the time. So I will set Float16 as the default (which should reduce memory overhead) and leave a checkbox for Float32. I'll try to get that done shortly.

2) I believe you mentioned you get some larger WAV files after the splitting in Step 1, e.g. 2 minutes or longer (in the wavs folder). I'm wondering if these might be what causes some of your epochs to take longer than others. When training is running, it can only use up to the maximum length of audio specified in the settings (Step 2), so to achieve that it has to truncate larger wavs and select a section from them. It may be worth manually splitting your larger audio files down, updating the training CSV files, and seeing if that improves your performance; there's a rough snippet below for spotting the over-long files.
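Something quick and dirty like this would list the over-long wavs. It only uses the Python standard library and is not part of AllTalk; the 11-second threshold is just an example, so use whatever max audio length you actually set in Step 2, and adjust the folder path to your install location.

```python
import wave
from pathlib import Path

wav_dir = Path(r"alltalk_tts\finetune\tmp-trn\wavs")  # adjust to your install location
max_seconds = 11.0                                    # example threshold, match your Step 2 setting

for wav_path in sorted(wav_dir.glob("*.wav")):
    with wave.open(str(wav_path), "rb") as wav_file:
        seconds = wav_file.getnframes() / wav_file.getframerate()
    if seconds > max_seconds:
        print(f"{wav_path.name}: {seconds:.1f}s")
```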

RenNagasaki commented 4 months ago

@erew123, do you have any experience regarding realtime (streaming) generation on something like a GTX 1650?

erew123 commented 4 months ago

Not specifically, no. But you mean the 4 GB card? I may or may not be able to give you an answer, depending on what the technical question is.

RenNagasaki commented 4 months ago

I'm just trying to figure out what kind of GPU is needed to generate TTS in realtime for end users, not to train.

erew123 commented 4 months ago

Well, it should generate TTS; how quickly I cannot say. Obviously DeepSpeed will be a benefit, and if you are doing much of anything else on that GPU, say loading an LLM or lots of graphics, then I would use the low VRAM mode to shift the TTS model in and out of the GPU memory on the fly. However, that could (on very old systems) introduce some latency with generation, depending on how well their PCI bus is working and how fast their system RAM is. Certainly I know people have used AllTalk on those kinds of cards, but I have no idea how fast it is.
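For context, the low VRAM mode is essentially doing something along these lines. This is a simplified sketch of the idea, not AllTalk's actual code: keep the model in system RAM and only shuffle it onto the GPU for the duration of a generation.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 512)                 # stand-in for the XTTS model, kept in system RAM

def generate(batch: torch.Tensor) -> torch.Tensor:
    model.to(device)                        # shuffle the weights onto the GPU
    try:
        with torch.no_grad():
            return model(batch.to(device)).cpu()
    finally:
        model.to("cpu")                     # hand the VRAM back (e.g. to the LLM)
        if device == "cuda":
            torch.cuda.empty_cache()

print(generate(torch.randn(1, 512)).shape)
```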

The main issue I have seen with older systems is the load time of the XTTS model from mechanical hard drives. I've had people on 8+ year old systems with mechanical hard drives (and, I assume, older RAM of course) saying it takes up to 2 minutes to load in. I suspect that is also because they have something in their GPU VRAM already, such as an LLM, and on Windows you have the Nvidia driver behaviour where it extends your GPU RAM into system RAM, so it's extra slow as it figures out how it wants to shift the layers of 2x AI models around. https://nvidia.custhelp.com/app/answers/detail/a_id/5490/~/system-memory-fallback-for-stable-diffusion

Again, when I introduce something like Piper into the v2 of AllTalk, that could be a fallback option for a slower GPU (if necessary).

RenNagasaki commented 4 months ago

But Piper isn't compatible with Coqui, or is it? As far as I understand, Piper uses VITS -> output as ONNX, but Coqui uses XTTS. Or is there a way I could convert the XTTS model to ONNX for Piper to use?

erew123 commented 4 months ago

As far as I know, there will be no way to convert a model to Piper.

What I am on about is that there will be the option to use/install different actual TTS generation engines within AllTalk. So you could tell it to use XTTS models, or Piper models, or potentially other models (depending on how easily I manage to figure out the coding of it).

So the API calls to AllTalk will remain the same, but the underlying TTS engine can be swapped out to use something else.

So what I'm getting at is, you wouldn't have to re-code as such; you could use another engine within AllTalk as a fallback option if necessary... but this is a little way off at the moment, as I've not started that portion of the coding yet.
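Very roughly, the idea is something like the sketch below. It is purely illustrative; none of these class or method names exist in AllTalk today, it just shows how the API layer can stay the same while the engine behind it is swapped out.

```python
from abc import ABC, abstractmethod

class TTSEngine(ABC):
    @abstractmethod
    def generate(self, text: str, voice: str) -> bytes:
        """Return rendered audio for the given text and voice."""

class XTTSEngine(TTSEngine):
    def generate(self, text: str, voice: str) -> bytes:
        return b"..."  # would call the Coqui XTTS model here

class PiperEngine(TTSEngine):
    def generate(self, text: str, voice: str) -> bytes:
        return b"..."  # would call a Piper/VITS ONNX model here

def handle_api_request(engine: TTSEngine, text: str, voice: str) -> bytes:
    # The JSON/API layer only ever talks to the abstract interface,
    # so existing callers don't need to change when the engine does.
    return engine.generate(text, voice)
```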

RenNagasaki commented 4 months ago

Yeah, that actually sounds awesome, but my experience with Piper is that the trained models are extremely subpar when you don't have that many voice samples.

erew123 commented 4 months ago

Sure, I'm just picking Piper as it should be an easy one to integrate and to figure out the basics with. I actually have a list of about 20 TTS engines to add... if I can get it working correctly. Though it's a complicated task, as there's a lot to consider with handling the downloading/setting up of each engine, voices, generation requests, etc. So starting with a simple, well-documented one was my plan.

RenNagasaki commented 4 months ago

Yeah, I can totally understand that.

I'll try an older Nvidia card myself soon to see how it fares.