erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and wav file maintenance. It can also be used with third-party software via JSON calls.
GNU Affero General Public License v3.0

CUDA out of memory. (RTX 4070 super) #195

Closed KuShiro189 closed 3 months ago

KuShiro189 commented 4 months ago

🔴 If you have installed AllTalk in a custom Python environment, I will only be able to provide limited assistance/support. AllTalk draws on a variety of scripts and libraries that are not written or managed by me, and they may fail, error, or give strange results in custom-built Python environments.

🔴 Please generate a diagnostics report and upload the "diagnostics.log" as this helps me understand your configuration.

diagnostics.log

Describe the bug
CUDA out of memory on any batch size, even on batch size 1 (RTX 4070 Super).

To Reproduce
Here are the parameters I attempted (every single one of them returned CUDA out of memory):
- the default settings
- 32 epochs, 16 batch size, 1 grad acc steps, 16 max permitted size of audio
- 24 epochs, 8 batch size, 1 grad acc steps, 8 max permitted size of audio
- 16 epochs, 2 batch size, 1 grad acc steps, 8 max permitted size of audio
- 8 epochs, 4 batch size, 2 grad acc steps, 8 max permitted size of audio
- 8 epochs, 4 batch size, 1 grad acc steps, 8 max permitted size of audio (the screenshot)
- 2 epochs, 1 batch size, 1 grad acc steps, 4 max permitted size of audio

Screenshots
[screenshot of the failed finetuning attempt]

Text/logs

Traceback (most recent call last):
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\trainer\trainer.py", line 1833, in fit
    self._fit()
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\trainer\trainer.py", line 1785, in _fit
    self.train_epoch()
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\trainer\trainer.py", line 1504, in train_epoch
    outputs, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\trainer\trainer.py", line 1360, in train_step
    outputs, loss_dict_new, step_time = self.optimize(
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\trainer\trainer.py", line 1288, in optimize
    optimizer.step()
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\torch\optim\lr_scheduler.py", line 75, in wrapper
    return wrapped(*args, **kwargs)
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\torch\optim\optimizer.py", line 385, in wrapper
    out = func(*args, **kwargs)
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\torch\optim\optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\torch\optim\adamw.py", line 187, in step
    adamw(
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\torch\optim\adamw.py", line 339, in adamw
    func(
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\torch\optim\adamw.py", line 608, in _multi_tensor_adamw
    exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 11.99 GiB of which 1.57 GiB is free. Of the allocated memory 7.50 GiB is allocated by PyTorch, and 186.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\AI\text-generation-webui-main\extensions\alltalk_tts\finetune.py", line 1376, in train_model
    config_path, original_xtts_checkpoint, vocab_file, exp_path, speaker_wav = train_gpt(language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path=str(output_path), max_audio_length=max_audio_length)
  File "C:\AI\text-generation-webui-main\extensions\alltalk_tts\finetune.py", line 617, in train_gpt
    trainer.fit()
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\trainer\trainer.py", line 1860, in fit
    remove_experiment_folder(self.output_path)
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\trainer\generic_utils.py", line 77, in remove_experiment_folder
    fs.rm(experiment_path, recursive=True)
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\fsspec\implementations\local.py", line 185, in rm
    shutil.rmtree(p)
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\shutil.py", line 787, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\shutil.py", line 634, in _rmtree_unsafe
    onerror(os.unlink, fullname, sys.exc_info())
  File "C:\AI\text-generation-webui-main\installer_files\env\Lib\shutil.py", line 632, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:/AI/text-generation-webui-main/extensions/alltalk_tts/finetune/tmp-trn/training/XTTS_FT-April-30-2024_10+11PM-ea551d3\trainer_0_log.txt'

Desktop (please complete the following information):
- AllTalk was updated: 3/18/2024
- Custom Python environment: text-generation-webui's Python environment, but I've also attempted it in my local Python environment and it returned the same error
- Text-generation-webUI was updated: 3/11/2024

Additional context
It seems like regardless of what parameters I set, it will always try to use the entire 12GB of VRAM, ignoring the 0.5-1GB used by other programs. I'd also like to know specifically how you pulled this off on your 4070, if possible. Thanks!
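For reference, the OOM message in the logs above suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce fragmentation. A minimal sketch of one way to apply it is below (a hypothetical wrapper script, not part of AllTalk; it assumes finetune.py is in the current directory, and it only helps with fragmentation, it does not add VRAM):

```python
# Hypothetical launcher: set the allocator hint from the error message before
# starting finetune.py, since PYTORCH_CUDA_ALLOC_CONF must be in the environment
# before PyTorch initialises CUDA. This mitigates fragmentation only; it cannot
# make more than the physical 12GB of VRAM available.
import os
import subprocess
import sys

env = os.environ.copy()
env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# The path to finetune.py is assumed; adjust it to your install.
subprocess.run([sys.executable, "finetune.py"], env=env, check=True)
```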

erew123 commented 4 months ago

Hi @KuShiro189

On the Ram & VRAM tab, have you checked the link to make sure the Nvidia Stable Diffusion memory settings aren't disabled? https://nvidia.custhelp.com/app/answers/detail/a_id/5490/~/system-memory-fallback-for-stable-diffusion

[screenshot]

This setting allows Windows machines to extend their VRAM into System RAM if needed. If it's been turned off, you can only use the 12GB of VRAM that you have.

Thanks

KuShiro189 commented 4 months ago

Appreciate the quick response!

I had not used Stable Diffusion before and had not changed anything about the memory settings. I did attempt to set both the python.exe in the text-gen-webui env and the global Nvidia settings to have the memory fallback on, and restarted the finetune.py script,

but the finetune script still gives the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB. GPU 0 has a total capacity of 11.99 GiB of which 4.93 GiB is free. Of the allocated memory 5.69 GiB is allocated by PyTorch, and 119.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

On a side note, on my RAM & VRAM tab the GPU information and VRAM do not show up; it only shows the system RAM, even though my GPU is still being utilized by the finetune script as far as I can see in Task Manager. Perhaps the issue is related to that? Edit: maybe I should try updating my alltalk_tts?

Sorry for taking long to reply; my PC crashed during another attempt like the above.

erew123 commented 4 months ago

Hi @KuShiro189

The article is called "Stable Diffusion memory fallback" by Nvidia, though the actual setting is "CUDA - Sysmem Fallback Policy" and it changes the way the Nvidia driver works with memory allocation. They should have called it something better and less confusing.

If you haven't, I would 110% suggest you check if that setting has been changed by something else as other applications can change it.

[screenshot of the CUDA - Sysmem Fallback Policy setting in the Nvidia Control Panel]

It's the only setting I know of for Windows that has any effect on VRAM memory allocation.

I've just run through a finetuning process to confirm all is working. The default behaviour you should see on a 12GB GPU is that, as the 1st epoch comes to an end, the Shared memory starts to increase. This is "CUDA - Sysmem Fallback Policy" in operation, allowing the GPU to use System RAM when it runs out of memory.

[screenshot: GPU memory usage at the end of the 1st epoch, with Shared memory increasing]

It only uses that memory for maybe 30-60 seconds, as it holds 3 x 5GB copies of the AI model in there while it shifts things around, before saving 2 of them off to disk. Once they are saved to disk, it releases that memory, so you see both your VRAM and Shared memory drop again:

[screenshot: VRAM and Shared memory dropping again after the model copies are saved to disk]

As you will note, those screenshots are both from a system running an RTX 4070 with 12GB of VRAM.
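If you'd rather watch those numbers than Task Manager, here is a small monitoring sketch (not part of AllTalk, just one assumed way to poll it) that you could run in a second console while finetuning; note that it opens its own small CUDA context, which itself takes a few hundred MB of VRAM:

```python
# Poll device-wide free/total VRAM once a second. torch.cuda.mem_get_info()
# reports memory for the whole GPU, so it also reflects the training process.
# If free VRAM sits near zero while training keeps running (and Task Manager's
# "Shared GPU memory" climbs), the Sysmem Fallback Policy is working; if the
# trainer raises OutOfMemoryError instead, the fallback never kicked in.
import time
import torch

while True:
    free_b, total_b = torch.cuda.mem_get_info(0)  # values in bytes
    used_gib = (total_b - free_b) / 1024**3
    total_gib = total_b / 1024**3
    print(f"VRAM used: {used_gib:5.2f} / {total_gib:.2f} GiB", end="\r", flush=True)
    time.sleep(1.0)
```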

And the only secret to making it run, that I know of, is to enable "CUDA - Sysmem Fallback Policy", as without that, your Nvidia driver will limit CUDA operations to the 12GB of VRAM built into your GPU.

So can you confirm you have checked that setting to see if it's been disabled, OR indeed set it to "Prefer Sysmem Fallback" to see if that changes things for you?

Thanks

KuShiro189 commented 4 months ago

Yep, I activated it:

[screenshots of the setting enabled]

but it did not seem to overflow into system memory (as can be seen from the peak VRAM usage: instead of overflowing into system RAM, it returns an error).

I'll give updating alltalk_tts a try and see if anything changes.

erew123 commented 4 months ago

Hi @KuShiro189

The only thing I can add as a thought is that I don't know how that value is activated/passed over to Python's PyTorch CUDA environment, meaning that when you change the setting, you will probably have to open a new command prompt and load a fresh Python environment. Using an already-open command prompt may not carry over the new setting, but I can't say for certain as I've not looked into its behaviour in that kind of detail.

RenNagasaki commented 4 months ago

@KuShiro189 What I've learned is that your system (C:) partition should have at least 20GB of free space. If that runs out, an OOM error seems to occur as well.

KuShiro189 commented 4 months ago

Thank you both for the input! And sorry for taking quite long to respond; it was midnight when I opened this issue.

I did attempt to start a new CMD and Python environment after I set the Nvidia settings, and still no luck; my GPU seems to refuse to overflow its memory into RAM. I updated both alltalk_tts and my Nvidia driver and restarted my entire computer, still no luck. I'm going to check the system variables and the BIOS settings as well.

Also, I have 121GB free on my SSD for now, so that shouldn't be the problem.

My thought is that my GPU somehow refuses to overflow its memory into system RAM, either because a factory setting prevents it or because something in the BIOS or system prevents it. I'm going to check everything for a while.

Once again, thank you both for your time!

KuShiro189 commented 4 months ago

No luck yet ;-; my GPU for some reason just does not want to overflow its memory into system RAM (even though I have plenty of free RAM), regardless of the setting in the Nvidia Control Panel. Not sure why, but I'd like to experiment more with this by purposefully loading something massive onto my GPU to troubleshoot the problem.
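A minimal sketch of that kind of stress test (hypothetical sizes, not AllTalk code): keep allocating 1 GiB tensors past the card's 12 GiB and see whether the driver falls back to system RAM or PyTorch raises an OOM error.

```python
# With "Prefer Sysmem Fallback" working, allocations should continue well past
# 12 GiB (much slower, and Task Manager's "Shared GPU memory" grows). Without
# it, torch.cuda.OutOfMemoryError fires at roughly the VRAM limit.
import torch

chunks = []
try:
    for i in range(24):  # request up to ~24 GiB in total
        # 1 GiB of float32 = 1 GiB / 4 bytes per element = 268,435,456 elements
        chunks.append(torch.empty(268_435_456, dtype=torch.float32, device="cuda"))
        print(f"allocated {i + 1} GiB on the GPU")
except torch.cuda.OutOfMemoryError:
    print(f"OOM after {len(chunks)} GiB: the driver did not fall back to system RAM")
finally:
    del chunks
    torch.cuda.empty_cache()
```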

In case this keeps going on, perhaps there's another way to do this? Maybe someone with a good GPU could help me finetune the model with my dataset?

erew123 commented 4 months ago

@KuShiro189 The CUDA - Sysmem Fallback Policy is applied at the Nvidia driver level, so (as far as I understand) there are no other settings that would impact this working. Though, as mentioned, other Windows applications can send instructions to the Nvidia driver; but if you have set Prefer Sysmem Fallback, that should force the setting on and nothing should be able to override it.

So as far as AllTalk's code goes, it just accesses the VRAM via Python. AllTalk sends requests to Python, and AllTalk has no concept of or access to control CUDA - Sysmem Fallback Policy or the Nvidia driver/memory allocation settings. All the AllTalk and finetune scripts do is request that something be stored in or removed from VRAM. There is nothing clever beyond that.
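For illustration only (this is not AllTalk's actual code, just a generic PyTorch sketch), the entirety of that kind of memory control looks like this:

```python
# The only memory "control" a Python application has: ask PyTorch to put
# something on the GPU or take it off again. Whether an allocation spills into
# shared system RAM is decided by the Nvidia driver (CUDA - Sysmem Fallback
# Policy), not by anything in Python.
import torch

layer = torch.nn.Linear(4096, 4096)  # stand-in for part of a model
layer = layer.to("cuda")             # "please store this in VRAM"

layer = layer.to("cpu")              # "please take it back out of VRAM"
del layer
torch.cuda.empty_cache()             # hand cached blocks back to the driver
```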

Python itself doesn't have that level of control either, which is why it's back to the Nvidia driver to extend or not extend into System RAM. This setting is also ONLY available on Windows; CUDA - Sysmem Fallback Policy is not available on Linux. I assume you aren't running text-gen-webui and AllTalk through Windows Subsystem for Linux (WSL)? I note that the setting doesn't pass over to WSL: https://github.com/microsoft/WSL/issues/11050 (according to the people who wrote that on the MS GitHub). I'm pretty sure you aren't using WSL, based on your diagnostics file, but I could be wrong.

The only suggestions I have at this point are:

1) Ensure CUDA - Sysmem Fallback Policy is set to Prefer Sysmem Fallback and nothing is resetting it. So check the setting after a reboot.

2) I can in no way see how the text-gen-webui Python environment could be impacting it, but you could set up AllTalk as a standalone copy elsewhere on your system and see if that suffers the same issue. Running as a standalone installation will build its own Python environment separate from that of Text-gen-webui and would negate any issues with the text-gen-webui Python environment, i.e. it's a more controlled environment.

3) I'm not sure what the largest LLM you have is, but in theory, if you load a 13b model into your VRAM and then load AllTalk without "Low VRAM" enabled, it should extend the AllTalk AI model into your Shared RAM, as long as the setting above is set. There are reasons people may or may not set that setting within the text-gen-webui environment, depending on which loaders they are using, e.g. https://github.com/oobabooga/text-generation-webui/discussions/5784 ; however, as far as I am aware, there is no way Text-gen-webui (or the LLM loaders) changes this or anything related.

I've had a general hunt around the internet and I can't think of or see any other routes to try to diagnose/resolve this. For various reasons I had to run about 8 finetuning sessions yesterday on the current finetune code from GitHub and I didn't encounter the out of memory issue once; everything behaved as expected. The only real difference between your system and mine is that you are on Windows 10 and I'm on 11, which shouldn't make the slightest bit of difference. And I was on an Nvidia driver two versions later than yours, but again, that shouldn't make a difference, and there were no bug fixes relating to memory management between those driver versions.

I can only suggest trying the above 3 things. Other than that, I am stumped for what else to try or suggest.

Thanks

KuShiro189 commented 3 months ago

Thank you so much for your time! Very much appreciated! It's all good on my end for now :)