d8ahazard / sd_dreambooth_extension

1.86k stars · 282 forks

[Bug]: Completely unable to train any LORA with CUDA out of memory error #1457

Closed daszzzpg closed 6 months ago

daszzzpg commented 8 months ago

Is there an existing issue for this?
  • I have searched the existing issues and checked the recent builds/commits of both this extension and the webui

What happened?

I was trying to use the A1111 Dreambooth extension to train an SDXL model but failed (4070 Ti, 12 GB). Originally it was super slow, so I searched the internet and disabled NVIDIA's system memory fallback option. Then it shows a CUDA out of memory error like the one below.

However, when I switched to an SD1.5 model, it still gives me this error!

Steps to reproduce the problem

  1. Pick either a SD1.5 or SDXL model
  2. Create
  3. Train
  4. Error

Commit and libraries

Starting at Initializing Dreambooth and ending several lines below at [+] bitsandbytes version 0.35.4 installed.

Command Line Arguments

set COMMANDLINE_ARGS=--no-gradio-queue --no-half-vae --xformers --medvram

Console logs

OM Detected, reducing batch/grad size to 0/2.█████████████▊         | 4/5 [00:00<00:00,  5.44it/s]
Traceback (most recent call last):
  File "G:\AI\SDNEW\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\memory.py", line 126, in decorator
    return function(batch_size, grad_size, prof, *args, **kwargs)
  File "G:\AI\SDNEW\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 477, in inner_loop
    unet.to(accelerator.device, dtype=weight_dtype)
  File "G:\AI\SDNEW\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1145, in to
    return self._apply(convert)
  File "G:\AI\SDNEW\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "G:\AI\SDNEW\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "G:\AI\SDNEW\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "G:\AI\SDNEW\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 820, in _apply
    param_applied = fn(param)
  File "G:\AI\SDNEW\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.99 GiB total capacity; 10.95 GiB already allocated; 0 bytes free; 11.15 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Loading unet...:  80%|████████████████████████████████████▊         | 4/5 [00:02<00:00,  1.82it/s]
Traceback (most recent call last):
  File "G:\AI\SDNEW\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\ui_functions.py", line 735, in start_training
    result = main(class_gen_method=class_gen_method)
  File "G:\AI\SDNEW\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 1976, in main
    return inner_loop()
  File "G:\AI\SDNEW\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\memory.py", line 124, in decorator
    raise RuntimeError("No executable batch size found, reached zero.")
RuntimeError: No executable batch size found, reached zero.
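The allocator hint at the end of the OOM message suggests fragmentation mitigations. A hedged sketch of the usual workarounds for this class of error (untested against this setup; the `max_split_size_mb` value and the switch from `--medvram` to `--lowvram` are suggestions, not a confirmed fix — on Windows these would go in `webui-user.bat` using `set` instead of `export`):

```shell
# Suggestion only, not a confirmed fix: tell PyTorch's CUDA caching allocator
# to cap block splits, which the error message recommends when reserved >> allocated.
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:128"

# --lowvram offloads more aggressively than --medvram, trading speed for VRAM headroom.
export COMMANDLINE_ARGS="--no-gradio-queue --no-half-vae --xformers --lowvram"

echo "$PYTORCH_CUDA_ALLOC_CONF"
```

Both variables must be set before the webui process starts, since `PYTORCH_CUDA_ALLOC_CONF` is read when PyTorch initializes CUDA.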

Additional information

No response

d8ahazard commented 8 months ago

Yeah, there's really not a lot I can do about running out of VRAM.


github-actions[bot] commented 7 months ago

This issue is stale because it has been open for 14 days with no activity. Remove the stale label or comment, or this will be closed in 30 days.