d8ahazard / sd_dreambooth_extension

[Bug]: Instant OOM with 3090 Ti, with any combination of settings #1311

Closed David-337 closed 1 year ago

David-337 commented 1 year ago

Is there an existing issue for this?

What happened?

Around a month ago everything was working fine, but after updating A1111 and the extension to their latest main-branch commits, I get an instant OOM on my 24 GB 3090 Ti whenever I try to train a model, even with settings that worked for me a month ago.

I have tried:

Steps to reproduce the problem

  1. Fresh install A1111 and Dreambooth extension
  2. Create model
  3. Attempt to train with any settings

Commit and libraries

Initializing Dreambooth
Dreambooth revision: c2a5617c587b812b5a408143ddfb18fc49234edf
Successfully installed accelerate-0.19.0 fastapi-0.94.1 gitpython-3.1.32 transformers-4.30.2

Does your project take forever to startup?
Repetitive dependency installation may be the reason.
Automatic1111's base project sets strict requirements on outdated dependencies.
If an extension is using a newer version, the dependency is uninstalled and reinstalled twice every startup.
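For context, the churn described in that banner is just a startup version check: if two components pin different versions of the same package, pip uninstalls and reinstalls it on every launch. A minimal sketch of the pattern (the function name and the pinned version below are illustrative, not the extension's actual code):

```python
import subprocess
import sys
from importlib.metadata import PackageNotFoundError, version

# Hypothetical startup check: if the installed version differs from the one a
# component pins, pip reinstalls it. When the base webui and an extension pin
# different versions of the same package, this fires on every single launch.
def ensure_version(package: str, required: str) -> None:
    try:
        installed = version(package)
    except PackageNotFoundError:
        installed = None
    if installed != required:
        subprocess.run(
            [sys.executable, "-m", "pip", "install", f"{package}=={required}"],
            check=True,
        )

ensure_version("accelerate", "0.19.0")  # version taken from the log above
```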

[+] xformers version 0.0.20 installed.
[+] torch version 2.0.1+cu118 installed.
[+] torchvision version 0.15.2+cu118 installed.
[+] accelerate version 0.19.0 installed.
[+] diffusers version 0.16.1 installed.
[+] transformers version 4.30.2 installed.
[+] bitsandbytes version 0.35.4 installed.

Command Line Arguments

--xformers (although I also tried without)

Console logs

Initializing dreambooth training...
Pre-processing images: cropped-squares-512: 4it [00:00, 165.72it/s]
Nothing to generate.
Enabling xformers memory efficient attention for unet
Compiled unet
Exception importing 8bit AdamW: python3: undefined symbol: cudaRuntimeGetVersion
Traceback (most recent call last):
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/optimization.py", line 579, in get_optimizer
    from bitsandbytes.optim import AdamW8bit
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/bitsandbytes/__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 5, in <module>
    import bitsandbytes.functional as F
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/bitsandbytes/functional.py", line 13, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/bitsandbytes/cextension.py", line 113, in <module>
    lib = CUDASetup.get_instance().lib
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/bitsandbytes/cextension.py", line 109, in get_instance
    cls._instance.initialize()
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/bitsandbytes/cextension.py", line 59, in initialize
    binary_name, cudart_path, cuda, cc, cuda_version_string = evaluate_cuda_setup()
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py", line 125, in evaluate_cuda_setup
    cuda_version_string = get_cuda_version(cuda, cudart_path)
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py", line 45, in get_cuda_version
    check_cuda_result(cuda, cudart.cudaRuntimeGetVersion(ctypes.byref(version)))
  File "/usr/lib64/python3.10/ctypes/__init__.py", line 387, in __getattr__
    func = self.__getitem__(name)
  File "/usr/lib64/python3.10/ctypes/__init__.py", line 392, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: python3: undefined symbol: cudaRuntimeGetVersion
python3: undefined symbol: cudaRuntimeGetVersion
WARNING: Using default optimizer (AdamW from Torch)
Found 0 reg images.
Preparing dataset...
Init dataset!
Preparing Dataset (With Caching)
Loading cached latents...
Bucket 0 (512, 512, 0) - Instance Images: 4 | Class Images: 0 | Max Examples/batch: 4
Total Buckets 1 - Instance Images: 4 | Class Images: 0 | Max Examples/batch: 4
Total images / batch: 4, total examples: 4
Total dataset length (steps): 4
Initializing bucket counter!
Steps:   0%| 0/400 [00:00<?, ?it/s]
OOM Detected, reducing batch/grad size to 0/1.
Traceback (most recent call last):
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/memory.py", line 119, in decorator
    return function(batch_size, grad_size, prof, *args, **kwargs)
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 1355, in inner_loop
    optimizer.step()
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/accelerate/optimizer.py", line 140, in step
    self.optimizer.step(closure)
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/torch/optim/adamw.py", line 171, in step
    adamw(
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/torch/optim/adamw.py", line 321, in adamw
    func(
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/torch/optim/adamw.py", line 566, in _multi_tensor_adamw
    denom = torch._foreach_add(exp_avg_sq_sqrt, eps)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 23.69 GiB total capacity; 22.18 GiB already allocated; 70.50 MiB free; 22.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps:   0%| 0/400 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/ui_functions.py", line 729, in start_training
    result = main(class_gen_method=class_gen_method)
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 1548, in main
    return inner_loop()
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/memory.py", line 117, in decorator
    raise RuntimeError("No executable batch size found, reached zero.")
RuntimeError: No executable batch size found, reached zero.
Restored system models.
Duration: 00:00:04
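
For readers hitting the same wall: the "OOM Detected, reducing batch/grad size" and "No executable batch size found, reached zero" lines come from the extension's retry wrapper in dreambooth/memory.py, which re-runs the training loop with a smaller batch size after each CUDA OOM and gives up at zero. A minimal sketch of that pattern (simplified, not the extension's exact code):

```python
import functools

import torch

# Simplified sketch of an OOM-retry wrapper: halve the batch size on each
# CUDA OOM and give up when it reaches zero, as seen in the log above.
def find_executable_batch_size(starting_batch_size: int):
    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            batch_size = starting_batch_size
            while batch_size > 0:
                try:
                    return function(batch_size, *args, **kwargs)
                except torch.cuda.OutOfMemoryError:
                    torch.cuda.empty_cache()
                    batch_size //= 2
                    print(f"OOM detected, reducing batch size to {batch_size}.")
            raise RuntimeError("No executable batch size found, reached zero.")
        return wrapper
    return decorator
```

Note that in this log the retry could never succeed: roughly 22 GiB were already allocated before the optimizer step at the smallest batch size, so the failure is environmental rather than a batch-size problem. The allocator hint in the error message (launching with PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512) only mitigates fragmentation, not genuine exhaustion of a 24 GB card.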

Additional information

I'm using Fedora 38 with CUDA 11.8 and NVIDIA driver 535 (although I also tried downgrading to 525).
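
Regarding the "undefined symbol: cudaRuntimeGetVersion" error earlier in the log: it indicates that bitsandbytes 0.35.4 failed to locate a usable CUDA runtime library and ended up resolving the symbol against the Python binary itself. A quick way to probe what it would find, sketched under the assumption that the runtime is named libcudart.so (the name and path vary by system):

```python
import ctypes

# Probe whether a loadable CUDA runtime exposes the symbol bitsandbytes needs.
# "libcudart.so" is an assumed name; some installs only ship a versioned file
# (e.g. /usr/local/cuda/lib64/libcudart.so.11.0), so adjust the path as needed.
try:
    cudart = ctypes.CDLL("libcudart.so")
    version = ctypes.c_int(0)
    cudart.cudaRuntimeGetVersion(ctypes.byref(version))
    print(f"CUDA runtime version: {version.value}")  # 11080 would mean 11.8
except (OSError, AttributeError) as err:
    # OSError: library not found; AttributeError: symbol missing (as in the log)
    print(f"CUDA runtime probe failed: {err}")
```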

grantrosario commented 1 year ago

Also having the same issue with an RTX 2080 Ti (11 GB VRAM).

David-337 commented 1 year ago

Update: solved by fully nuking the NVIDIA drivers, CUDA toolkit, and any other NVIDIA- or CUDA-related packages from the system, then reinstalling the latest NVIDIA driver from RPM Fusion and CUDA 11.8 from https://developer.download.nvidia.com/compute/cuda/repos/fedora35/x86_64/.
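
For anyone applying the same fix, a quick sanity check from the webui's venv can confirm the reinstall took (the expected values here match this particular torch 2.0.1+cu118 setup):

```python
import torch

# Post-reinstall sanity check: PyTorch should see the GPU and report the
# CUDA runtime it was built against (11.8 for torch 2.0.1+cu118).
print(torch.cuda.is_available())       # expect True
print(torch.version.cuda)              # expect "11.8"
print(torch.cuda.get_device_name(0))   # expect the 3090 Ti

# The import that failed in the log above should now succeed as well:
from bitsandbytes.optim import AdamW8bit  # noqa: F401
print("bitsandbytes 8-bit AdamW import OK")
```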