d8ahazard / sd_dreambooth_extension

[Bug]: Instant OOM with 3090 Ti, with any combination of settings #1311

Closed David-337 closed 1 year ago

David-337 commented 1 year ago

Is there an existing issue for this?

What happened?

Around a month ago everything was working fine, but after updating A1111 and the extension to their latest main-branch commits, I get an instant OOM on my 24 GB 3090 Ti whenever I try to train a model, even with settings that worked for me a month ago.

I have tried:

Steps to reproduce the problem

  1. Fresh install A1111 and Dreambooth extension
  2. Create model
  3. Attempt to train with any settings

Commit and libraries

Initializing Dreambooth
Dreambooth revision: c2a5617c587b812b5a408143ddfb18fc49234edf
Successfully installed accelerate-0.19.0 fastapi-0.94.1 gitpython-3.1.32 transformers-4.30.2

Does your project take forever to startup?
Repetitive dependency installation may be the reason.
Automatic1111's base project sets strict requirements on outdated dependencies.
If an extension is using a newer version, the dependency is uninstalled and reinstalled twice every startup.
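For context, the churn described in that banner is just a startup version check: if two components pin different versions of the same package, pip uninstalls and reinstalls it on every launch. A minimal sketch of the pattern (the function name and the pinned version below are illustrative, not the extension's actual code):

```python
import subprocess
import sys
from importlib.metadata import PackageNotFoundError, version

# Hypothetical startup check: if the installed version differs from the one a
# component pins, pip reinstalls it. When the base webui and an extension pin
# different versions of the same package, this fires on every single launch.
def ensure_version(package: str, required: str) -> None:
    try:
        installed = version(package)
    except PackageNotFoundError:
        installed = None
    if installed != required:
        subprocess.run(
            [sys.executable, "-m", "pip", "install", f"{package}=={required}"],
            check=True,
        )

ensure_version("accelerate", "0.19.0")  # version taken from the log above
```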

[+] xformers version 0.0.20 installed.
[+] torch version 2.0.1+cu118 installed.
[+] torchvision version 0.15.2+cu118 installed.
[+] accelerate version 0.19.0 installed.
[+] diffusers version 0.16.1 installed.
[+] transformers version 4.30.2 installed.
[+] bitsandbytes version 0.35.4 installed.

Command Line Arguments

--xformers (although I also tried without)

Console logs

Initializing dreambooth training...
Pre-processing images: cropped-squares-512: 4it [00:00, 165.72it/s]
Nothing to generate.
Enabling xformers memory efficient attention for unet
Compiled unet
Exception importing 8bit AdamW: python3: undefined symbol: cudaRuntimeGetVersion
Traceback (most recent call last):
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/optimization.py", line 579, in get_optimizer
    from bitsandbytes.optim import AdamW8bit
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/bitsandbytes/__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 5, in <module>
    import bitsandbytes.functional as F
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/bitsandbytes/functional.py", line 13, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/bitsandbytes/cextension.py", line 113, in <module>
    lib = CUDASetup.get_instance().lib
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/bitsandbytes/cextension.py", line 109, in get_instance
    cls._instance.initialize()
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/bitsandbytes/cextension.py", line 59, in initialize
    binary_name, cudart_path, cuda, cc, cuda_version_string = evaluate_cuda_setup()
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py", line 125, in evaluate_cuda_setup
    cuda_version_string = get_cuda_version(cuda, cudart_path)
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py", line 45, in get_cuda_version
    check_cuda_result(cuda, cudart.cudaRuntimeGetVersion(ctypes.byref(version)))
  File "/usr/lib64/python3.10/ctypes/__init__.py", line 387, in __getattr__
    func = self.__getitem__(name)
  File "/usr/lib64/python3.10/ctypes/__init__.py", line 392, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: python3: undefined symbol: cudaRuntimeGetVersion
python3: undefined symbol: cudaRuntimeGetVersion
WARNING: Using default optimizer (AdamW from Torch)
Found 0 reg images.
Preparing dataset...
Init dataset!
Preparing Dataset (With Caching)
Loading cached latents...
Bucket 0 (512, 512, 0) - Instance Images: 4 | Class Images: 0 | Max Examples/batch: 4
Total Buckets 1 - Instance Images: 4 | Class Images: 0 | Max Examples/batch: 4
Total images / batch: 4, total examples: 4
Total dataset length (steps): 4
Initializing bucket counter!
Steps:   0%| 0/400 [00:00<?, ?it/s]
OOM Detected, reducing batch/grad size to 0/1.
Traceback (most recent call last):
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/memory.py", line 119, in decorator
    return function(batch_size, grad_size, prof, *args, **kwargs)
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 1355, in inner_loop
    optimizer.step()
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/accelerate/optimizer.py", line 140, in step
    self.optimizer.step(closure)
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/torch/optim/adamw.py", line 171, in step
    adamw(
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/torch/optim/adamw.py", line 321, in adamw
    func(
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/venv/lib64/python3.10/site-packages/torch/optim/adamw.py", line 566, in _multi_tensor_adamw
    denom = torch._foreach_add(exp_avg_sq_sqrt, eps)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 23.69 GiB total capacity; 22.18 GiB already allocated; 70.50 MiB free; 22.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps:   0%| 0/400 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/ui_functions.py", line 729, in start_training
    result = main(class_gen_method=class_gen_method)
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 1548, in main
    return inner_loop()
  File "/home/david/Development/machine-learning/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/memory.py", line 117, in decorator
    raise RuntimeError("No executable batch size found, reached zero.")
RuntimeError: No executable batch size found, reached zero.
Restored system models.
Duration: 00:00:04
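
For readers hitting the same wall: the "OOM Detected, reducing batch/grad size" and "No executable batch size found, reached zero" lines come from the extension's retry wrapper in dreambooth/memory.py, which re-runs the training loop with a smaller batch size after each CUDA OOM and gives up at zero. A minimal sketch of that pattern (simplified, not the extension's exact code):

```python
import functools

import torch

# Simplified sketch of an OOM-retry wrapper: halve the batch size on each
# CUDA OOM and give up when it reaches zero, as seen in the log above.
def find_executable_batch_size(starting_batch_size: int):
    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            batch_size = starting_batch_size
            while batch_size > 0:
                try:
                    return function(batch_size, *args, **kwargs)
                except torch.cuda.OutOfMemoryError:
                    torch.cuda.empty_cache()
                    batch_size //= 2
                    print(f"OOM detected, reducing batch size to {batch_size}.")
            raise RuntimeError("No executable batch size found, reached zero.")
        return wrapper
    return decorator
```

Note that in this log the retry could never succeed: roughly 22 GiB were already allocated before the optimizer step at the smallest batch size, so the failure is environmental rather than a batch-size problem. The allocator hint in the error message (launching with PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512) only mitigates fragmentation, not genuine exhaustion of a 24 GB card.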

Additional information

I'm using Fedora 38 with CUDA 11.8 and NVIDIA driver 535 (although I also tried downgrading to 525).
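
Regarding the "undefined symbol: cudaRuntimeGetVersion" error earlier in the log: it indicates that bitsandbytes 0.35.4 failed to locate a usable CUDA runtime library and ended up resolving the symbol against the Python binary itself. A quick way to probe what it would find, sketched under the assumption that the runtime is named libcudart.so (the name and path vary by system):

```python
import ctypes

# Probe whether a loadable CUDA runtime exposes the symbol bitsandbytes needs.
# "libcudart.so" is an assumed name; some installs only ship a versioned file
# (e.g. /usr/local/cuda/lib64/libcudart.so.11.0), so adjust the path as needed.
try:
    cudart = ctypes.CDLL("libcudart.so")
    version = ctypes.c_int(0)
    cudart.cudaRuntimeGetVersion(ctypes.byref(version))
    print(f"CUDA runtime version: {version.value}")  # 11080 would mean 11.8
except (OSError, AttributeError) as err:
    # OSError: library not found; AttributeError: symbol missing (as in the log)
    print(f"CUDA runtime probe failed: {err}")
```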

grantrosario commented 1 year ago

Also having the same issue with an RTX 2080 Ti (11 GB VRAM).

David-337 commented 1 year ago

Update: solved by fully nuking the NVIDIA drivers, CUDA toolkit, and any other NVIDIA- or CUDA-related packages from the system, then reinstalling the latest NVIDIA driver from RPM Fusion and CUDA 11.8 from https://developer.download.nvidia.com/compute/cuda/repos/fedora35/x86_64/.
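
For anyone applying the same fix, a quick sanity check from the webui's venv can confirm the reinstall took (the expected values here match this particular torch 2.0.1+cu118 setup):

```python
import torch

# Post-reinstall sanity check: PyTorch should see the GPU and report the
# CUDA runtime it was built against (11.8 for torch 2.0.1+cu118).
print(torch.cuda.is_available())       # expect True
print(torch.version.cuda)              # expect "11.8"
print(torch.cuda.get_device_name(0))   # expect the 3090 Ti

# The import that failed in the log above should now succeed as well:
from bitsandbytes.optim import AdamW8bit  # noqa: F401
print("bitsandbytes 8-bit AdamW import OK")
```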