d8ahazard / sd_dreambooth_extension


OOM when training on an 8GB GPU #658

Closed DanielWeiner closed 1 year ago

DanielWeiner commented 1 year ago

Please find the following lines in the console and paste them below. If you do not provide this information, your issue will be automatically closed.

` Python revision: 3.9.13 (main, Aug 25 2022, 23:26:10) [GCC 11.2.0]
Dreambooth revision: 4ca69a904f5ddd5651d87032b3dca515eea505ba
SD-WebUI revision: e672cfb07418a1a3130d3bf21c14a0d3819f81fb

Checking Dreambooth requirements...
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.15+e163309.d20230101 installed.
[+] torch version 1.13.1 installed.
[+] torchvision version 0.14.1 installed. `

Have you read the Readme? Yes

Have you completely restarted the stable-diffusion-webUI, not just reloaded the UI? Yes

Have you updated Dreambooth to the latest revision? Yes

Have you updated the Stable-Diffusion-WebUI to the latest version? Yes

No, really. Please save us both some trouble and update the SD-WebUI and Extension and restart before posting this. Reply 'OK' Below to acknowledge that you did this. OK

Describe the bug

Until today I've been able to consistently train using 8Bit Adam, LORA, fp16, xformers, and no cached latents. Today with the new updates, I'm getting OOM. My GPU is RTX 3070Ti 8GB Laptop.

Provide logs

Returning ['xformers', False, False, 1, '', '', 0.0, 60.0, 1, True, True, 50.0, False, False, 1e-06, 1e-06, 0.0002, '', 0.0002, 1, 1, 1, 0.5, 1, 0.5, 'constant', 0, 75, 'fp16', 200, True, '', 1, 512, 1, '', 420420.0, False, False, False, 1, True, False, True, 1, False, False, False, True, 4, 150.0, True, False, False, True, True, '', 7.5, 40, '', '', 'quark', '/home/daniel/quark', '[filewords]', 'quark', 4, 0, -1, 7.5, 40, '', '', '', '', 7.5, 60, '', '', '', '', '', '', 1, 0, -1, 7.5, 60, '', '', '', '', 7.5, 60, '', '', '', '', '', '', 1, 0, -1, 7.5, 60, '', '', '', 'Loaded config.']
Saved settings.
Custom model name is
Starting Dreambooth training...
Initializing dreambooth training...
Replace CrossAttention.forward to use xformers
Injecting trainable lora...

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
prepare train images.
20 train images with repeating.
0 reg images.
prepare dataset
Preparing dataset with buckets...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 258.81it/s]
Bucket 0: Resolution (144, 512), Count: 0
Bucket 1: Resolution (208, 512), Count: 0
Bucket 2: Resolution (272, 512), Count: 0
Bucket 3: Resolution (336, 512), Count: 0
Bucket 4: Resolution (400, 512), Count: 0
Bucket 5: Resolution (464, 512), Count: 0
Bucket 6: Resolution (512, 144), Count: 0
Bucket 7: Resolution (512, 208), Count: 0
Bucket 8: Resolution (512, 272), Count: 0
Bucket 9: Resolution (512, 336), Count: 0
Bucket 10: Resolution (512, 400), Count: 0
Bucket 11: Resolution (512, 464), Count: 0
Bucket 12: Resolution (512, 512), Count: 20
Sched breakpoint is 2000
  ***** Running training *****
  Instance Images: 20
  Class Images: 0
  Total Examples: 20
  Num batches each epoch = 5
  Num Epochs = 200
  Batch Size Per Device = 4
  Gradient Accumulation steps = 1
  Total train batch size (w. parallel, distributed & accumulation) = 5
  Total optimization steps = 4000
  Total training steps = 4000
  Resuming from checkpoint: False
  First resume epoch: 0
  First resume step: 0
  Lora: True, Adam: True, Prec: fp16
  Gradient Checkpointing: True, Text Enc Steps: 150.0
  EMA: False
  LR: 1e-06)
Steps:   0%|                                                                                                            | 0/4000 [00:00<?, ?it/s]OOM Detected, reducing batch/grad size to 2/1.
Traceback (most recent call last):
  File "/home/daniel/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/memory.py", line 86, in decorator
    return function(batch_size, grad_size, *args, **kwargs)
  File "/home/daniel/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 904, in inner_loop
    accelerator.backward(loss)
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/accelerate/accelerator.py", line 1314, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 8.00 GiB total capacity; 7.08 GiB already allocated; 0 bytes free; 7.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps:   0%|                                                                                                            | 0/4000 [00:12<?, ?it/s]
Replace CrossAttention.forward to use xformers
Injecting trainable lora...
prepare train images.
20 train images with repeating.
0 reg images.
prepare dataset
Preparing dataset with buckets...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 208.20it/s]
Bucket 0: Resolution (144, 512), Count: 0
Bucket 1: Resolution (208, 512), Count: 0
Bucket 2: Resolution (272, 512), Count: 0
Bucket 3: Resolution (336, 512), Count: 0
Bucket 4: Resolution (400, 512), Count: 0
Bucket 5: Resolution (464, 512), Count: 0
Bucket 6: Resolution (512, 144), Count: 0
Bucket 7: Resolution (512, 208), Count: 0
Bucket 8: Resolution (512, 272), Count: 0
Bucket 9: Resolution (512, 336), Count: 0
Bucket 10: Resolution (512, 400), Count: 0
Bucket 11: Resolution (512, 464), Count: 0
Bucket 12: Resolution (512, 512), Count: 20
Sched breakpoint is 2000
OOM Detected, reducing batch/grad size to 1/1.
Traceback (most recent call last):
  File "/home/daniel/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/memory.py", line 86, in decorator
    return function(batch_size, grad_size, *args, **kwargs)
  File "/home/daniel/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 529, in inner_loop
    unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/accelerate/accelerator.py", line 876, in prepare
    result = tuple(
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/accelerate/accelerator.py", line 877, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/accelerate/accelerator.py", line 741, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/accelerate/accelerator.py", line 912, in prepare_model
    model = model.to(self.device)
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 989, in to
    return self._apply(convert)
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  [Previous line repeated 7 more times]
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 664, in _apply
    param_applied = fn(param)
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 987, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 8.00 GiB total capacity; 7.15 GiB already allocated; 0 bytes free; 7.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Replace CrossAttention.forward to use xformers
Injecting trainable lora...
prepare train images.
20 train images with repeating.
0 reg images.
prepare dataset
Preparing dataset with buckets...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 217.28it/s]
Bucket 0: Resolution (144, 512), Count: 0
Bucket 1: Resolution (208, 512), Count: 0
Bucket 2: Resolution (272, 512), Count: 0
Bucket 3: Resolution (336, 512), Count: 0
Bucket 4: Resolution (400, 512), Count: 0
Bucket 5: Resolution (464, 512), Count: 0
Bucket 6: Resolution (512, 144), Count: 0
Bucket 7: Resolution (512, 208), Count: 0
Bucket 8: Resolution (512, 272), Count: 0
Bucket 9: Resolution (512, 336), Count: 0
Bucket 10: Resolution (512, 400), Count: 0
Bucket 11: Resolution (512, 464), Count: 0
Bucket 12: Resolution (512, 512), Count: 20
Sched breakpoint is 2000
  ***** Running training *****
  Instance Images: 20
  Class Images: 0
  Total Examples: 20
  Num batches each epoch = 20
  Num Epochs = 200
  Batch Size Per Device = 1
  Gradient Accumulation steps = 1
  Total train batch size (w. parallel, distributed & accumulation) = 20
  Total optimization steps = 4000
  Total training steps = 4000
  Resuming from checkpoint: False
  First resume epoch: 0
  First resume step: 0
  Lora: True, Adam: True, Prec: fp16
  Gradient Checkpointing: True, Text Enc Steps: 150.0
  EMA: False
  LR: 1e-06)
Steps:   0%|                                                                                                            | 0/4000 [00:00<?, ?it/s]OOM Detected, reducing batch/grad size to 0/1.
Traceback (most recent call last):
  File "/home/daniel/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/memory.py", line 86, in decorator
    return function(batch_size, grad_size, *args, **kwargs)
  File "/home/daniel/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 904, in inner_loop
    accelerator.backward(loss)
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/accelerate/accelerator.py", line 1314, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/daniel/stable-diffusion-webui/venv/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 8.00 GiB total capacity; 7.20 GiB already allocated; 0 bytes free; 7.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps:   0%|                                                                                                            | 0/4000 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "/home/daniel/stable-diffusion-webui/extensions/sd_dreambooth_extension/scripts/dreambooth.py", line 569, in start_training
    result = main(config, use_subdir=use_subdir, lora_model=lora_model_name,
  File "/home/daniel/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 1024, in main
    return inner_loop()
  File "/home/daniel/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/memory.py", line 84, in decorator
    raise RuntimeError("No executable batch size found, reached zero.")
RuntimeError: No executable batch size found, reached zero.
Training completed, reloading SD Model.
Restored system models.
Returning result: Exception training model: No executable batch size found, reached zero.
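
For readers following the log: the repeated "OOM Detected, reducing batch/grad size to ..." lines look like a retry wrapper that halves the batch size after each out-of-memory error and gives up at zero. Below is a minimal sketch of that pattern. It is not the extension's actual `memory.py`; the names `find_executable_batch_size` and `FakeOutOfMemoryError` are assumptions for illustration, and the OOM is simulated so the sketch runs without a GPU.

```python
import functools


class FakeOutOfMemoryError(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError (assumption, so no GPU is needed)."""


def find_executable_batch_size(starting_batch_size, starting_grad_size=1):
    """Retry the wrapped function, halving batch_size after every simulated OOM."""
    def wrapper(function):
        @functools.wraps(function)
        def decorator(*args, **kwargs):
            batch_size, grad_size = starting_batch_size, starting_grad_size
            while True:
                if batch_size == 0:
                    # Same message as the final error in the log above.
                    raise RuntimeError("No executable batch size found, reached zero.")
                try:
                    return function(batch_size, grad_size, *args, **kwargs)
                except FakeOutOfMemoryError:
                    batch_size //= 2
                    print(f"OOM Detected, reducing batch/grad size to {batch_size}/{grad_size}.")
        return decorator
    return wrapper


attempts = []


@find_executable_batch_size(starting_batch_size=4)
def inner_loop(batch_size, grad_size):
    # Hypothetical training loop that never fits in memory.
    attempts.append(batch_size)
    raise FakeOutOfMemoryError("CUDA out of memory (simulated)")


try:
    inner_loop()
except RuntimeError as err:
    final_error = str(err)
```

With a starting batch size of 4 the sketch tries 4, 2 and 1 before giving up, which matches the sequence in the log. When even batch size 1 fails, the final error only says that nothing fit; the real cause is the OOM reported earlier.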

If a crash has occurred, please provide the entire stack trace from the log, including the last few log messages before the crash occurred.

Environment

What OS?

Windows, WSL2

What GPU are you using?

RTX 3070Ti 8GB Laptop

Screenshots/Config

If the issue is specific to an error while training, please provide a screenshot of training parameters or the db_config.json file from /models/dreambooth/MODELNAME/db_config.json

{
    "attention": "xformers",
    "cache_latents": false,
    "center_crop": false,
    "clip_skip": 1,
    "concepts_path": "",
    "custom_model_name": "",
    "epoch": 0,
    "epoch_pause_frequency": 0.0,
    "epoch_pause_time": 60.0,
    "gradient_accumulation_steps": 1,
    "gradient_checkpointing": true,
    "gradient_set_to_none": true,
    "graph_smoothing": 50.0,
    "half_model": false,
    "hflip": false,
    "learning_rate": 1e-06,
    "learning_rate_min": 1e-06,
    "lora_learning_rate": 0.0002,
    "lora_model_name": "",
    "lora_txt_learning_rate": 0.0002,
    "lora_txt_weight": 1,
    "lora_weight": 1,
    "lr_cycles": 1,
    "lr_factor": 0.5,
    "lr_power": 1,
    "lr_scale_pos": 0.5,
    "lr_scheduler": "constant",
    "lr_warmup_steps": 0,
    "max_token_length": 75,
    "mixed_precision": "fp16",
    "model_dir": "/home/daniel/stable-diffusion-webui/models/dreambooth/quark_3700",
    "model_name": "quark_3700",
    "num_train_epochs": 200,
    "pad_tokens": true,
    "pretrained_model_name_or_path": "/home/daniel/stable-diffusion-webui/models/dreambooth/quark_3700/working",
    "pretrained_vae_name_or_path": "",
    "prior_loss_weight": 1,
    "resolution": 512,
    "revision": 0,
    "sample_batch_size": 1,
    "sanity_prompt": "",
    "sanity_seed": 420420.0,
    "save_ckpt_after": false,
    "save_ckpt_cancel": false,
    "save_ckpt_during": false,
    "save_embedding_every": 1,
    "save_lora_after": true,
    "save_lora_cancel": false,
    "save_lora_during": true,
    "save_preview_every": 1,
    "save_state_after": false,
    "save_state_cancel": false,
    "save_state_during": false,
    "src": "/home/daniel/stable-diffusion-webui/models/Stable-diffusion/quark_3700_6000_lora.ckpt",
    "shuffle_tags": true,
    "train_batch_size": 4,
    "stop_text_encoder": 150.0,
    "use_8bit_adam": true,
    "use_concepts": false,
    "use_ema": false,
    "use_lora": true,
    "use_subdir": true,
    "scheduler": "ddim",
    "v2": false,
    "has_ema": "True",
    "concepts_list": [
        {
            "instance_data_dir": "/home/daniel/quark",
            "class_data_dir": "",
            "instance_prompt": "[filewords]",
            "class_prompt": "",
            "save_sample_prompt": "",
            "save_sample_template": "",
            "instance_token": "quark",
            "class_token": "quark",
            "num_class_images": 0,
            "class_negative_prompt": "",
            "class_guidance_scale": 7.5,
            "class_infer_steps": 40,
            "save_sample_negative_prompt": "",
            "n_save_sample": 4,
            "sample_seed": -1,
            "save_guidance_scale": 7.5,
            "save_infer_steps": 40
        }
    ],
    "lifetime_revision": 0
}
d8ahazard commented 1 year ago

What happens if you set "Gradient Accumulation Steps" to 4?
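
For context on this suggestion, here is a minimal, framework-free sketch of what gradient accumulation does. It assumes a toy one-parameter model with a mean-squared-error loss; the data and the helper `grad_mse` are made up for illustration and have nothing to do with the extension's code. The point is that summing the scaled gradients of four micro-batches of size 1 gives the same gradient as one batch of size 4, while only one sample is processed at a time.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Toy data (illustrative values only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

# One large batch of 4 samples.
full_batch_grad = grad_mse(w, xs, ys)

# Four micro-batches of 1 sample each, accumulated before a single update.
accumulation_steps = 4
accumulated_grad = 0.0
for x, y in zip(xs, ys):
    # Scale each micro-batch gradient so the sum equals the full-batch mean.
    accumulated_grad += grad_mse(w, [x], [y]) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```

Accumulation can only help when a single micro-batch fits in memory. The log above fails even at batch size 1, so a larger accumulation value would not be expected to avoid that particular failure.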

DanielWeiner commented 1 year ago

Same issue

DanielWeiner commented 1 year ago
image

I'm also seeing some VRAM not being freed after the training attempt. Left hand side of the graph is before training, then some spikes during training setup, then a bit higher VRAM after it aborts.

slimjim12954 commented 1 year ago

Same here

evtapp commented 1 year ago

same issue

randaller commented 1 year ago

Same here: 3070 Ti 8 GB desktop, worked well before.

henryvii99 commented 1 year ago

Same issue, no problem before updating. I am using a 3060 12GB to train with the same data set and parameters. It still worked before lunch :) Just curious: since the UI changed a lot, could that be causing the problem?

deepseareo commented 1 year ago

same

DanielWeiner commented 1 year ago

@d8ahazard until this is fixed, is there a commit hash you recommend reverting to?

DoughyInTheMiddle commented 1 year ago

Between this extension changing and the core A1111 issuing bunches of updates in the last 24 hours (seemed like every time I restarted, I got more updates), I wasn't sure what was the problem. Glad to see we're all in the same boat.

First config.json and then the error below.

{
    "attention": "xformers",
    "cache_latents": true,
    "center_crop": false,
    "clip_skip": 1,
    "concepts_path": "",
    "custom_model_name": "",
    "epoch": 0,
    "epoch_pause_frequency": 0.0,
    "epoch_pause_time": 0.0,
    "gradient_accumulation_steps": 1,
    "gradient_checkpointing": false,
    "gradient_set_to_none": true,
    "graph_smoothing": 50.0,
    "half_model": false,
    "hflip": false,
    "learning_rate": 2e-06,
    "learning_rate_min": 1e-06,
    "lora_learning_rate": 0.0002,
    "lora_model_name": "",
    "lora_txt_learning_rate": 0.0002,
    "lora_txt_weight": 1,
    "lora_weight": 1,
    "lr_cycles": 1,
    "lr_factor": 0.5,
    "lr_power": 1,
    "lr_scale_pos": 0.5,
    "lr_scheduler": "cosine",
    "lr_warmup_steps": 0,
    "max_token_length": 75,
    "mixed_precision": "fp16",
    "model_dir": "G:\\GitHub\\SDWebUI\\models\\dreambooth\\accjrdb",
    "model_name": "accjrdb",
    "num_train_epochs": 100,
    "pad_tokens": false,
    "pretrained_model_name_or_path": "G:\\GitHub\\SDWebUI\\models\\dreambooth\\accjrdb\\working",
    "pretrained_vae_name_or_path": "",
    "prior_loss_weight": 1,
    "resolution": 768,
    "revision": 0,
    "sample_batch_size": 1,
    "sanity_prompt": "",
    "sanity_seed": 420420.0,
    "save_ckpt_after": true,
    "save_ckpt_cancel": false,
    "save_ckpt_during": false,
    "save_embedding_every": 25,
    "save_lora_after": true,
    "save_lora_cancel": false,
    "save_lora_during": false,
    "save_preview_every": 0,
    "save_state_after": false,
    "save_state_cancel": false,
    "save_state_during": false,
    "src": "G:\\GitHub\\SDWebUI\\models\\Stable-diffusion\\v2-1_768-ema-pruned.ckpt",
    "shuffle_tags": false,
    "train_batch_size": 1,
    "stop_text_encoder": 75.0,
    "use_8bit_adam": true,
    "use_concepts": false,
    "use_ema": false,
    "use_lora": true,
    "use_subdir": true,
    "scheduler": "ddim",
    "v2": true,
    "has_ema": "True",
    "concepts_list": [
        {
            "instance_data_dir": "G:\\Program Files\\BatchImageCropper\\WorkFolder\\Family\\Source Images\\Anthony\\Processed768DB",
            "class_data_dir": "",
            "instance_prompt": "[filewords]",
            "class_prompt": "[filewords]",
            "save_sample_prompt": "[filewords]",
            "save_sample_template": "",
            "instance_token": "",
            "class_token": "",
            "num_class_images": 140,
            "class_negative_prompt": "blurry, blurred, grainy, tiling, ugly, deformed, disfigured, extra limbs, bad anatomy, poorly drawn face, poorly drawn hands, poorly drawn feet, out of frame, body out of frame, watermark, signature, text, blocks, jpeg, jpg, cut off, draft.",
            "class_guidance_scale": 7.5,
            "class_infer_steps": 40,
            "save_sample_negative_prompt": "",
            "n_save_sample": 1,
            "sample_seed": -1,
            "save_guidance_scale": 7.5,
            "save_infer_steps": 40
        }
    ],
    "lifetime_revision": 0
}
Injecting trainable lora...
CUDA SETUP: Loading binary G:\GitHub\SDWebUI\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cudaall.dll...
prepare train images.
14 train images with repeating.
prepare reg images.
140 reg images.
prepare dataset
Caching latents with buckets...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 154/154 [00:31<00:00,  4.91it/s]
Bucket 0: Resolution (216, 768), Count: 0
Bucket 1: Resolution (280, 768), Count: 0
Bucket 2: Resolution (344, 768), Count: 0
Bucket 3: Resolution (408, 768), Count: 0
Bucket 4: Resolution (472, 768), Count: 0
Bucket 5: Resolution (536, 768), Count: 0
Bucket 6: Resolution (600, 768), Count: 0
Bucket 7: Resolution (664, 768), Count: 0
Bucket 8: Resolution (728, 768), Count: 0
Bucket 9: Resolution (768, 216), Count: 0
Bucket 10: Resolution (768, 280), Count: 0
Bucket 11: Resolution (768, 344), Count: 0
Bucket 12: Resolution (768, 408), Count: 0
Bucket 13: Resolution (768, 472), Count: 0
Bucket 14: Resolution (768, 536), Count: 0
Bucket 15: Resolution (768, 600), Count: 0
Bucket 16: Resolution (768, 664), Count: 0
Bucket 17: Resolution (768, 728), Count: 0
Bucket 18: Resolution (768, 768), Count: 28
Sched breakpoint is 700
  ***** Running training *****
  Instance Images: 14
  Class Images: 140
  Total Examples: 28
  Num batches each epoch = 28
  Num Epochs = 100
  Batch Size Per Device = 1
  Gradient Accumulation steps = 1
  Total train batch size (w. parallel, distributed & accumulation) = 28
  Total optimization steps = 1400
  Total training steps = 2800
  Resuming from checkpoint: False
  First resume epoch: 0
  First resume step: 0
  Lora: True, Adam: True, Prec: fp16
  Gradient Checkpointing: False, Text Enc Steps: 75.0
  EMA: False
  LR: 2e-06)
Steps:   0%|                                                                                                                                                                         | 0/2800 [00:00<?, ?it/s]OOM Detected, reducing batch/grad size to 0/1.
Traceback (most recent call last):
  File "G:\GitHub\SDWebUI\extensions\sd_dreambooth_extension\dreambooth\memory.py", line 86, in decorator
    return function(batch_size, grad_size, *args, **kwargs)
  File "G:\GitHub\SDWebUI\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 888, in inner_loop
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "G:\GitHub\SDWebUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "G:\GitHub\SDWebUI\venv\lib\site-packages\accelerate\utils\operations.py", line 490, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "G:\GitHub\SDWebUI\venv\lib\site-packages\torch\amp\autocast_mode.py", line 12, in decorate_autocast
    return func(*args, **kwargs)
  File "G:\GitHub\SDWebUI\venv\lib\site-packages\diffusers\models\unet_2d_condition.py", line 407, in forward
    sample = upsample_block(
  File "G:\GitHub\SDWebUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "G:\GitHub\SDWebUI\venv\lib\site-packages\diffusers\models\unet_2d_blocks.py", line 1202, in forward
    hidden_states = resnet(hidden_states, temb)
  File "G:\GitHub\SDWebUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "G:\GitHub\SDWebUI\venv\lib\site-packages\diffusers\models\resnet.py", line 474, in forward
    hidden_states = self.conv2(hidden_states)
  File "G:\GitHub\SDWebUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "G:\GitHub\SDWebUI\venv\lib\site-packages\torch\nn\modules\conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "G:\GitHub\SDWebUI\venv\lib\site-packages\torch\nn\modules\conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 8.00 GiB total capacity; 7.21 GiB already allocated; 0 bytes free; 7.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps:   0%|                                                                                                                                                                         | 0/2800 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "G:\GitHub\SDWebUI\extensions\sd_dreambooth_extension\scripts\dreambooth.py", line 569, in start_training
    result = main(config, use_subdir=use_subdir, lora_model=lora_model_name,
  File "G:\GitHub\SDWebUI\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 1024, in main
    return inner_loop()
  File "G:\GitHub\SDWebUI\extensions\sd_dreambooth_extension\dreambooth\memory.py", line 84, in decorator
    raise RuntimeError("No executable batch size found, reached zero.")
RuntimeError: No executable batch size found, reached zero.
Training completed, reloading SD Model.
Restored system models.
Returning result: Exception training model: No executable batch size found, reached zero.
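
A side note on reading the OOM lines in these logs: the message itself suggests `max_split_size_mb` only when reserved memory is much larger than allocated memory. The sketch below pulls the numbers out of such a message and computes the gap. The regular expression and the helper name `summarize_oom` are assumptions written for this message format, not part of PyTorch.

```python
import re

# Pattern written for the message format seen in the logs above (assumption).
OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<request>[\d.]+) MiB .*?"
    r"(?P<total>[\d.]+) GiB total capacity; "
    r"(?P<allocated>[\d.]+) GiB already allocated; .*?"
    r"(?P<reserved>[\d.]+) GiB reserved"
)


def summarize_oom(message):
    """Return the memory figures from a CUDA OOM message, or None if it does not match."""
    match = OOM_PATTERN.search(message)
    if match is None:
        return None
    allocated = float(match["allocated"])
    reserved = float(match["reserved"])
    return {
        "request_mib": float(match["request"]),
        "total_gib": float(match["total"]),
        "allocated_gib": allocated,
        "reserved_gib": reserved,
        # Memory held by PyTorch's caching allocator but not used by tensors.
        "cached_but_unused_mib": (reserved - allocated) * 1024,
    }


message = (
    "CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 8.00 GiB total capacity; "
    "7.21 GiB already allocated; 0 bytes free; 7.28 GiB reserved in total by PyTorch)"
)
summary = summarize_oom(message)
print(summary)
```

For the message above the gap between reserved and allocated is only about 72 MiB, so reserved is not much larger than allocated. That would suggest fragmentation is not the main factor here and the card is simply full, though this is only a reading of the numbers, not a diagnosis.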
ZeroCool940711 commented 1 year ago

I'm getting the same issue: I can no longer train on 8GB, and after a lot of trial and error I end up with the same error as in @DoughyInTheMiddle's last message. It was working perfectly with the previous version of the extension. I decided to hit update in the UI and, damn, never have I hated myself more: the whole layout of the Dreambooth tab changed, configs no longer work, and there are OOM errors everywhere. Now I can't continue training and have no idea how to revert to the previous version of the extension from when it was working. Can anyone help me figure out the commit where the extension UI layout was changed so I can roll back to it?

tykim9999 commented 1 year ago

Did anyone solve the problem? haha

ZeroCool940711 commented 1 year ago

> Did anyone solve the problem? haha

For me the only way to use the extension on an 8GB graphics card was to roll back to this commit. Anything after that just doesn't work anymore, at least for me; even the commit right after it throws errors. So I guess I'll be using it until automatic's UI gets updated again and breaks it. Using Dreambooth with low resources is not something any developer seems to care about, since they usually have good graphics cards and don't care about those of us with old or cheap ones, so I won't get my hopes up about it being fixed in the latest version.

ArrowM commented 1 year ago

0901c17 will fix some of the VRAM issues introduced yesterday.

ovladuk commented 1 year ago

Has anyone got it working on 8GB VRAM yet?

DanielWeiner commented 1 year ago

Nope, just did a fresh install. Still OOM.

ovladuk commented 1 year ago

> Nope, just did a fresh install. Still OOM.

I used this version and it worked for me. You would need to download the repo and put it in the extensions folder yourself.

https://github.com/d8ahazard/sd_dreambooth_extension/tree/c5cb58328c555ac27679422b1da940a9b19de6f2
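
If the extension folder is a git clone rather than a downloaded copy, pinning it to that commit could also be done with git. The sketch below is an alternative under that assumption, not what was described above. The helper name `pin_extension` is hypothetical, the extension path is taken from the logs earlier in this thread, and the function only builds the commands unless `dry_run=False` is passed.

```python
import subprocess
from pathlib import Path

# Commit hash from the link above.
COMMIT = "c5cb58328c555ac27679422b1da940a9b19de6f2"

# Path layout as seen in the logs in this thread; adjust to your install (assumption).
EXTENSION_DIR = Path("stable-diffusion-webui/extensions/sd_dreambooth_extension")


def pin_extension(extension_dir, commit, dry_run=True):
    """Build (and optionally run) the git commands that check out one commit."""
    commands = [
        ["git", "-C", str(extension_dir), "fetch", "origin"],
        ["git", "-C", str(extension_dir), "checkout", commit],
    ]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if git exits with an error.
            subprocess.run(command, check=True)
    return commands


commands = pin_extension(EXTENSION_DIR, COMMIT)
for command in commands:
    print(" ".join(command))
```

Checking out a bare commit leaves the clone in a detached HEAD state, and a later update of the extension may move it off this commit again.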

kotaxyz commented 1 year ago

I get this error:

`Dreambooth revision: SD-WebUI revision:

Checking Dreambooth requirements... [+] bitsandbytes version 0.35.0 installed. [+] diffusers version 0.10.2 installed. [+] transformers version 4.25.1 installed. [+] xformers version 0.0.14.dev0 installed. [+] torch version 1.12.1+cu116 installed. [+] torchvision version 0.13.1+cu116 installed. #######################################################################################################

Installing requirements for dataset-tag-editor [onnxruntime-gpu]

Launching Web UI with arguments: --xformers
Traceback (most recent call last):
  File "I:\stablediffusion\stable-diffusion-webui\launch.py", line 252, in <module>
    start()
  File "I:\stablediffusion\stable-diffusion-webui\launch.py", line 243, in start
    import webui
  File "I:\stablediffusion\stable-diffusion-webui\webui.py", line 12, in <module>
    from modules import devices, sd_samplers, upscaler, extensions
  File "I:\stablediffusion\stable-diffusion-webui\modules\sd_samplers.py", line 11, in <module>
    from modules import prompt_parser, devices, processing, images
  File "I:\stablediffusion\stable-diffusion-webui\modules\processing.py", line 14, in <module>
    import modules.sd_hijack
  File "I:\stablediffusion\stable-diffusion-webui\modules\sd_hijack.py", line 10, in <module>
    import modules.textual_inversion.textual_inversion
  File "I:\stablediffusion\stable-diffusion-webui\modules\textual_inversion\textual_inversion.py", line 13, in <module>
    from modules import shared, devices, sd_hijack, processing, sd_models, images
  File "I:\stablediffusion\stable-diffusion-webui\modules\shared.py", line 15, in <module>
    import modules.sd_models
  File "I:\stablediffusion\stable-diffusion-webui\modules\sd_models.py", line 14, in <module>
    from modules.sd_hijack_inpainting import do_inpainting_hijack, should_hijack_inpainting
  File "I:\stablediffusion\stable-diffusion-webui\modules\sd_hijack_inpainting.py", line 6, in <module>
    import ldm.models.diffusion.ddpm
  File "I:\stablediffusion\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\diffusion\ddpm.py", line 12, in <module>
    import pytorch_lightning as pl
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\pytorch_lightning\__init__.py", line 34, in <module>
    from pytorch_lightning.callbacks import Callback  # noqa: E402
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\pytorch_lightning\callbacks\__init__.py", line 14, in <module>
    from pytorch_lightning.callbacks.callback import Callback
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\pytorch_lightning\callbacks\callback.py", line 25, in <module>
    from pytorch_lightning.utilities.types import STEP_OUTPUT
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\pytorch_lightning\utilities\types.py", line 28, in <module>
    from torchmetrics import Metric
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torchmetrics\__init__.py", line 14, in <module>
    from torchmetrics import functional  # noqa: E402
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torchmetrics\functional\__init__.py", line 77, in <module>
    from torchmetrics.functional.text.bleu import bleu_score
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torchmetrics\functional\text\__init__.py", line 30, in <module>
    from torchmetrics.functional.text.bert import bert_score  # noqa: F401
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torchmetrics\functional\text\bert.py", line 24, in <module>
    from torchmetrics.functional.text.helper_embedding_metric import (
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torchmetrics\functional\text\helper_embedding_metric.py", line 26, in <module>
    from transformers import AutoModelForMaskedLM, AutoTokenizer, PreTrainedModel, PreTrainedTokenizerBase
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\transformers\__init__.py", line 30, in <module>
    from . import dependency_versions_check
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\transformers\dependency_versions_check.py", line 17, in <module>
    from .utils.versions import require_version, require_version_core
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\transformers\utils\__init__.py", line 34, in <module>
    from .generic import (
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\transformers\utils\generic.py", line 33, in <module>
    import tensorflow as tf
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\__init__.py", line 37, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\__init__.py", line 45, in <module>
    from tensorflow.python.feature_column import feature_column_lib as feature_column
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\feature_column\feature_column_lib.py", line 18, in <module>
    from tensorflow.python.feature_column.feature_column import *
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\feature_column\feature_column.py", line 143, in <module>
    from tensorflow.python.layers import base
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\layers\base.py", line 16, in <module>
    from tensorflow.python.keras.legacy_tf_layers import base
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\keras\__init__.py", line 25, in <module>
    from tensorflow.python.keras import models
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\keras\models.py", line 19, in <module>
    from tensorflow.python.keras import backend
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\keras\backend.py", line 50, in <module>
    from tensorflow.python.keras.distribute import distribute_coordinator_utils as dc
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\keras\distribute\distribute_coordinator_utils.py", line 33, in <module>
    from tensorflow.python.training import monitored_session
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 22, in <module>
    from tensorflow.python.checkpoint import checkpoint as trackable_util
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\checkpoint\checkpoint.py", line 29, in <module>
    from tensorflow.python.checkpoint import functional_saver
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\checkpoint\functional_saver.py", line 185, in <module>
    class MultiDeviceSaver(object):
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\checkpoint\functional_saver.py", line 282, in MultiDeviceSaver
    def _traced_save(self, file_prefix):
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\eager\polymorphic_function\polymorphic_function.py", line 1611, in decorated
    decorator_func=Function(
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\eager\polymorphic_function\polymorphic_function.py", line 555, in __init__
    self._function_spec = function_spec_lib.FunctionSpec.from_function_and_signature(
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\eager\polymorphic_function\function_spec.py", line 140, in from_function_and_signature
    return FunctionSpec(
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\eager\polymorphic_function\function_spec.py", line 206, in __init__
    self._function_type = self._make_function_type()
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\eager\polymorphic_function\function_spec.py", line 272, in _make_function_type
    type_constraint = trace_type.from_value(
AttributeError: module 'tensorflow.core.function.trace_type' has no attribute 'from_value'
Press any key to continue . . .

`
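The traceback above fails while `transformers` runs `import tensorflow`, so it may help to see which versions of the packages in that import chain are present in the webui's venv. Here is a minimal sketch, assuming it is run with the venv's Python interpreter; `installed_versions` is a hypothetical helper written for illustration, not part of the webui or the Dreambooth extension.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distributions that appear in the traceback's import chain.
    chain = ["pytorch-lightning", "torchmetrics", "transformers", "tensorflow"]
    for name, version in installed_versions(chain).items():
        print(f"{name}: {version or 'not installed'}")
```

If `tensorflow` shows up in the output, that would at least confirm it is installed in the same venv. Whether removing or reinstalling it resolves the error is an assumption and is not confirmed anywhere in this thread.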

ovladuk commented 1 year ago

Have you got AUTOMATIC1111 set up to update itself when you run it, using git pull?

kotaxyz commented 1 year ago

Thanks a lot, ovladuk, for your reply. I was able to fix it by downloading AUTOMATIC1111 from gitgud. I also used the version you mentioned; it worked, and I don't get OOM any more.

ovladuk commented 1 year ago

Do you have --xformers installed? And are you enabling LoRA?

kotaxyz commented 1 year ago

Yes, I have enabled xformers and LoRA, and the training finished successfully. These are the parameters I used: lora dreambooth

Gor80hd commented 1 year ago

I get this error, and sometimes "Please configure some concepts." On the version indicated above, everything works (only training on 2 and 2.1 does not work).