Closed: DanielWeiner closed this issue 1 year ago.
What happens if you set "Gradient Accumulation Steps" to 4?
Same issue
I'm also seeing some VRAM not being freed after the training attempt. The left-hand side of the graph is before training, then there are some spikes during training setup, and VRAM sits a bit higher after it aborts.
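In case it helps narrow this down, here is a minimal sketch (plain PyTorch calls, not part of the extension) for checking whether the VRAM that stays occupied after an aborted run is actually allocated or only cached by PyTorch's allocator. It assumes you run it in the same venv/process that did the training, e.g. from a debug console.

```python
# Minimal check: is the leftover VRAM allocated tensors or just PyTorch's cache?
import gc
import torch

def report_vram(tag: str) -> None:
    """Print allocated vs. reserved CUDA memory in MiB."""
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")

report_vram("before cleanup")
gc.collect()              # drop unreferenced Python objects that may still hold tensors
torch.cuda.empty_cache()  # return reserved-but-unallocated blocks to the driver
report_vram("after cleanup")
```

If `allocated` stays high after the cleanup, something is still holding references to tensors; if only `reserved` was high, it was just the allocator cache.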
Same here
same issue
Same here, 3070 Ti 8 GB desktop; it worked well before.
Same issue, and no problem before updating. I'm using a 3060 12GB to train the same set of data with the same parameters, and it was still working before lunch :) Just curious: since the UI changed a lot, could that be causing the problem?
same
@d8ahazard until this is fixed, is there a commit hash you recommend reverting to?
Between this extension changing and the core A1111 issuing bunches of updates in the last 24 hours (it seemed like every time I restarted, I got more updates), I wasn't sure what the problem was. Glad to see we're all in the same boat.
First the config.json, and then the error below.
```json
{
"attention": "xformers",
"cache_latents": true,
"center_crop": false,
"clip_skip": 1,
"concepts_path": "",
"custom_model_name": "",
"epoch": 0,
"epoch_pause_frequency": 0.0,
"epoch_pause_time": 0.0,
"gradient_accumulation_steps": 1,
"gradient_checkpointing": false,
"gradient_set_to_none": true,
"graph_smoothing": 50.0,
"half_model": false,
"hflip": false,
"learning_rate": 2e-06,
"learning_rate_min": 1e-06,
"lora_learning_rate": 0.0002,
"lora_model_name": "",
"lora_txt_learning_rate": 0.0002,
"lora_txt_weight": 1,
"lora_weight": 1,
"lr_cycles": 1,
"lr_factor": 0.5,
"lr_power": 1,
"lr_scale_pos": 0.5,
"lr_scheduler": "cosine",
"lr_warmup_steps": 0,
"max_token_length": 75,
"mixed_precision": "fp16",
"model_dir": "G:\\GitHub\\SDWebUI\\models\\dreambooth\\accjrdb",
"model_name": "accjrdb",
"num_train_epochs": 100,
"pad_tokens": false,
"pretrained_model_name_or_path": "G:\\GitHub\\SDWebUI\\models\\dreambooth\\accjrdb\\working",
"pretrained_vae_name_or_path": "",
"prior_loss_weight": 1,
"resolution": 768,
"revision": 0,
"sample_batch_size": 1,
"sanity_prompt": "",
"sanity_seed": 420420.0,
"save_ckpt_after": true,
"save_ckpt_cancel": false,
"save_ckpt_during": false,
"save_embedding_every": 25,
"save_lora_after": true,
"save_lora_cancel": false,
"save_lora_during": false,
"save_preview_every": 0,
"save_state_after": false,
"save_state_cancel": false,
"save_state_during": false,
"src": "G:\\GitHub\\SDWebUI\\models\\Stable-diffusion\\v2-1_768-ema-pruned.ckpt",
"shuffle_tags": false,
"train_batch_size": 1,
"stop_text_encoder": 75.0,
"use_8bit_adam": true,
"use_concepts": false,
"use_ema": false,
"use_lora": true,
"use_subdir": true,
"scheduler": "ddim",
"v2": true,
"has_ema": "True",
"concepts_list": [
{
"instance_data_dir": "G:\\Program Files\\BatchImageCropper\\WorkFolder\\Family\\Source Images\\Anthony\\Processed768DB",
"class_data_dir": "",
"instance_prompt": "[filewords]",
"class_prompt": "[filewords]",
"save_sample_prompt": "[filewords]",
"save_sample_template": "",
"instance_token": "",
"class_token": "",
"num_class_images": 140,
"class_negative_prompt": "blurry, blurred, grainy, tiling, ugly, deformed, disfigured, extra limbs, bad anatomy, poorly drawn face, poorly drawn hands, poorly drawn feet, out of frame, body out of frame, watermark, signature, text, blocks, jpeg, jpg, cut off, draft.",
"class_guidance_scale": 7.5,
"class_infer_steps": 40,
"save_sample_negative_prompt": "",
"n_save_sample": 1,
"sample_seed": -1,
"save_guidance_scale": 7.5,
"save_infer_steps": 40
}
],
"lifetime_revision": 0
}
```
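For anyone comparing settings in this thread, here is a small sketch (a hypothetical helper, not part of the extension) that loads a db_config.json like the one above and prints only the settings that tend to matter for VRAM. The path is the example from this config; adjust it to your own model directory.

```python
# Hypothetical helper: print the memory-relevant settings from a db_config.json.
import json
from pathlib import Path

# Example path taken from the config above -- change to your own model folder.
CONFIG_PATH = Path(r"G:\GitHub\SDWebUI\models\dreambooth\accjrdb\db_config.json")

# Settings that most directly influence VRAM use during training.
VRAM_KEYS = [
    "resolution", "train_batch_size", "gradient_accumulation_steps",
    "gradient_checkpointing", "mixed_precision", "attention",
    "use_8bit_adam", "use_lora", "use_ema", "cache_latents",
]

config = json.loads(CONFIG_PATH.read_text())
for key in VRAM_KEYS:
    print(f"{key}: {config.get(key)}")
```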
```
Injecting trainable lora...
CUDA SETUP: Loading binary G:\GitHub\SDWebUI\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cudaall.dll...
prepare train images.
14 train images with repeating.
prepare reg images.
140 reg images.
prepare dataset
Caching latents with buckets...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 154/154 [00:31<00:00, 4.91it/s]
Bucket 0: Resolution (216, 768), Count: 0
Bucket 1: Resolution (280, 768), Count: 0
Bucket 2: Resolution (344, 768), Count: 0
Bucket 3: Resolution (408, 768), Count: 0
Bucket 4: Resolution (472, 768), Count: 0
Bucket 5: Resolution (536, 768), Count: 0
Bucket 6: Resolution (600, 768), Count: 0
Bucket 7: Resolution (664, 768), Count: 0
Bucket 8: Resolution (728, 768), Count: 0
Bucket 9: Resolution (768, 216), Count: 0
Bucket 10: Resolution (768, 280), Count: 0
Bucket 11: Resolution (768, 344), Count: 0
Bucket 12: Resolution (768, 408), Count: 0
Bucket 13: Resolution (768, 472), Count: 0
Bucket 14: Resolution (768, 536), Count: 0
Bucket 15: Resolution (768, 600), Count: 0
Bucket 16: Resolution (768, 664), Count: 0
Bucket 17: Resolution (768, 728), Count: 0
Bucket 18: Resolution (768, 768), Count: 28
Sched breakpoint is 700
***** Running training *****
Instance Images: 14
Class Images: 140
Total Examples: 28
Num batches each epoch = 28
Num Epochs = 100
Batch Size Per Device = 1
Gradient Accumulation steps = 1
Total train batch size (w. parallel, distributed & accumulation) = 28
Total optimization steps = 1400
Total training steps = 2800
Resuming from checkpoint: False
First resume epoch: 0
First resume step: 0
Lora: True, Adam: True, Prec: fp16
Gradient Checkpointing: False, Text Enc Steps: 75.0
EMA: False
LR: 2e-06)
Steps: 0%| | 0/2800 [00:00<?, ?it/s]OOM Detected, reducing batch/grad size to 0/1.
Traceback (most recent call last):
File "G:\GitHub\SDWebUI\extensions\sd_dreambooth_extension\dreambooth\memory.py", line 86, in decorator
return function(batch_size, grad_size, *args, **kwargs)
File "G:\GitHub\SDWebUI\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 888, in inner_loop
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
File "G:\GitHub\SDWebUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "G:\GitHub\SDWebUI\venv\lib\site-packages\accelerate\utils\operations.py", line 490, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "G:\GitHub\SDWebUI\venv\lib\site-packages\torch\amp\autocast_mode.py", line 12, in decorate_autocast
return func(*args, **kwargs)
File "G:\GitHub\SDWebUI\venv\lib\site-packages\diffusers\models\unet_2d_condition.py", line 407, in forward
sample = upsample_block(
File "G:\GitHub\SDWebUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "G:\GitHub\SDWebUI\venv\lib\site-packages\diffusers\models\unet_2d_blocks.py", line 1202, in forward
hidden_states = resnet(hidden_states, temb)
File "G:\GitHub\SDWebUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "G:\GitHub\SDWebUI\venv\lib\site-packages\diffusers\models\resnet.py", line 474, in forward
hidden_states = self.conv2(hidden_states)
File "G:\GitHub\SDWebUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "G:\GitHub\SDWebUI\venv\lib\site-packages\torch\nn\modules\conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "G:\GitHub\SDWebUI\venv\lib\site-packages\torch\nn\modules\conv.py", line 453, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 8.00 GiB total capacity; 7.21 GiB already allocated; 0 bytes free; 7.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps: 0%| | 0/2800 [00:01<?, ?it/s]
Traceback (most recent call last):
File "G:\GitHub\SDWebUI\extensions\sd_dreambooth_extension\scripts\dreambooth.py", line 569, in start_training
result = main(config, use_subdir=use_subdir, lora_model=lora_model_name,
File "G:\GitHub\SDWebUI\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 1024, in main
return inner_loop()
File "G:\GitHub\SDWebUI\extensions\sd_dreambooth_extension\dreambooth\memory.py", line 84, in decorator
raise RuntimeError("No executable batch size found, reached zero.")
RuntimeError: No executable batch size found, reached zero.
Training completed, reloading SD Model.
Restored system models.
Returning result: Exception training model: No executable batch size found, reached zero.
```
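The OOM message above suggests trying `max_split_size_mb` via `PYTORCH_CUDA_ALLOC_CONF`. A minimal sketch of setting it is below; the value 128 is just an example, and the variable has to be in the environment before torch initializes CUDA, so in practice it is simpler to set it in webui-user.bat (`set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128`) than from Python.

```python
# Sketch only: PYTORCH_CUDA_ALLOC_CONF must be set before CUDA is initialized,
# so the environment variable goes in before importing torch.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # 128 MiB is an example value

import torch  # imported after setting the variable so the allocator picks it up
print(torch.cuda.get_device_name(0))  # forces CUDA init with the new allocator config
```

This only mitigates fragmentation; it will not help if training genuinely needs more than 8 GB at the chosen resolution and batch size.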
Getting this same issue; I can no longer train on 8GB, and after a lot of trial and error I end up with the same error as @DoughyInTheMiddle's last message. It was working perfectly with the previous version of the extension. I decided to hit update in the UI and, damn, never have I hated myself more: the whole layout of the Dreambooth tab changed, my configs no longer work, and it's OOM and errors everywhere. Now I can't continue training and have no idea how to revert to the previous version of the extension from when it was working. Can anyone help me figure out the commit where the extension UI layout was changed so I can roll back to it?
Did anyone solve the problem? haha
For me the only way to use the extension on an 8GB graphics card was to roll back to this commit; anything after that just doesn't work anymore, at least for me, and even the commit right after it throws errors. So I guess I'll be using this one until automatic's UI gets updated again and breaks it. Using Dreambooth with low resources isn't something any developer seems to care about, since they usually have good graphics cards and don't care about those of us with old or cheap ones, so I won't keep my hopes high on this being fixed in the latest version.
0901c17 will fix some of the VRAM issues added yesterday.
Has anyone got it working on 8 GB VRAM yet?
Nope, just did a fresh install. Still OOM.
I used this version and it worked for me; you would need to download the repo at that commit yourself and put it in the extensions folder:
https://github.com/d8ahazard/sd_dreambooth_extension/tree/c5cb58328c555ac27679422b1da940a9b19de6f2
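In case it saves someone a search, here is a rough sketch of pinning the extension to that commit, done through Python's subprocess so it stays in the same language as the rest of this thread; the extensions path is an assumption taken from the logs above, so adjust it to your own install.

```python
# Rough sketch: clone the Dreambooth extension and pin it to the commit linked above.
import subprocess
from pathlib import Path

REPO = "https://github.com/d8ahazard/sd_dreambooth_extension"
COMMIT = "c5cb58328c555ac27679422b1da940a9b19de6f2"
EXTENSIONS_DIR = Path(r"G:\GitHub\SDWebUI\extensions")  # example path from this thread

target = EXTENSIONS_DIR / "sd_dreambooth_extension"
subprocess.run(["git", "clone", REPO, str(target)], check=True)
subprocess.run(["git", "checkout", COMMIT], cwd=target, check=True)
# The checkout leaves a detached HEAD, so the extension stays pinned unless you
# later update it from the webui's Extensions tab.
```

Equivalent plain git commands in a terminal work just as well, of course.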
I get this error:
```
Dreambooth revision:
SD-WebUI revision:

Checking Dreambooth requirements...
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.14.dev0 installed.
[+] torch version 1.12.1+cu116 installed.
[+] torchvision version 0.13.1+cu116 installed.
#######################################################################################################
Installing requirements for dataset-tag-editor [onnxruntime-gpu]
Launching Web UI with arguments: --xformers
Traceback (most recent call last):
  File "I:\stablediffusion\stable-diffusion-webui\launch.py", line 252, in <module>
    start()
  File "I:\stablediffusion\stable-diffusion-webui\launch.py", line 243, in start
    import webui
  File "I:\stablediffusion\stable-diffusion-webui\webui.py", line 12, in <module>
    from modules import devices, sd_samplers, upscaler, extensions
  File "I:\stablediffusion\stable-diffusion-webui\modules\sd_samplers.py", line 11, in <module>
    from modules import prompt_parser, devices, processing, images
  File "I:\stablediffusion\stable-diffusion-webui\modules\processing.py", line 14, in <module>
    import modules.sd_hijack
  File "I:\stablediffusion\stable-diffusion-webui\modules\sd_hijack.py", line 10, in <module>
    import modules.textual_inversion.textual_inversion
  File "I:\stablediffusion\stable-diffusion-webui\modules\textual_inversion\textual_inversion.py", line 13, in <module>
    from modules import shared, devices, sd_hijack, processing, sd_models, images
  File "I:\stablediffusion\stable-diffusion-webui\modules\shared.py", line 15, in <module>
    import modules.sd_models
  File "I:\stablediffusion\stable-diffusion-webui\modules\sd_models.py", line 14, in <module>
    from modules.sd_hijack_inpainting import do_inpainting_hijack, should_hijack_inpainting
  File "I:\stablediffusion\stable-diffusion-webui\modules\sd_hijack_inpainting.py", line 6, in <module>
    import ldm.models.diffusion.ddpm
  File "I:\stablediffusion\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\diffusion\ddpm.py", line 12, in <module>
    import pytorch_lightning as pl
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\pytorch_lightning\__init__.py", line 34, in <module>
    from pytorch_lightning.callbacks import Callback  # noqa: E402
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\pytorch_lightning\callbacks\__init__.py", line 14, in <module>
    from pytorch_lightning.callbacks.callback import Callback
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\pytorch_lightning\callbacks\callback.py", line 25, in <module>
    from pytorch_lightning.utilities.types import STEP_OUTPUT
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\pytorch_lightning\utilities\types.py", line 28, in <module>
    from torchmetrics import Metric
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torchmetrics\__init__.py", line 14, in <module>
    from torchmetrics import functional  # noqa: E402
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torchmetrics\functional\__init__.py", line 77, in <module>
    from torchmetrics.functional.text.bleu import bleu_score
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torchmetrics\functional\text\__init__.py", line 30, in <module>
    from torchmetrics.functional.text.bert import bert_score  # noqa: F401
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torchmetrics\functional\text\bert.py", line 24, in <module>
    from torchmetrics.functional.text.helper_embedding_metric import (
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torchmetrics\functional\text\helper_embedding_metric.py", line 26, in <module>
    from transformers import AutoModelForMaskedLM, AutoTokenizer, PreTrainedModel, PreTrainedTokenizerBase
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\transformers\__init__.py", line 30, in <module>
    from . import dependency_versions_check
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\transformers\dependency_versions_check.py", line 17, in <module>
    from .utils.versions import require_version, require_version_core
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\transformers\utils\__init__.py", line 34, in <module>
    from .generic import (
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\transformers\utils\generic.py", line 33, in <module>
    import tensorflow as tf
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\__init__.py", line 37, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\__init__.py", line 45, in <module>
    from tensorflow.python.feature_column import feature_column_lib as feature_column
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\feature_column\feature_column_lib.py", line 18, in <module>
    from tensorflow.python.feature_column.feature_column import *
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\feature_column\feature_column.py", line 143, in <module>
    from tensorflow.python.layers import base
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\layers\base.py", line 16, in <module>
    from tensorflow.python.keras.legacy_tf_layers import base
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\keras\__init__.py", line 25, in <module>
    from tensorflow.python.keras import models
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\keras\models.py", line 19, in <module>
    from tensorflow.python.keras import backend
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\keras\backend.py", line 50, in <module>
    from tensorflow.python.keras.distribute import distribute_coordinator_utils as dc
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\keras\distribute\distribute_coordinator_utils.py", line 33, in <module>
    from tensorflow.python.training import monitored_session
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 22, in <module>
    from tensorflow.python.checkpoint import checkpoint as trackable_util
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\checkpoint\checkpoint.py", line 29, in <module>
    from tensorflow.python.checkpoint import functional_saver
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\checkpoint\functional_saver.py", line 185, in <module>
    class MultiDeviceSaver(object):
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\checkpoint\functional_saver.py", line 282, in MultiDeviceSaver
    def _traced_save(self, file_prefix):
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\eager\polymorphic_function\polymorphic_function.py", line 1611, in decorated
    decorator_func=Function(
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\eager\polymorphic_function\polymorphic_function.py", line 555, in __init__
    self._function_spec = function_spec_lib.FunctionSpec.from_function_and_signature(
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\eager\polymorphic_function\function_spec.py", line 140, in from_function_and_signature
    return FunctionSpec(
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\eager\polymorphic_function\function_spec.py", line 206, in __init__
    self._function_type = self._make_function_type()
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\eager\polymorphic_function\function_spec.py", line 272, in _make_function_type
    type_constraint = trace_type.from_value(
AttributeError: module 'tensorflow.core.function.trace_type' has no attribute 'from_value'
Press any key to continue . . .
```
Have you got Automatic1111 set up to update itself when you run it, using git pull?
Thanks a lot ovladuk for your reply. I was able to fix it by downloading Automatic1111 from gitgud, and I also used the version you mentioned; it worked and I don't get OOM any more.
Do you have --xformers installed? And are you enabling LoRA?
Yes, I have enabled xformers and enabled LoRA, and the training finished successfully. These are the parameters I used:
I have this error, and sometimes "Please configure some concepts." On the version indicated above everything works (only training on 2 and 2.1 does not work).
Kindly read the entire form below and fill it out with the requested information.
Please find the following lines in the console and paste them below. If you do not provide this information, your issue will be automatically closed.
```
Python revision: 3.9.13 (main, Aug 25 2022, 23:26:10) [GCC 11.2.0]
Dreambooth revision: 4ca69a904f5ddd5651d87032b3dca515eea505ba
SD-WebUI revision: e672cfb07418a1a3130d3bf21c14a0d3819f81fb

Checking Dreambooth requirements...
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.15+e163309.d20230101 installed.
[+] torch version 1.13.1 installed.
[+] torchvision version 0.14.1 installed.
```
Have you read the Readme? Yes
Have you completely restarted the stable-diffusion-webUI, not just reloaded the UI? Yes
Have you updated Dreambooth to the latest revision? Yes
Have you updated the Stable-Diffusion-WebUI to the latest version? Yes
No, really. Please save us both some trouble and update the SD-WebUI and Extension and restart before posting this. Reply 'OK' Below to acknowledge that you did this. OK

Describe the bug
Until today I've been able to consistently train using 8Bit Adam, LORA, fp16, xformers, and no cached latents. Today with the new updates, I'm getting OOM. My GPU is RTX 3070Ti 8GB Laptop.
Provide logs
If a crash has occurred, please provide the entire stack trace from the log, including the last few log messages before the crash occurred.
Environment
What OS?
Windows, WSL2
What GPU are you using?
RTX 3070Ti 8GB Laptop
Screenshots/Config
If the issue is specific to an error while training, please provide a screenshot of training parameters or the db_config.json file from /models/dreambooth/MODELNAME/db_config.json