AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

[Bug]: Memory leak (CUDA out of memory) with my RTX A6000 (48GB vram) #8998

Closed Kobeb33f closed 1 year ago

Kobeb33f commented 1 year ago

Is there an existing issue for this?

What happened?

Hi everyone,

I'm totally out of ideas on how to solve my issue.

I have an RTX A6000. I could usually generate 2048x2048 images with no problem in Stable Diffusion. For the past few days I can't even generate 1024x1024 images; I get this error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 GiB (GPU 0; 47.99 GiB total capacity; 18.36 GiB already allocated; 10.48 GiB free; 34.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF Time taken: 5.53sTorch active/reserved: 35120/35224 MiB, Sys VRAM: 38455/49140 MiB (78.26%)

This is strange because I'm only using Stable Diffusion, and only 2 or 3 GB of VRAM are in use before I generate an image.

I didn't install anything new in the last day except the kohya LoRA trainer, in another folder.

Any ideas?

Thank you for your help

Steps to reproduce the problem

Generate any image at 1024x1024 or larger.

What should have happened?

I could usually generate 2048x2048 images, but not for the past few days.

Commit where the problem happens

python: 3.10.9  •  torch: 1.13.1+cu117  •  xformers: N/A  •  gradio: 3.16.2  •  commit: 0cc0ee1b  •  checkpoint: f28a232119

What platforms do you use to access the UI ?

Windows

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

Usually none, but these: --xformers --precision full --no-half --opt-sub-quad-attention --opt-split-attention-v1 --disable-nan-check --api allow me to generate at 1024x1024, although performance is lower than usual.
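
(For reference, these flags are normally set through COMMANDLINE_ARGS in webui-user.bat; a minimal sketch, with the other variables left at their defaults:)

@echo off
set PYTHON=
set GIT=
set VENV_DIR=
rem flags that currently allow 1024x1024 generation, at the cost of speed
set COMMANDLINE_ARGS=--xformers --precision full --no-half --opt-sub-quad-attention --opt-split-attention-v1 --disable-nan-check --api
call webui.bat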

List of extensions

openOutpaint
openpose-editor
posex
sd-webui-controlnet
sd-webui-depth-lib
stable-diffusion-webui-composable-lora
stable-diffusion-webui-two-shot
LDSR
Lora
ScuNET
SwinIR
prompt-bracket-checker

Console logs

venv "C:\Users\Admin\stable-diffusion-webui\venv\Scripts\Python.exe"
Python 3.10.9 (tags/v3.10.9:1dd9be6, Dec  6 2022, 20:01:21) [MSC v.1934 64 bit (AMD64)]
Commit hash: 0cc0ee1bcb4c24a8c9715f66cede06601bfc00c8
Installing requirements for Web UI
Launching Web UI with arguments:
No module 'xformers'. Proceeding without it.
Loading weights [f28a232119] from C:\Users\Admin\stable-diffusion-webui\models\Stable-diffusion\anything-v4.0-pruned-fp32.safetensors
Creating model from config: C:\Users\Admin\stable-diffusion-webui\configs\v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Applying cross attention optimization (Doggettx).
Textual inversion embeddings loaded(0):
Model loaded in 6.1s (create model: 1.0s, apply weights to model: 1.5s, apply half(): 0.9s, move model to device: 1.1s, load textual inversion embeddings: 1.3s).
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
  0%|                                                                                           | 0/20 [00:07<?, ?it/s]
Error completing request
Arguments: ('task(mn3n1zmenwq9fcj)', 'little wolf in the forest', '', [], 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 1024, 1024, False, 0.7, 2, 'Latent', 0, 0, 0, [], 0, False, False, 'positive', 'comma', 0, False, False, '', 1, '', 0, '', 0, '', True, False, False, False, 0) {}
Traceback (most recent call last):
  File "C:\Users\Admin\stable-diffusion-webui\modules\call_queue.py", line 56, in f
    res = list(func(*args, **kwargs))
  File "C:\Users\Admin\stable-diffusion-webui\modules\call_queue.py", line 37, in f
    res = func(*args, **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui\modules\txt2img.py", line 56, in txt2img
    processed = process_images(p)
  File "C:\Users\Admin\stable-diffusion-webui\modules\processing.py", line 486, in process_images
    res = process_images_inner(p)
  File "C:\Users\Admin\stable-diffusion-webui\modules\processing.py", line 632, in process_images_inner
    samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts)
  File "C:\Users\Admin\stable-diffusion-webui\modules\processing.py", line 832, in sample
    samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
  File "C:\Users\Admin\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 349, in sample
    samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
  File "C:\Users\Admin\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 225, in launch_sampling
    return func()
  File "C:\Users\Admin\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 349, in <lambda>
    samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
  File "C:\Users\Admin\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 145, in sample_euler_ancestral
    denoised = model(x, sigmas[i] * s_in, **extra_args)
  File "C:\Users\Admin\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 117, in forward
    x_out = self.inner_model(x_in, sigma_in, cond={"c_crossattn": [cond_in], "c_concat": [image_cond_in]})
  File "C:\Users\Admin\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 112, in forward
    eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 138, in get_eps
    return self.inner_model.apply_model(*args, **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui\modules\sd_hijack_utils.py", line 17, in <lambda>
    setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
  File "C:\Users\Admin\stable-diffusion-webui\modules\sd_hijack_utils.py", line 28, in __call__
    return self.__orig_func(*args, **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 858, in apply_model
    x_recon = self.model(x_noisy, t, **cond)
  File "C:\Users\Admin\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 1329, in forward
    out = self.diffusion_model(x, t, context=cc)
  File "C:\Users\Admin\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\openaimodel.py", line 781, in forward
    h = module(h, emb, context)
  File "C:\Users\Admin\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\openaimodel.py", line 84, in forward
    x = layer(x, context)
  File "C:\Users\Admin\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\attention.py", line 324, in forward
    x = block(x, context=context[i])
  File "C:\Users\Admin\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\attention.py", line 259, in forward
    return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)
  File "C:\Users\Admin\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\util.py", line 114, in checkpoint
    return CheckpointFunction.apply(func, len(inputs), *args)
  File "C:\Users\Admin\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\util.py", line 129, in forward
    output_tensors = ctx.run_function(*ctx.input_tensors)
  File "C:\Users\Admin\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\attention.py", line 262, in _forward
    x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None) + x
  File "C:\Users\Admin\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui\modules\sd_hijack_optimizations.py", line 129, in split_cross_attention_forward
    s2 = s1.softmax(dim=-1, dtype=q.dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 GiB (GPU 0; 47.99 GiB total capacity; 18.36 GiB already allocated; 10.48 GiB free; 34.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Additional information

No response

setothegreat commented 1 year ago

Can confirm this. On my 3090 I can only generate a single image with the hires fix before I get an OOM error when trying to generate any more images above 512x512. Of note is that in previous versions of the WebUI I could usually generate an image at 512x512 with multiple ControlNet models and use only 7 GB of VRAM per generation, which would be cleared immediately after generation was complete. Now, after only a single image generation, my VRAM usage never goes below 10 GB, even when idle, until I close the terminal window.

Seems like a classic memory leak, and this latest revision of WebUI has a ton of errors beyond just this. Would recommend just reverting to an older commit until all this is fixed.

notkmikiya commented 1 year ago

I had this issue as well and couldn't create any hires images at 512x512 even though it was working perfectly fine before.

After taking a look at it for a long time, messing with different optimization settings, and bashing my head into the source code wall, I found something that helped me out.

It seems like there may be an issue with the --no-half-vae setting: it causes Stable Diffusion to suddenly want more VRAM when it finishes processing an image. The jump is tremendous; most of the time it asks for an additional half or all of your VRAM. I'm not sure if --no-half works the same way, but try removing it.

Sometimes, you may get this error if you don't have --no-half-vae. modules.devices.NansException: A tensor with all NaNs was produced in VAE. This could be because there's not enough precision to represent the picture. Try adding --no-half-vae commandline argument to fix this. Use --disable-nan-check commandline argument to disable this check.

I used the --disable-nan-check and it worked out for me. If you're getting a black image, it might be because the checkpoint you're using doesn't like what you're trying to do. For example, I tried to make a mecha checkpoint using the prompt 1girl and just got a black screen.

If you're still just a little shy of having enough memory, which is ridiculous considering your VRAM, you can try putting this somewhere: PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128

I have it in my webui-user.bat, but I'm not sure where you'd put that if you run it from the command line like you are doing.
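
(If launching from a plain command prompt instead of webui-user.bat, one option is to set the variable in the same cmd session before starting the UI; a minimal sketch, assuming the standard launch script:)

rem in the same cmd window, before launching the web UI
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128
webui-user.bat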

Lastly, as you said, using --opt-sub-quad-attention and --opt-split-attention-v1 can help you get it running, but I don't think you need --opt-split-attention-v1 if you use --opt-sub-quad-attention. When using --opt-sub-quad-attention you can also specify the chunk sizes, which lets you allocate memory according to how much you have.

Here's an example of the webui-user.bat that I'm using, optimized for my GTX 970.

@echo off

set PYTHON=C:\Users\KMikiya\AppData\Local\Programs\Python\Python310\python.exe
set GIT=
set VENV_DIR=
set COMMANDLINE_ARGS=--medvram --disable-nan-check --opt-sub-quad-attention --sub-quad-q-chunk-size=128 --sub-quad-kv-chunk-size=128 --sub-quad-chunk-threshold=80
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128

git pull

call webui.bat

Like this, I'm able to put out hires images with a base resolution of 512x768 with only 4 GB of VRAM and no real hit to my already slow processing speed.

Since you have so much VRAM, you can probably raise the values of --sub-quad-q-chunk-size=128 and --sub-quad-kv-chunk-size=128 considerably. Maybe something like 1024 or 2048 on both would be optimal for you?
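
(Purely as an illustration of that suggestion, not a tested configuration: the 2048 chunk sizes and the 512 split size below are assumptions for a 48 GB card, with --medvram dropped since it shouldn't be needed at that VRAM size. The relevant webui-user.bat lines might look like this:)

set COMMANDLINE_ARGS=--disable-nan-check --opt-sub-quad-attention --sub-quad-q-chunk-size=2048 --sub-quad-kv-chunk-size=2048 --sub-quad-chunk-threshold=80
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:512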

Hopefully, this helps you out, as I know how frustrating it was messing around with all these settings to get Stable Diffusion to run as it did before some of the recent updates.

Kobeb33f commented 1 year ago

(Replying to @setothegreat's comment above.)

Thank you for your help. I will try rolling back to a previous version and keep this post updated with the result.

Kobeb33f commented 1 year ago

(Replying to @notkmikiya's comment above.)

Thank you @notkmikiya for your answer. Usually I don't use any command-line arguments to launch Stable Diffusion.

Kobeb33f commented 1 year ago

@setothegreat: So I tried it, but I still have the same issue. Somehow adding the --opt-sub-quad-attention argument solves it, but I don't know why I need to add it now.

notkmikiya commented 1 year ago

@Kobeb33f What commit did you roll back to? If it's been going on for a few days, you might want to go back to something around a9fed7c364061ae6efb37f797b6b522cb3cf7aa2. That was back on March 14th, which seems to be around when a lot of this started.

If --opt-sub-quad-attention is solving your issue, there's probably a problem with the default memory allocation, possibly also the garbage collection.

Kobeb33f commented 1 year ago

@notkmikiya Yes, I tried rolling back to a9fed7c.

garyakimoto commented 1 year ago

(Quoting @Kobeb33f above.)

Did it work? Thanks.

Kobeb33f commented 1 year ago

No, sorry, it is not working. The only thing working so far is to add --opt-sub-quad-attention, but it reduces GPU performance.

garyakimoto commented 1 year ago

(Quoting @Kobeb33f above.)

Thank you for your report; hope someone can fix it.

Lalimec commented 1 year ago

Same here. I was able to generate ~10 768x768 images, and now I can't even generate 4.

notkmikiya commented 1 year ago

@Kobeb33f Ok, managed to get ahold of a 2060 12GB to test some stuff out. Here are some results of the testing.

When testing on this: python: 3.10.6  •  torch: 1.13.1+cu117  •  xformers: 0.0.16rc425  •  gradio: 3.23.0  •  commit: [955df775](https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/955df7751eef11bb7697e2d77f6b8a6226b21e13)

I was able to get it working with a base resolution of 1024x1024, controlnet, lora, and using hires/fix at 2x using these settings in my webui-user.bat.

set COMMANDLINE_ARGS=--xformers
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512

The usage of vram was about 7.5GB on average with peaks near 12GB.

I tried changing the split size to 1024, which also worked, but it made no difference in performance for me. This may make a difference for you since you have more VRAM. From the looks of it, it's possibly a PyTorch garbage collection issue. With these settings, my VRAM usage would drop to around 3GB after every image. This was after testing repeatedly for about an hour using "Repeat Forever" on generation.
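
(For anyone who wants to confirm whether reserved memory is actually being released between generations, here is a rough sketch using standard torch.cuda calls, nothing webui-specific; run it in the same Python environment:)

import torch

def report_vram(tag=""):
    # allocated = tensors currently live; reserved = what PyTorch's caching allocator is holding on to
    alloc = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"{tag} allocated: {alloc:.2f} GiB, reserved: {reserved:.2f} GiB")

report_vram("before generation")
# ... generate an image here ...
torch.cuda.empty_cache()  # ask the allocator to hand unused cached blocks back to the driver
report_vram("after empty_cache")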

On the other hand, when I tried using Torch 2 and --opt-sdp-attention/--opt-sdp-no-mem-attention instead of --xformers, it was a bit faster but ate more memory. I wasn't able to make a base image of 1024x1024, controlnet, lora, and hires/fix. It kept wanting about 6GB more vram than I had. It might work for you though since you have more vram. python: 3.10.6  •  torch: 2.0.0+cu118  •  xformers: N/A  •  gradio: 3.23.0  •  commit: [955df775](https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/955df7751eef11bb7697e2d77f6b8a6226b21e13)

set COMMANDLINE_ARGS=--opt-sdp-no-mem-attention
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512

I was able to get it working using --medvram and some other settings, but the hit to the speed was pretty big.

set COMMANDLINE_ARGS=--medvram --disable-nan-check --opt-sub-quad-attention --sub-quad-q-chunk-size=128 --sub-quad-kv-chunk-size=128 --sub-quad-chunk-threshold=80
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128

For people with low VRAM (not Kobeb33f), it's probably better to use fp16 checkpoints instead of fp32 whenever you can, as it will save you some VRAM and increase your it/s.
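
(If someone wants to convert an existing fp32 .safetensors checkpoint to fp16 themselves, here is a sketch using the safetensors library; the output file name is a placeholder and the converted model should be spot-checked before relying on it:)

import torch
from safetensors.torch import load_file, save_file

src = "anything-v4.0-pruned-fp32.safetensors"   # the fp32 checkpoint from the log above
dst = "anything-v4.0-pruned-fp16.safetensors"   # placeholder output name

state = load_file(src)
# cast floating-point tensors to half precision, leave everything else untouched
state = {k: (v.half() if v.dtype in (torch.float32, torch.float64) else v) for k, v in state.items()}
save_file(state, dst)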

Lastly, I noticed a lot of extensions broke over the last few updates, so disabling all except controlnet might help some. At least until they are updated and aren't causing issues.

garyakimoto commented 1 year ago

(Replying to @notkmikiya's test results above.)

Thanks. So for a 3090 with 24 GB of VRAM, does that mean I can use this to optimize? Any suggestions?

set COMMANDLINE_ARGS=--opt-sdp-no-mem-attention
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512

Thank you

notkmikiya commented 1 year ago

(Replying to @garyakimoto's question above.)

If you are using torch: 2.0.0+cu118, I would recommend giving it a try; it's what I'm using and it's working great.

If you are using torch: 1.13.1+cu117 you can try using

set COMMANDLINE_ARGS=--xformers
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512

in your webui-user.bat.

If you don't want to use xformers, then you can probably take it off and still be alright.

To find out what version of torch you're using, look at the bottom of the Stable Diffusion page when you load it up. It should tell you there.
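
(Alternatively, it can be checked directly from the web UI's own venv; a quick sketch, run from the stable-diffusion-webui folder on Windows:)

venv\Scripts\python.exe -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"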

garyakimoto commented 1 year ago

(Replying to @notkmikiya above.)

Thank you so much, I will try it later. Thanks

Kobeb33f commented 1 year ago

Hi @notkmikiya ,

thank you so much for your answer !!

I tried this:

set COMMANDLINE_ARGS=--xformers
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512

I got an error:

ValueError: Query/Key/Value should all have the same dtype 
query.dtype: torch.float32 
key.dtype : torch.float32 
value.dtype: torch.float16 

So I had to uncheck a setting in the Stable Diffusion config:

Settings > Stable Diffusion >Upcast cross attention layer to float32

and... it works perfectly, and very fast!!

Thank you !!

notkmikiya commented 1 year ago

(Replying to @Kobeb33f above.)

Glad that helped you out!

It doesn't fix the fact that Stable Diffusion is eating more and more vram than before for no known reason, but at least you're back on track for making things.

Gcode808 commented 1 year ago

(Quoting @Kobeb33f's fix above.)

After attempting this, it appeared to function for roughly 35 minutes before ultimately crashing. This issue only emerged recently, as it occurred for the first time yesterday. I had been batch-processing images at 768x1024 resolution for weeks without any problems until then. After discovering this thread, I have been trying to resolve the issue and get the processing to run smoothly again. While this solution appeared to work briefly, the crashing has resumed. For context, I am utilizing a GeForce RTX 3060 Twin Edge OC 12GB GDDR6. I have bookmarked this page in the hopes that a resolution will be available soon.

notkmikiya commented 1 year ago

(Replying to @Gcode808 above.)

There seem to be memory leaks or over-allocation of memory in some areas, with a lack of good garbage collection overall. It wants to allocate about a third of my VRAM or more after an image processes. The spike in allocated memory just kills it for me.

Considering you're doing 768x1024, you're probably right at the edge of your VRAM after this month's updates. Here's how it looks for me on a 2060 with 12GB of VRAM making a 768x1024 image. It's all nice and easy until the end... then the hires fix part basically kills it.

[VRAM usage screenshots]

Arguments: ('task(zit1d60bztvorxc)', 'SORCERESS,\nWITCH HAT,\nWITCH DRESS, \n<lora:sorceressDragonsCrown_v10:1>', '', ['0 Easy Deep Negative'], 20, 16, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 1024, 768, True, 0.5, 2, 'Remacri Upscaler', 0, 0, 0, [], 0, True, False, 1, False, False, False, 1.1, 1.5, 100, 0.7, False, False, True, False, False, 0, 'Gustavosta/MagicPrompt-Stable-Diffusion', '', <scripts.external_code.ControlNetUnit object at 0x00000234EB148310>, <scripts.external_code.ControlNetUnit object at 0x00000234EB14BE50>, <scripts.external_code.ControlNetUnit object at 0x00000234EB148430>, False, False, 'positive', 'comma', 0, False, False, '', 1, '', 0, '', 0, '', True, False, False, False, 0, None, False, None, False, None, False, 50) {}
Traceback (most recent call last):
  File "C:\Users\KMikiya\stable-diffusion-webui\modules\call_queue.py", line 56, in f
    res = list(func(*args, **kwargs))
  File "C:\Users\KMikiya\stable-diffusion-webui\modules\call_queue.py", line 37, in f
    res = func(*args, **kwargs)
  File "C:\Users\KMikiya\stable-diffusion-webui\modules\txt2img.py", line 56, in txt2img
    processed = process_images(p)
  File "C:\Users\KMikiya\stable-diffusion-webui\modules\processing.py", line 503, in process_images
    res = process_images_inner(p)
  File "C:\Users\KMikiya\stable-diffusion-webui\modules\processing.py", line 653, in process_images_inner
    samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts)
  File "C:\Users\KMikiya\stable-diffusion-webui\modules\processing.py", line 922, in sample
    samples = self.sd_model.get_first_stage_encoding(self.sd_model.encode_first_stage(decoded_samples))
  File "C:\Users\KMikiya\stable-diffusion-webui\modules\sd_hijack_utils.py", line 17, in <lambda>
    setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
  File "C:\Users\KMikiya\stable-diffusion-webui\modules\sd_hijack_utils.py", line 28, in __call__
    return self.__orig_func(*args, **kwargs)
  File "C:\Users\KMikiya\stable-diffusion-webui\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\KMikiya\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 830, in encode_first_stage
    return self.first_stage_model.encode(x)
  File "C:\Users\KMikiya\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\autoencoder.py", line 83, in encode
    h = self.encoder(x)
  File "C:\Users\KMikiya\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\KMikiya\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\model.py", line 536, in forward
    h = self.mid.attn_1(h)
  File "C:\Users\KMikiya\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\KMikiya\stable-diffusion-webui\modules\sd_hijack_optimizations.py", line 498, in sdp_no_mem_attnblock_forward
    return sdp_attnblock_forward(self, x)
  File "C:\Users\KMikiya\stable-diffusion-webui\modules\sd_hijack_optimizations.py", line 490, in sdp_attnblock_forward
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.50 GiB (GPU 0; 12.00 GiB total capacity; 10.93 GiB already allocated; 0 bytes free; 11.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The weird thing is, it feels like it's double allocating the memory. So instead of hitting you with that 4.50 GiB like it says, it's hitting you with a good 9.0 GiB in total.

If I want to get past that, I can use --xformers instead of --opt-sdp-no-mem-attention, which seems to be what you did, but the VRAM is still tremendously close to that 12GB limit.

You'll probably end up having to use --medvram if you want to have fewer crashes. You can also try messing around with some of these settings and tweak them to slightly higher levels to not lose too much performance.

set COMMANDLINE_ARGS=--xformers --medvram --opt-sub-quad-attention --sub-quad-q-chunk-size=128 --sub-quad-kv-chunk-size=128 --sub-quad-chunk-threshold=80
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128

Until they get a fix for some of the memory leaks and heavy memory allocation, it's either that or rolling back to an earlier version?

garyakimoto commented 1 year ago

I think this is really an Automatic1111 version problem, because when I tried kohya_ss it really sped up my training and image creation and I can set batch size > 1, but with Automatic1111, if I set batch size > 1, it always gets errors.

garyakimoto commented 1 year ago

(Quoting another user:) I've tried everything and I'm getting this error too. There is something wonky in the program that starts collecting stray memory. I've downloaded torch 2 and 2.1 with cu117 and cu118 in all combinations. I've installed one extension at a time, and for a hot minute there I thought it was ControlNet because it was the only thing our builds had in common; but no. That particular install from the main branch just crashed and burned immediately, whatever the underlying issue is. I read somewhere else in "Issues" that this problem started happening around March 14, but also found one from February 20 and another from November of last year with the same issue. I don't know if any of the devs are trying to fix it, or if they simply can't.

Right now my only advice would be to keep a zip of the code, keep track of all your settings, keep backups of your tensors in a separate folder, set everything up in such a way that you aren't dependent on the webui file system, and when this happens again you'll simply need to trash the entire build and do a fresh install; because even emptying the cache from the command line doesn't work.

I hope someone fixes this problem soon; I am feeling very frustrated with this whole situation.

Side note: I just wish it didn't download the v1.5 SD tensor every time I install it. Maybe as a stopgap someone knows how to prevent that from happening, to save us all some precious time?

Are you on a 3090 or better? Because below a 3090 you can't get that high a batch size.

Why don't you redownload all of them in a new folder and try again?

This is my system version: python: 3.10.7 torch: 1.13.1+cu117 xformers: 0.0.17.dev464

I was stuck here before and fixed it like this: change accelerate==0.12.0 inside "requirements_versions.txt" to accelerate==0.16.0. After the change, you need to close everything and start again.
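
(Concretely, that is a one-line edit in requirements_versions.txt, with the version numbers as quoted above; whether it helps will depend on your setup:)

# requirements_versions.txt
# before:
accelerate==0.12.0
# after:
accelerate==0.16.0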

hope this can help you.

superfid2006 commented 1 year ago

I was led to this topic after I got an E-Mail informing me that an issue I participated in was mentioned: https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/8756#discussioncomment-5387184

The OP mentioned having Composable LoRA installed. I suggest getting rid of it, and maybe also sd-webui-locon (though it's better to remove only the first one to start). I had trouble with assigned VRAM ballooning over time, which necessitated restarts from time to time. After getting rid of this extension, I no longer had this problem.

https://github.com/opparco/stable-diffusion-webui-composable-lora https://github.com/KohakuBlueleaf/a1111-sd-webui-locon

Gcode808 commented 1 year ago

Hope this helps: I set my virtual memory size to 'System managed'. As I mentioned previously, SD just stopped working for me and I thought it was a memory leak in Stable Diffusion; however, I was getting a similar error when trying to install oobabooga. To address the insufficient-memory issue on Windows, you can set the virtual memory size to 'System managed'.

This involves configuring the pagefile.sys file, which is used as temporary storage when the system runs out of RAM. By setting the virtual memory size to 'System managed', Windows can automatically adjust the size of the pagefile.sys file based on the available system resources.

To set the virtual memory size to 'System managed', follow the standard Windows path: open Advanced system settings, go to the Advanced tab, click Settings under Performance, open the Advanced tab there, click Change under Virtual memory, then either tick "Automatically manage paging file size for all drives" or select your system drive and choose "System managed size", confirm with OK, and restart.

Since doing this, I haven't had SD crash, and I was also able to get oobabooga installed with the Vicuna model. Just thought I'd update here. This is a solution you can try if you're still having CUDA out-of-memory errors.

dev-dedev commented 1 year ago

Just curious - what's the it/s you get during benchmark?