Mikubill / sd-webui-controlnet

WebUI extension for ControlNet
GNU General Public License v3.0

[Bug]: RAM Memory leak issue - RAM consumption keeps increasing #990

Open ghpkishore opened 1 year ago

ghpkishore commented 1 year ago

What happened?

When using the extension without model cache enabled, after about 10 generations the SSH connection to my EC2 instance drops. This always happens during the build_controlnet_model function: the output "Loading model {model}" gets printed, but the "Loaded state_dict" line never does.

Steps to reproduce the problem

  1. Use AWS EC2 instance
  2. Start the program with bash webui.sh (no extra command-line args), turn the model cache setting off, apply settings and restart
  3. Generate images using img2img roughly 10 times
  4. Somewhere in between, the SSH connection gets disconnected right after printing "Loading model: [model name]" - this is not limited to one model; it happens across all models.

What should have happened?

The SSH connection should not have been closed. There seems to be some error here. It happens suddenly, without a consistent number of steps to reproduce, but it does fail. It doesn't fail with plain AUTOMATIC1111, so this is a sd-webui-controlnet problem.

Commit where the problem happens

webui: [22bcc7be] controlnet: 2270f364e167b9531daf9a8bd1d62cb2dbfa4d00

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

No

Console logs

Usually it is supposed to be:


Loading model: control_v11p_sd15_inpaint [ebff9138]
Loaded state_dict from [/home/ec2-user/stable-diffusion-webui/models/ControlNet/control_v11p_sd15_inpaint.pth]
Loading config: /home/ec2-user/stable-diffusion-webui/extensions/sd-webui-controlnet/models/control_v11p_sd15_inpaint.yaml
ControlNet model control_v11p_sd15_inpaint [ebff9138] loaded.

But while it fails it stops at:


Loading model: control_v11p_sd15_inpaint [ebff9138] 

The rest of the log isn't visible and SSH gets disconnected.

Additional information

It happens with model cache enabled as well, but only after many more tries.

lllyasviel commented 1 year ago

can you track your memory use

ghpkishore commented 1 year ago

I did. I still had close to 11GB of VRAM available.

Adding the type of log which I got:

==============NVSMI LOG==============

Timestamp           : Sun Apr 23 14:05:44 2023
Driver Version      : 515.65.01
CUDA Version        : 11.7

Attached GPUs       : 1
GPU 00000000:00:1E.0
  FB Memory Usage
    Total           : 15360 MiB
    Reserved        : 388 MiB
    Used            : 7109 MiB
    Free            : 7861 MiB
  BAR1 Memory Usage
    Total           : 256 MiB
    Used            : 5 MiB
    Free            : 251 MiB

The program is currently running, and I check every 5 seconds. When the SSH connection was lost, the free GPU memory was ~11500 MiB.

lllyasviel commented 1 year ago

please track your memory use, not GPU memory use.
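
For reference, one simple way to track system RAM over time is a small polling loop; the sketch below assumes the psutil package is installed and is not part of the webui or this extension.

    # Minimal RAM-tracking sketch (assumes `pip install psutil`); run it in a
    # separate terminal while generating images and watch whether usage climbs.
    import time
    import psutil

    while True:
        mem = psutil.virtual_memory()
        print(f"RAM: {mem.used / 2**30:.2f} / {mem.total / 2**30:.2f} GiB ({mem.percent}%)")
        time.sleep(5)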

ghpkishore commented 1 year ago

@lllyasviel SSH failed again when running with model cache. The prompt and other input params are below. I ran the same inputs with different seeds, and this time it failed on the 14th try. This is with cache on.

Handsome Indian man wearing red colour specs Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 562784315, Size: 512x512, Model hash: 6ce0161689, Model: v1-5-pruned-emaonly, Denoising strength: 0.75, Mask blur: 4, ControlNet-0 Enabled: True, ControlNet-0 Module: depth_zoe, ControlNet-0 Model: control_v11f1p_sd15_depth [cfd03158], ControlNet-0 Weight: 1, ControlNet-0 Guidance Start: 0, ControlNet-0 Guidance End: 1, ControlNet-1 Enabled: True, ControlNet-1 Module: canny, ControlNet-1 Model: control_v11p_sd15_canny [d14c016b], ControlNet-1 Weight: 1, ControlNet-1 Guidance Start: 0, ControlNet-1 Guidance End: 1, ControlNet-2 Enabled: True, ControlNet-2 Module: softedge_pidinet, ControlNet-2 Model: control_v11p_sd15_softedge [a8575a2a], ControlNet-2 Weight: 1, ControlNet-2 Guidance Start: 0, ControlNet-2 Guidance End: 1

processing | 138.8/7.0s Time taken: 27.07s Torch active/reserved: 5469/5814 MiB, Sys VRAM: 6836/14972 MiB (45.66%)

I kept running the same set of models and params again and again. It failed on the 14th try.

Console Log:

Loading model from cache: control_v11f1p_sd15_depth [cfd03158]██████████████| 16/16 [00:22<00:00,  1.48s/it]
Loading preprocessor: depth_zoe
Pixel Perfect Mode Enabled.
resize_mode = ResizeMode.RESIZE
raw_H = 585
raw_W = 585
target_H = 512
target_W = 512
estimation = 512.0
preprocessor resolution = 512
Loading model from cache: control_v11p_sd15_canny [d14c016b]
Loading preprocessor: canny
preprocessor resolution = 512
Loading model from cache: control_v11p_sd15_softedge [a8575a2a]
Loading preprocessor: pidinet
Pixel Perfect Mode Enabled.
resize_mode = ResizeMode.RESIZE
raw_H = 585
raw_W = 585
target_H = 512
target_W = 512
estimation = 512.0
preprocessor resolution = 512
  0%|                                                                                | 0/16 [00:00<?, ?it/s]
ghpkishore commented 1 year ago

Okay, when you say memory use, do you mean the system RAM and how much of it is being used?

lllyasviel commented 1 year ago

yes, RAM

ghpkishore commented 1 year ago

@lllyasviel you were right. There is an issue with the RAM. I have attached a screenshot of the memory usage below. Is there any way to fix this? I am not running any program other than ControlNet + AUTOMATIC1111. Basically, every time I click generate, the RAM usage increases. After it reached 90%, I tried once again and it failed. Please let me know how to proceed. Thanks. I am using a g4dn instance, so it has 16 GB RAM.

[Screenshot: system RAM usage, 2023-04-23 10:09 PM]

lllyasviel commented 1 year ago

it seems like a memory leak in some preprocessor

ghpkishore commented 1 year ago

If you have any idea on how to solve this please let me know, I will try to see if it fixes the issue.

I am also going to run some tests with different preprocessors to see which one might be the issue, and also try without any preprocessor to see whether the RAM usage still increases constantly. Will get back with those results.

ghpkishore commented 1 year ago

@lllyasviel it has something to do with the Low VRAM setting, as per my initial observation. I generated images back to back for 10 minutes with the canny edge model and preprocessor, and memory consumption was very stable. All the screenshots below share the same time axis on X.

[Screenshot: memory usage with canny, Low VRAM off, 2023-04-24 10:55 AM]

And then I switched on the Low VRAM setting and got the following memory consumption.

[Screenshot: memory usage with Low VRAM enabled, 2023-04-24 11:05 AM]

I had very similar behaviour with canny + depth_zoe when Low VRAM was switched on for depth_zoe.

[Screenshot: memory usage with canny + depth_zoe, Low VRAM on for depth_zoe, 2023-04-24 10:34 AM]

I will check again with the other preprocessors, without the Low VRAM setting, to see whether RAM consumption stays stable. Let me know your views.

lllyasviel commented 1 year ago

thanks for the data, we will take a look soon

ghost commented 1 year ago

just an idea: I had previously reported that for clip_vision there was a memory leak due to not using "with torch.no_grad():". That was for CN 1.0; I'm not sure if the fix has already been added, and it may apply to other annotators as well.

ghpkishore commented 1 year ago

@tkalayci71 can you explain where this might need to be added, so I can check? The file or folder, or any more information, would be very useful.

ghpkishore commented 1 year ago

@lllyasviel I have been running the code for more than an hour without the Low VRAM setting, and so far there has been no random increase in system RAM consumption at all. I strongly feel this might be an implementation issue in Low VRAM.

ghost commented 1 year ago

@ghpkishore it's in annotator/clip/__init__.py; the last 3 lines of the apply function need to be wrapped inside torch.no_grad(). But I wouldn't recommend modifying the code yourself; they'll probably solve it soon.

@lllyasviel by the way, see also: https://github.com/huggingface/transformers/issues/20636
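
For illustration, the torch.no_grad() pattern described above looks roughly like the sketch below. The function and argument names are placeholders rather than the actual contents of annotator/clip/__init__.py; the point is that without no_grad(), each inference call keeps activations alive for autograd, which shows up as steadily growing memory.

    # Schematic sketch of the no_grad fix (placeholder names, not extension code).
    import torch

    def apply_clip(clip_model, image_tensor):
        # Pure inference: disabling gradient tracking prevents PyTorch from
        # retaining intermediate activations between calls.
        with torch.no_grad():
            features = clip_model(image_tensor)
        return features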

ghpkishore commented 1 year ago

@lllyasviel similar to how the program aborts when VRAM runs out, is it possible to add a check for system RAM as well? As in: if system RAM usage exceeds 95%, kill the run?
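
A guard like the one suggested could look roughly like the following sketch, assuming psutil is available; the 95% threshold and the idea of calling it before each generation are taken from the comment above, not an existing feature of the extension.

    # Hypothetical RAM guard sketch; call it before each generation.
    import psutil

    RAM_ABORT_THRESHOLD = 95.0  # percent, as suggested above

    def abort_if_ram_critical():
        percent = psutil.virtual_memory().percent
        if percent > RAM_ABORT_THRESHOLD:
            # Failing the generation is preferable to the whole instance
            # swapping to death and dropping the SSH session.
            raise RuntimeError(f"System RAM at {percent:.1f}%, aborting generation.")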

mykeehu commented 1 year ago

I wrote here that even if I turn ControlNet off by clearing the Enabled checkbox, the model is not removed from VRAM; it still holds +3 GB! https://github.com/vladmandic/automatic/discussions/386#discussioncomment-5762338

lllyasviel commented 1 year ago

I do not think CN has a VRAM leak problem. When PyTorch moves a model out of the GPU, it does not clear that VRAM; it just marks it as unoccupied, and any other code can use it even though it still looks occupied in an OS monitor. But CN may have some RAM issue, and we will take a look, considering our workloads.
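
What is being described here is PyTorch's caching CUDA allocator: freed tensors return memory to PyTorch's internal pool rather than to the driver, so tools like nvidia-smi keep reporting it as used. A small sketch of how to observe this, assuming a CUDA-enabled PyTorch install:

    import torch

    x = torch.zeros(256, 1024, 1024, device="cuda")      # roughly 1 GiB of float32
    print(torch.cuda.memory_allocated() / 2**20, "MiB allocated")
    print(torch.cuda.memory_reserved() / 2**20, "MiB reserved by the allocator")

    del x  # 'allocated' drops, but 'reserved' (what nvidia-smi shows) stays high
    print(torch.cuda.memory_allocated() / 2**20, "MiB allocated after del")
    print(torch.cuda.memory_reserved() / 2**20, "MiB still reserved")

    torch.cuda.empty_cache()  # hand cached blocks back to the driver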

bghira commented 1 year ago

a possible test would be to mock the pytorch modules so that they perform no-ops, or basic trivial operations, while measuring memory use.

if we can measure the memory use while running pytorch under a profiler, and then measure it again with pytorch mocked out, that would possibly help. but i don't know enough about the internals to pull this off.
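
A minimal version of this mocking idea might look like the sketch below, with placeholder functions standing in for the real pipeline and annotator; comparing peak heap growth with the heavy call active versus mocked into a no-op helps narrow down where the memory goes.

    # Sketch only: placeholder functions, not actual webui/ControlNet code.
    import tracemalloc
    from unittest.mock import patch
    import numpy as np

    def heavy_preprocessor(img):
        return np.random.rand(4096, 4096)   # stand-in for a real annotator

    def pipeline(img):
        return heavy_preprocessor(img)

    def measure_peak(fn):
        """Peak Python-heap growth (bytes) caused by fn()."""
        tracemalloc.start()
        fn()
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        return peak

    real_peak = measure_peak(lambda: pipeline(None))
    with patch("__main__.heavy_preprocessor", return_value=None):   # no-op stub
        mocked_peak = measure_peak(lambda: pipeline(None))

    print(f"peak heap, real preprocessor : {real_peak / 2**20:.1f} MiB")
    print(f"peak heap, mocked no-op      : {mocked_peak / 2**20:.1f} MiB")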

hikmet-koyuncu commented 1 year ago

Hi,

I created a highly optimized version of ControlNet v1.1.232. You can use this version with 4 GB VRAM, with up to 2 ControlNet units (Multi ControlNet) and Hires. fix. All added and changed parts are marked with "Hikmet Koyuncu".

Extract "webui" directory on your AUTOMATIC1111 "webui" directory and overwrite files.

You must first convert your ControlNet preprocessor and ControlNet models to fp16 format.

For ControlNet models you can use my edited "extract_controlnet.py" file with the "--half" and "--convert" arguments.

For ControlNet preprocessors (annotators) you can use the "convert_controlnet_preprocessor_fp16.py" file.

Example:

python.exe "convert_controlnet_preprocessor_fp16.py" --src "myPreprocessor.pth" --dst "myPreprocessor_fp16.pth"

Link: https://www.mediafire.com/file/ihjr4gcg2wy2fm1/Optimized_ControlNet_v1.1.232_by_Hikmet_Koyuncu.zip/file
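
The linked archive and its scripts are not reproduced here; conceptually, the fp16 conversion amounts to casting the floating-point tensors in a checkpoint to half precision. A generic, hypothetical sketch of such a converter:

    # Generic fp32 -> fp16 checkpoint conversion sketch (not the actual
    # convert_controlnet_preprocessor_fp16.py from the linked archive).
    import argparse
    import torch

    parser = argparse.ArgumentParser()
    parser.add_argument("--src", required=True, help="input fp32 .pth file")
    parser.add_argument("--dst", required=True, help="output fp16 .pth file")
    args = parser.parse_args()

    state_dict = torch.load(args.src, map_location="cpu")
    # Some checkpoints nest the weights under a "state_dict" key.
    if isinstance(state_dict, dict) and "state_dict" in state_dict:
        state_dict = state_dict["state_dict"]

    half = {
        k: v.half() if isinstance(v, torch.Tensor) and v.is_floating_point() else v
        for k, v in state_dict.items()
    }
    torch.save(half, args.dst)
    print(f"Saved {len(half)} tensors in fp16 to {args.dst}")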

hungtooc commented 11 months ago

@lllyasviel, I've been having RAM problems for a long time, and recently it became quite serious after I increased the number of ControlNet units. Specifically:

I tried adding 10 GB of swap memory, but RAM still fills up soon. This is a metric tracking RAM usage (in percent) over the last 7 days: [screenshot]

P.S.: I'm pretty sure the problem is ControlNet, because I have another server that doesn't use ControlNet; it generates images continuously and never has the full-RAM problem.

hikmet-koyuncu commented 11 months ago

ControlNet loads models into VRAM but does not remove them, so your VRAM usage increases each time. I published a fixed version.

hungtooc commented 11 months ago

ControlNet loads models into VRAM but does not remove them, so your VRAM usage increases each time. I published a fixed version.

Hi @hikmet-koyuncu, my VRAM is fine, but the RAM is not in my case. As the title of this issue says, this is a RAM problem. Thanks.

hikmet-koyuncu commented 11 months ago

Yes. After image creation ControlNet moves some models from VRAM to RAM (some models it doesn't, which is a bug), but it never removes them. I fixed this problem.

hungtooc commented 11 months ago

Yes. After image creation ControlNet moves some models from VRAM to RAM (some models it doesn't, which is a bug), but it never removes them. I fixed this problem.

Hi @hikmet-koyuncu, after updating the ControlNet extension to https://github.com/Mikubill/sd-webui-controlnet/commit/fce6775a6dddef52ecd658259e909687d9dedf72, the memory leak issue is still not resolved. More specifically, here is how I use ControlNet via the API:

hikmet-koyuncu commented 11 months ago

Hi,

I added "Broom" icon. If you click it, RAM and VRAM will be clear.

I don't want to clear RAM every time, because that can slow down the workflow. When you get a RAM error, click the broom icon. It clears VRAM and RAM, and prints the current RAM and VRAM amounts in the console window.

And if you are using fp32 models and have a small amount of RAM, you should use fp16 models instead. You can convert fp32 models to fp16; I shared a Python program for that too.

I am using 16 GB RAM and 4 GB VRAM, and I can use 2 ControlNet units at the same time.
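
The "broom" behaviour described above, i.e. manually releasing cached models and allocator memory, generally boils down to dropping the Python references and asking PyTorch to return its cached blocks. A generic sketch of that idea, not the code from the linked archive:

    import gc
    import torch

    def clear_memory(model_cache: dict):
        """Generic 'broom' sketch: drop cached models, then reclaim RAM and VRAM."""
        model_cache.clear()                # drop references so the weights can be freed
        gc.collect()                       # reclaim host RAM from unreferenced objects
        if torch.cuda.is_available():
            torch.cuda.empty_cache()       # return cached CUDA blocks to the driver
            print(f"CUDA allocated: {torch.cuda.memory_allocated() / 2**20:.0f} MiB")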

hungtooc commented 11 months ago

Hi @hikmet-koyuncu, please make a fork or contribute to this repo so I can take a look at your code.

hikmet-koyuncu commented 11 months ago

Hi,

I don't know how to use GitHub very well. When I have free time, I will learn. I can send you my edited version of "ControlNet 1.1.232"; I added a "Hikmet Koyuncu" comment on each changed part.

hungtooc commented 11 months ago

Hi @hikmet-koyuncu, the code you uploaded to MediaFire seems to be old (2023-07-18), and it's missing some code, so I can't run it yet. Can you upload the full update?

hikmet-koyuncu commented 11 months ago

Yes, because I uploaded it long ago and nobody cared about it. I am still using this version.

hungtooc commented 11 months ago

Hi @lllyasviel,
It's been quite a while since this issue was reported; can you share some tools or a direction of investigation to find out which part is leaking memory?

nchaly commented 9 months ago

@hungtooc I thought it could be related to parameter changes, but most likely the leak occurs after adding several ControlNet units. I've made a screen capture (https://drive.google.com/file/d/1l78ZkVJQx3E4S2Q9i61fasOJTt1feIZ0/view?usp=sharing); it starts leaking at around the 3:15 mark.

huchenlei commented 9 months ago

@nchaly Thanks for the reproduction of the issue! I am going to take a deeper look into this.

huchenlei commented 9 months ago

I added some tracemalloc profiling code. First run log:

2023-12-30 22:05:23,691 - ControlNet - INFO - After generation:███████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:15<00:00,  1.35it/s]
2023-12-30 22:05:24,021 - ControlNet - INFO - D:\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\controlnet.py:843: size=11.9 MiB (+11.9 MiB), count=5 (+5), average=2430 KiB
2023-12-30 22:05:24,022 - ControlNet - INFO - D:\stable-diffusion-webui\modules\processing.py:908: size=1728 KiB (+1728 KiB), count=2 (+2), average=864 KiB
2023-12-30 22:05:24,024 - ControlNet - INFO - D:\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\controlnet.py:1150: size=1728 KiB (+1728 KiB), count=2 (+2), average=864 KiB
2023-12-30 22:05:24,026 - ControlNet - INFO - D:\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\processor.py:14: size=910 KiB (+910 KiB), count=4 (+4), average=228 KiB
2023-12-30 22:05:24,028 - ControlNet - INFO - C:\Users\hcl\AppData\Local\Programs\Python\Python310\lib\linecache.py:137: size=732 KiB (+732 KiB), count=7122 (+7122), average=105 B
2023-12-30 22:05:24,031 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:461: size=316 KiB (+316 KiB), count=1497 (+1497), average=216 B
2023-12-30 22:05:24,031 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:468: size=309 KiB (+309 KiB), count=1734 (+1734), average=183 B
2023-12-30 22:05:24,032 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:458: size=286 KiB (+286 KiB), count=2470 (+2470), average=118 B
2023-12-30 22:05:24,033 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:473: size=187 KiB (+187 KiB), count=1497 (+1497), average=128 B
2023-12-30 22:05:24,036 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:472: size=187 KiB (+187 KiB), count=1497 (+1497), average=128 B

After first generation

2023-12-30 22:07:36,321 - ControlNet - INFO - After generation:███████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:14<00:00,  1.26it/s]
2023-12-30 22:07:36,439 - ControlNet - INFO - D:\stable-diffusion-webui\modules\processing.py:908: size=1728 KiB (+1728 KiB), count=2 (+2), average=864 KiB
2023-12-30 22:07:36,440 - ControlNet - INFO - D:\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\controlnet.py:1150: size=1728 KiB (+1728 KiB), count=2 (+2), average=864 KiB
2023-12-30 22:07:36,440 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:468: size=189 KiB (+189 KiB), count=995 (+995), average=195 B
2023-12-30 22:07:36,441 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:461: size=179 KiB (+179 KiB), count=847 (+847), average=216 B
2023-12-30 22:07:36,441 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:1622: size=156 KiB (+156 KiB), count=144 (+144), average=1112 B
2023-12-30 22:07:36,442 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:458: size=148 KiB (+148 KiB), count=1319 (+1319), average=115 B
2023-12-30 22:07:36,442 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:473: size=106 KiB (+106 KiB), count=847 (+847), average=128 B
2023-12-30 22:07:36,442 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:472: size=106 KiB (+106 KiB), count=847 (+847), average=128 B
2023-12-30 22:07:36,443 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:471: size=106 KiB (+106 KiB), count=847 (+847), average=128 B
2023-12-30 22:07:36,443 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:470: size=106 KiB (+106 KiB), count=847 (+847), average=128 B

The first generation caches the preprocessor result, so you see an 11 MB increase, but it does not happen again after the first run. I do not see anything else significant enough to be considered an observable memory leak.
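
The profiling output above comes from comparing tracemalloc snapshots taken around each generation. A minimal version of that pattern, not the extension's actual --controlnet-tracemalloc implementation, looks like this:

    # Minimal tracemalloc snapshot-diff sketch, in the spirit of the log above.
    import tracemalloc

    tracemalloc.start()
    _previous = tracemalloc.take_snapshot()

    def log_heap_growth(top_n=10):
        """Print the source lines whose Python-heap usage grew since the last call."""
        global _previous
        current = tracemalloc.take_snapshot()
        for stat in current.compare_to(_previous, "lineno")[:top_n]:
            print(stat)    # e.g. "...controlnet.py:843: size=11.9 MiB (+11.9 MiB) ..."
        _previous = current

    # Call log_heap_growth() after each generation and watch for lines that keep
    # growing run after run; one-time jumps (caches) are usually benign.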

huchenlei commented 9 months ago

@nchaly I cannot reproduce your result locally. When I turn on multiple ControlNet units, the memory usage of the A1111 process fully recovers after each generation.

Can you run your local setup with the --controlnet-tracemalloc command line arg?

nchaly commented 9 months ago

@huchenlei thank you for looking into this.

I've updated ControlNet to the latest main version and I'm able to reproduce the issue. A1111 is on the latest master branch too.

Here is the summary of what I do:

  1. generate base image without ControlNet.
  2. add ControlNet unit 0, with "canny" setup, generate image once, then second time - everything is fine here.
  3. add ControlNet unit 1, with "depth" setup, generate image once.
  4. generate second image without any changes.

Step 4 is where the extra loading happens. [screenshot]

I presume that adding a second unit somehow impacts the caching mechanism.

If I filter the log to only "Loading model" lines, it is suspicious that after the first use of "canny" the model is loaded "from cache", but after adding the second unit, "loading from cache" is no longer logged:

# step 2 first time
2023-12-31 12:54:54,121 - ControlNet - INFO - Loading model: control_sd15_canny [fef5e48e]
# step 2 second time
2023-12-31 12:55:06,385 - ControlNet - INFO - Loading model from cache: control_sd15_canny [fef5e48e]

# step 3
2023-12-31 12:55:33,246 - ControlNet - INFO - Loading model from cache: control_sd15_canny [fef5e48e]
2023-12-31 12:55:33,450 - ControlNet - INFO - Loading model: coadapter-depth-sd15v1 [93aff3ab]

# step 4 and subsequent generations.
2023-12-31 12:55:47,014 - ControlNet - INFO - Loading model: control_sd15_canny [fef5e48e]
2023-12-31 12:55:50,920 - ControlNet - INFO - Loading model: coadapter-depth-sd15v1 [93aff3ab]
2023-12-31 12:56:03,842 - ControlNet - INFO - Loading model: control_sd15_canny [fef5e48e]
2023-12-31 12:56:07,614 - ControlNet - INFO - Loading model: coadapter-depth-sd15v1 [93aff3ab]

Here is the full log:

log2.txt

hikmet-koyuncu commented 9 months ago

I already wrote about and fixed this issue in my example version. ControlNet loads models and does not remove them from VRAM and RAM, and 32-bit models take up a lot of space in both. If models are removed from RAM and VRAM after image creation, if 16-bit models are used instead of 32-bit ones, and if a button is added to clear RAM and VRAM (I did all of this in my example version), this problem is fixed.

huchenlei commented 9 months ago

(Quoting nchaly's reproduction steps and log from the comment above.)

Thanks for the log message. I think the log here is normal, as by default control_net_model_cache_size is set to 1. When you load the depth model, the canny model gets ejected from the cache. Can you try setting control_net_model_cache_size to 2 and see if it makes any difference? @nchaly

    shared.opts.add_option("control_net_model_cache_size", shared.OptionInfo(
        1, "Model cache size (requires restart)", gr.Slider, {"minimum": 1, "maximum": 10, "step": 1}, section=section))
nchaly commented 9 months ago

Yep, that helps, now both load from cache.

Soulreaver90 commented 7 months ago

Bumping this up. I've been using FaceID and noticed a terrible memory leak when using a lot of photos in multi-input. If I use a few, the RAM seems to bounce back just fine. However, when I use a ton and continually run new batches, my RAM tanks and it eventually resorts to using my swapfile. Once both RAM and swapfile are used up, my computer hard-locks completely. Without the swapfile, several generations will hard-lock my PC. I've been using the webui for over a year and only experienced this issue when running ControlNet, specifically FaceID with multi-input.