ghpkishore opened this issue 1 year ago
can you track your memory use
I did. I still had close to 11GB of VRAM available.
Adding the type of log which I got:
==============NVSMI LOG==============
Timestamp         : Sun Apr 23 14:05:44 2023
Driver Version    : 515.65.01
CUDA Version      : 11.7
Attached GPUs     : 1
GPU 00000000:00:1E.0
    FB Memory Usage
        Total     : 15360 MiB
        Reserved  : 388 MiB
        Used      : 7109 MiB
        Free      : 7861 MiB
    BAR1 Memory Usage
        Total     : 256 MiB
        Used      : 5 MiB
        Free      : 251 MiB
Currently the program is running, and I test every 5 seconds. When the SSH connection got lost, the free memory was ~11500 MiB.
please track your memory use, not GPU memory use.
@lllyasviel SSH Failed Again when running from model cache. The prompt and other input params are below. I ran the same inputs with different seeds and it failed the 14th time this time around. This is with cache on.
Handsome Indian man wearing red colour specs Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 562784315, Size: 512x512, Model hash: 6ce0161689, Model: v1-5-pruned-emaonly, Denoising strength: 0.75, Mask blur: 4, ControlNet-0 Enabled: True, ControlNet-0 Module: depth_zoe, ControlNet-0 Model: control_v11f1p_sd15_depth [cfd03158], ControlNet-0 Weight: 1, ControlNet-0 Guidance Start: 0, ControlNet-0 Guidance End: 1, ControlNet-1 Enabled: True, ControlNet-1 Module: canny, ControlNet-1 Model: control_v11p_sd15_canny [d14c016b], ControlNet-1 Weight: 1, ControlNet-1 Guidance Start: 0, ControlNet-1 Guidance End: 1, ControlNet-2 Enabled: True, ControlNet-2 Module: softedge_pidinet, ControlNet-2 Model: control_v11p_sd15_softedge [a8575a2a], ControlNet-2 Weight: 1, ControlNet-2 Guidance Start: 0, ControlNet-2 Guidance End: 1
processing | 138.8/7.0s
Time taken: 27.07s
Torch active/reserved: 5469/5814 MiB, Sys VRAM: 6836/14972 MiB (45.66%)
Kept running the same set of models and params again and again. Then it failed on the 14th try.
Console Log:
100%|██████████| 16/16 [00:22<00:00, 1.48s/it]
Loading model from cache: control_v11f1p_sd15_depth [cfd03158]
Loading preprocessor: depth_zoe
Pixel Perfect Mode Enabled.
resize_mode = ResizeMode.RESIZE
raw_H = 585
raw_W = 585
target_H = 512
target_W = 512
estimation = 512.0
preprocessor resolution = 512
Loading model from cache: control_v11p_sd15_canny [d14c016b]
Loading preprocessor: canny
preprocessor resolution = 512
Loading model from cache: control_v11p_sd15_softedge [a8575a2a]
Loading preprocessor: pidinet
Pixel Perfect Mode Enabled.
resize_mode = ResizeMode.RESIZE
raw_H = 585
raw_W = 585
target_H = 512
target_W = 512
estimation = 512.0
preprocessor resolution = 512
0%| | 0/16 [00:00<?, ?it/s]
Okay, when you say memory use, do you mean system RAM and the total amount of space in it?
yes, RAM
@lllyasviel you were right. There is an issue with the RAM. I have attached a screenshot of the memory usage below. Is there any way to fix this? I am not running any program other than ControlNet + AUTOMATIC1111. Basically, every time I click generate, the RAM usage increases. After it reached 90%, I tried once again and it failed. Please let me know how to proceed. Thanks. I am using a g4dn instance, so it has 16 GB RAM.
It seems there is a memory leak in some preprocessor.
If you have any idea on how to solve this please let me know, I will try to see if it fixes the issue.
I am also going to run some tests with different preprocessors to see which might be the issue. I will also try without any preprocessors to see whether that avoids the constant increase in RAM usage. Will get back with those results.
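One way to run such a test (a hedged sketch; `run_preprocessor` here is a hypothetical stand-in for whichever preprocessor is being exercised, not the extension's code) is to call the suspect function in a loop and record the process's peak resident set size after each call:

```python
import resource


def run_preprocessor(data):
    # Hypothetical stand-in for a ControlNet preprocessor call.
    return [x * 2 for x in data]


def rss_kib():
    # Peak resident set size of this process; reported in KiB on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def watch_for_growth(fn, iterations=50):
    """Call fn repeatedly and record peak RSS after each call."""
    samples = []
    for _ in range(iterations):
        fn(list(range(1000)))
        samples.append(rss_kib())
    return samples


samples = watch_for_growth(run_preprocessor)
# A stable preprocessor should plateau quickly; a leaking one keeps
# pushing the peak higher on every iteration.
print("first:", samples[0], "KiB  last:", samples[-1], "KiB")
```

Plotting the samples (or just eyeballing first vs. last) makes a leak obvious without needing an external monitor.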
@lllyasviel It has something to do with the Low VRAM setting, as per my initial observation. I generated images back to back for 10 minutes with the canny edge model and preprocessor. It had very stable memory consumption. All the images below share the same time axis on X.
And then I switched on the Low VRAM setting and got the following memory consumption.
I had very similar behaviour with canny + depth_zoe when Low VRAM was switched on for depth_zoe.
I will be checking again with other preprocessors without the Low VRAM setting to see whether the RAM consumption stays stable. Let me know your views.
Thanks for the data, we will take a look soon.
Just an idea: I had previously reported that for clip_vision there was a memory leak due to not using "with torch.no_grad():". This was for CN 1.0; I'm not sure if it's already been added, and it may apply to other annotators.
@tkalayci71 can you explain where this might need to be added so I can check? The file or folder, or any more information, would prove very useful.
@lllyasviel I have been running the code for more than an hour without the Low VRAM setting, and so far there has been no issue at all with random increases in system RAM consumption. I strongly feel this might be an implementation issue of Low VRAM.
@ghpkishore it's in annotator/clip/__init__.py; the last 3 lines of the apply function need to be wrapped inside torch.no_grad. But I wouldn't recommend modifying the code; they'll probably solve it soon.
@lllyasviel by the way, see also: https://github.com/huggingface/transformers/issues/20636
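For reference, the fix pattern being described is to run the annotator's forward pass under `torch.no_grad()`, so PyTorch does not build and retain an autograd graph across calls. A minimal sketch (the tiny model and `apply` function here are illustrative, not the extension's actual code):

```python
import torch


class TinyAnnotator(torch.nn.Module):
    # Illustrative stand-in for an annotator network such as clip_vision.
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(8, 8)

    def forward(self, x):
        return self.proj(x)


model = TinyAnnotator()


def apply(x):
    # Wrapping inference in no_grad means no computation graph is kept
    # alive, so intermediate activations can be freed immediately instead
    # of accumulating across repeated generations.
    with torch.no_grad():
        return model(x)


out = apply(torch.randn(1, 8))
print(out.requires_grad)  # False: nothing retained for backprop
```

Without the `no_grad` block, each call would produce outputs that hold references back through the graph, which is exactly the kind of retention a leak report like this points at.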
@lllyasviel similar to how the program is killed if VRAM runs out, is it possible to add a check for system RAM as well? As in, if system RAM usage exceeds 95%, kill the program?
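Such a guard could be sketched (as an assumption, not an existing webui feature) by reading `/proc/meminfo` on Linux and bailing out before the OOM killer or an SSH drop hits:

```python
import sys


def ram_used_percent(meminfo_path="/proc/meminfo"):
    # Parse MemTotal / MemAvailable (both in kB) from the Linux meminfo file.
    fields = {}
    with open(meminfo_path) as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.strip().split()[0])
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


def check_ram_guard(threshold=95.0):
    percent = ram_used_percent()
    if percent > threshold:
        # Abort cleanly instead of letting the whole host lock up.
        sys.exit(f"RAM usage {percent:.1f}% exceeded {threshold}% guard")
    return percent


print(f"current RAM usage: {check_ram_guard():.1f}%")
```

Calling `check_ram_guard()` once per generation would be cheap; the threshold value is a judgment call, since exiting at 95% still needs enough headroom for the current generation to finish or abort.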
I wrote here that even after I turn off ControlNet by clearing the Enabled checkbox, the model is not deleted from VRAM; it still holds +3 GB! https://github.com/vladmandic/automatic/discussions/386#discussioncomment-5762338
I do not think CN has a VRAM leak problem. If PyTorch moves a model out of the GPU, it will not clear the VRAM; it just marks that VRAM as unoccupied, and other code can use it even though it looks occupied in the OS monitor. But CN may have some RAM issue, and we will take a look, considering our workloads.
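PyTorch's caching allocator behaves this way by design: freed tensors return memory to the allocator's pool rather than to the driver, so `nvidia-smi` keeps counting it against the process. A small sketch of how to observe the difference (requires a CUDA device; it just prints a note otherwise):

```python
import torch

if torch.cuda.is_available():
    x = torch.empty(256, 1024, 1024, device="cuda")  # ~1 GiB of fp32
    del x
    # "allocated" drops back toward zero, but "reserved" stays high:
    # the pool keeps the blocks, and the OS monitor still shows them
    # as belonging to the process.
    print("allocated:", torch.cuda.memory_allocated())
    print("reserved: ", torch.cuda.memory_reserved())
    torch.cuda.empty_cache()  # hand cached blocks back to the driver
    print("reserved after empty_cache:", torch.cuda.memory_reserved())
else:
    print("no CUDA device; nothing to demonstrate")
```

This is why "VRAM still occupied" in an OS monitor is not by itself evidence of a leak; the allocated/reserved split is the number to watch.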
A possible test would be to mock the PyTorch modules so that they perform no-ops, or basic trivial operations, while measuring memory use.
If we can measure the memory use while running PyTorch under a profiler, and then measure it again while mocking PyTorch, it would possibly help. But I don't know enough about the internals to pull this off.
Hi,
I created a highly optimized ControlNet v1.1.232 version. You can use this version with 4 GB VRAM with up to 2 Multi-ControlNet units and Hires. fix. All added and changed parts are signed with "Hikmet Koyuncu".
Extract the "webui" directory over your AUTOMATIC1111 "webui" directory and overwrite the files.
You must first convert your ControlNet preprocessor and ControlNet models to fp16 format.
For ControlNet models you can use my edited "extract_controlnet.py" file. You must use the "--half" and "--convert" arguments.
For ControlNet preprocessors (annotators) you can use the "convert_controlnet_preprocessor_fp16.py" file.
Example:
python.exe "convert_controlnet_preprocessor_fp16.py" --src "myPreprocessor.pth" --dst "myPreprocessor_fp16.pth"
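The conversion itself amounts to loading a checkpoint's state dict, casting the floating-point tensors to half precision, and saving it back. A sketch of the idea (this is an assumption about what such a script does, not hikmet-koyuncu's actual code):

```python
import torch


def convert_to_fp16(src: str, dst: str):
    # Load on CPU so no GPU is needed for the conversion.
    state = torch.load(src, map_location="cpu")
    # Only floating-point tensors are cast; integer buffers (step
    # counters, index maps) keep their original dtype.
    half = {
        k: v.half() if isinstance(v, torch.Tensor) and v.is_floating_point() else v
        for k, v in state.items()
    }
    torch.save(half, dst)


# Round-trip demo with a toy "checkpoint" instead of a real .pth file.
torch.save({"w": torch.randn(4, 4), "step": torch.tensor(3)}, "toy.pth")
convert_to_fp16("toy.pth", "toy_fp16.pth")
print(torch.load("toy_fp16.pth")["w"].dtype)  # torch.float16
```

Halving the checkpoint's size this way roughly halves the RAM/VRAM needed to hold each model, which is why fp16 conversion helps on small instances.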
@lllyasviel, I've been having RAM problems for a long time, and recently it became quite serious since I increased the number of ControlNet modules. Specifically, with the inpaint and depth modules, after about 17 hours of continuously creating images, my server would be full of RAM (32G RAM). After I added the scribble module (total: inpaint, depth, scribble), then after about 2 hours of continuously creating images, my server will be full of RAM (32G RAM). I tried adding 10G of swap memory, but it's still full of RAM soon.
This is a metric that tracks RAM usage in percentage in last 7 days:
P/s: I'm pretty sure the problem is controlNet, because I have another server that doesn't use controlNet, it always creates images continuously but still doesn't have the problem of full RAM.
ControlNet loads models into VRAM but does not remove them. So each time, your VRAM usage increases. I published a fixed version.
Hi @hikmet-koyuncu, my VRAM is fine, but RAM is not, in my case. As the title of this issue says, this is a RAM problem. Thanks.
Yes. ControlNet moves some models from VRAM to RAM after image creation (some models not; that is a bug), but never removes them. I fixed this problem.
Hi @hikmet-koyuncu, after updating the ControlNet extension to https://github.com/Mikubill/sd-webui-controlnet/commit/fce6775a6dddef52ecd658259e909687d9dedf72, the memory leak issue is still not resolved. More specifically, here is how I use ControlNet via the API:
"alwayson_scripts": {
"controlnet": {
"args": [
{
"module": "inpaint_only",
"model": "control_v11p_sd15_inpaint [ebff9138]",
"control_mode": "ControlNet is more important"
}
]
}
}
"alwayson_scripts": {
"controlnet": {
"args": [
{
"module": "depth",
"model": "control_v11f1p_sd15_depth [cfd03158]",
"control_mode": "ControlNet is more important"
}
]
}
}
"alwayson_scripts": {
"controlnet": {
"args": [
{
"module": "none",
"model": "control_v11p_sd15_scribble [d4ba51ff]",
"control_mode": "ControlNet is more important"
}
]
}
}
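For reference, a payload like the ones above is posted to the webui's txt2img endpoint. A minimal sketch of the call (the endpoint path follows the standard A1111 API; the host/port and prompt are assumptions):

```python
import json
import urllib.request

payload = {
    "prompt": "a house",
    "steps": 20,
    "alwayson_scripts": {
        "controlnet": {
            "args": [
                {
                    "module": "inpaint_only",
                    "model": "control_v11p_sd15_inpaint [ebff9138]",
                    "control_mode": "ControlNet is more important",
                }
            ]
        }
    },
}

req = urllib.request.Request(
    "http://127.0.0.1:7860/sdapi/v1/txt2img",  # assumed local webui address
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment against a running webui instance
print("payload ready:", list(payload["alwayson_scripts"]))
```

Repeating such requests in a loop while alternating the ControlNet unit (inpaint, depth, scribble) reproduces the workload described in this report.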
logs:
2023-11-07 04:17:03,611 - ControlNet - INFO - Loading model from cache: control_v11p_sd15_inpaint [ebff9138]
2023-11-07 04:17:03,620 - ControlNet - WARNING - A1111 inpaint and ControlNet inpaint duplicated. ControlNet support enabled.
2023-11-07 04:17:03,621 - ControlNet - INFO - Loading preprocessor: inpaint
2023-11-07 04:17:03,621 - ControlNet - INFO - preprocessor resolution = -1
2023-11-07 04:17:03,689 - ControlNet - INFO - ControlNet Hooked - Time = 0.0996100902557373
100%|██████████| 22/22 [00:02<00:00, 8.60it/s]
Total progress: 100%|██████████| 22/22 [00:02<00:00, 7.91it/s]
2023-11-07 04:17:10,561 - ControlNet - INFO - Loading model: control_v11p_sd15_scribble [d4ba51ff]
2023-11-07 04:17:15,289 - ControlNet - INFO - Loaded state_dict from [/app/extensions/sd-webui-controlnet/models/control_v11p_sd15_scribble.pth]
2023-11-07 04:17:15,289 - ControlNet - INFO - controlnet_default_config
2023-11-07 04:17:17,925 - ControlNet - INFO - ControlNet model control_v11p_sd15_scribble [d4ba51ff] loaded.
2023-11-07 04:17:18,009 - ControlNet - INFO - Loading preprocessor: none
2023-11-07 04:17:18,009 - ControlNet - INFO - preprocessor resolution = -1
2023-11-07 04:17:18,039 - ControlNet - INFO - ControlNet Hooked - Time = 7.499300956726074
100%|██████████| 25/25 [00:03<00:00, 7.98it/s]
Total progress: 100%|██████████| 25/25 [00:03<00:00, 7.32it/s]
2023-11-07 04:17:21,814 - ControlNet - INFO - Loading model from cache: control_v11p_sd15_inpaint [ebff9138]
2023-11-07 04:17:21,823 - ControlNet - WARNING - A1111 inpaint and ControlNet inpaint duplicated. ControlNet support enabled.
2023-11-07 04:17:21,824 - ControlNet - INFO - Loading preprocessor: inpaint
2023-11-07 04:17:21,824 - ControlNet - INFO - preprocessor resolution = -1
2023-11-07 04:17:21,898 - ControlNet - INFO - ControlNet Hooked - Time = 0.1059107780456543
100%|██████████| 22/22 [00:02<00:00, 10.14it/s]
Total progress: 100%|██████████| 22/22 [00:02<00:00, 9.52it/s]
100%|██████████| 19/19 [00:03<00:00, 5.12it/s]
Total progress: 100%|██████████| 19/19 [00:03<00:00, 5.00it/s]
100%|██████████| 13/13 [00:00<00:00, 13.21it/s]
Total progress: 100%|██████████| 13/13 [00:01<00:00, 9.21it/s]
2023-11-07 04:17:30,698 - ControlNet - INFO - Loading model from cache: control_v11p_sd15_inpaint [ebff9138]
2023-11-07 04:17:30,706 - ControlNet - WARNING - A1111 inpaint and ControlNet inpaint duplicated. ControlNet support enabled.
2023-11-07 04:17:30,707 - ControlNet - INFO - Loading preprocessor: inpaint
2023-11-07 04:17:30,707 - ControlNet - INFO - preprocessor resolution = -1
2023-11-07 04:17:30,785 - ControlNet - INFO - ControlNet Hooked - Time = 0.11857032775878906
Hi,
I added a "Broom" icon. If you click it, RAM and VRAM will be cleared.
I don't want to clear RAM every time, because this can slow our workflow. When you get a RAM error, click the broom icon. This clears VRAM and RAM, and prints the current RAM and VRAM amounts in the DOS console window.
And if you are using fp32 models and have a small amount of RAM, you should use fp16 models. You can convert fp32 models to fp16 models; I shared this Python program too.
I am using 16 GB RAM and 4 GB VRAM, and I can use 2 ControlNet units at the same time.
Hi @hikmet-koyuncu, please make a fork or contribute to this repo and I can take a look at your code
Hi,
I don't know much about using GitHub. When I have free time, I will learn. I can send you my edited version of "ControlNet 1.1.232". I added the comment "Hikmet Koyuncu" on each changed part.
Hi @hikmet-koyuncu, the code you uploaded to Mediafire seems to be old (2023-07-18), and it's missing some code, so I can't run it yet. Can you upload the full update?
Yes, because I uploaded it long ago, but nobody cared about this. I am still using this version.
Hi @lllyasviel,
It's been quite a while since this issue was reported. Can you share some tools or a direction of investigation to find out which part is leaking memory?
@hungtooc I thought it could be related to parameter changes, but most likely the leak occurs after adding several ControlNet units. I've made a screen capture - https://drive.google.com/file/d/1l78ZkVJQx3E4S2Q9i61fasOJTt1feIZ0/view?usp=sharing - it starts leaking after the 3:15 mark or so.
@nchaly Thanks for the reproduction of the issue! I am going to take a deeper look into this.
I added some tracemalloc profiling code. First run log:
100%|██████████| 20/20 [00:15<00:00, 1.35it/s]
2023-12-30 22:05:23,691 - ControlNet - INFO - After generation:
2023-12-30 22:05:24,021 - ControlNet - INFO - D:\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\controlnet.py:843: size=11.9 MiB (+11.9 MiB), count=5 (+5), average=2430 KiB
2023-12-30 22:05:24,022 - ControlNet - INFO - D:\stable-diffusion-webui\modules\processing.py:908: size=1728 KiB (+1728 KiB), count=2 (+2), average=864 KiB
2023-12-30 22:05:24,024 - ControlNet - INFO - D:\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\controlnet.py:1150: size=1728 KiB (+1728 KiB), count=2 (+2), average=864 KiB
2023-12-30 22:05:24,026 - ControlNet - INFO - D:\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\processor.py:14: size=910 KiB (+910 KiB), count=4 (+4), average=228 KiB
2023-12-30 22:05:24,028 - ControlNet - INFO - C:\Users\hcl\AppData\Local\Programs\Python\Python310\lib\linecache.py:137: size=732 KiB (+732 KiB), count=7122 (+7122), average=105 B
2023-12-30 22:05:24,031 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:461: size=316 KiB (+316 KiB), count=1497 (+1497), average=216 B
2023-12-30 22:05:24,031 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:468: size=309 KiB (+309 KiB), count=1734 (+1734), average=183 B
2023-12-30 22:05:24,032 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:458: size=286 KiB (+286 KiB), count=2470 (+2470), average=118 B
2023-12-30 22:05:24,033 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:473: size=187 KiB (+187 KiB), count=1497 (+1497), average=128 B
2023-12-30 22:05:24,036 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:472: size=187 KiB (+187 KiB), count=1497 (+1497), average=128 B
After first generation
100%|██████████| 20/20 [00:14<00:00, 1.26it/s]
2023-12-30 22:07:36,321 - ControlNet - INFO - After generation:
2023-12-30 22:07:36,439 - ControlNet - INFO - D:\stable-diffusion-webui\modules\processing.py:908: size=1728 KiB (+1728 KiB), count=2 (+2), average=864 KiB
2023-12-30 22:07:36,440 - ControlNet - INFO - D:\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\controlnet.py:1150: size=1728 KiB (+1728 KiB), count=2 (+2), average=864 KiB
2023-12-30 22:07:36,440 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:468: size=189 KiB (+189 KiB), count=995 (+995), average=195 B
2023-12-30 22:07:36,441 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:461: size=179 KiB (+179 KiB), count=847 (+847), average=216 B
2023-12-30 22:07:36,441 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:1622: size=156 KiB (+156 KiB), count=144 (+144), average=1112 B
2023-12-30 22:07:36,442 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:458: size=148 KiB (+148 KiB), count=1319 (+1319), average=115 B
2023-12-30 22:07:36,442 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:473: size=106 KiB (+106 KiB), count=847 (+847), average=128 B
2023-12-30 22:07:36,442 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:472: size=106 KiB (+106 KiB), count=847 (+847), average=128 B
2023-12-30 22:07:36,443 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:471: size=106 KiB (+106 KiB), count=847 (+847), average=128 B
2023-12-30 22:07:36,443 - ControlNet - INFO - d:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:470: size=106 KiB (+106 KiB), count=847 (+847), average=128 B
The first generation caches the preprocessor result, so you see an 11 MB increase, but it does not happen again after the first run. I do not see anything else significant enough to be considered an observable memory leak.
@nchaly I cannot reproduce your result locally. When I turn on multiple ControlNet units, the memory usage of the A1111 process fully recovers after each generation.
Can you run your local setup with the --controlnet-tracemalloc commandline arg?
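The kind of tracemalloc diffing shown in the logs above can be reproduced with a few lines of stdlib code (a generic sketch of the technique, not the extension's exact instrumentation behind --controlnet-tracemalloc):

```python
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# Simulate one "generation" that holds on to memory between runs.
leaky_cache = []
leaky_cache.append(bytearray(1024 * 1024))  # 1 MiB retained

after = tracemalloc.take_snapshot()
top = after.compare_to(baseline, "lineno")
for stat in top[:3]:
    # Each line shows file:lineno, net size change, and block counts,
    # in the same shape as the ControlNet log output above.
    print(stat)
```

Taking a snapshot after every generation and diffing against the previous one is what makes a growing allocation site stand out as a `(+N KiB)` line that never stops climbing.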
@huchenlei thank you for looking into this.
I've updated controlnet to latest main version - I'm able to reproduce the issue. A1111 is latest master branch too.
Here is the summary of what I do:
1. Generate a base image without ControlNet.
2. Add ControlNet unit 0 with a "canny" setup; generate an image once, then a second time - everything is fine here.
3. Add ControlNet unit 1 with a "depth" setup; generate an image once.
4. Generate a second image without any changes.
Step 4 here is where the extra loading is happening.
I presume that adding the second unit somehow impacts the caching mechanism.
If I filter the log to only the "loading model" lines, it is suspicious that after the first usage of "canny" the model is loaded "from cache", but after adding the second unit, "loading from cache" is no longer logged:
# step 2 first time
2023-12-31 12:54:54,121 - ControlNet - INFO - Loading model: control_sd15_canny [fef5e48e]
# step 2 second time
2023-12-31 12:55:06,385 - ControlNet - INFO - Loading model from cache: control_sd15_canny [fef5e48e]
# step 3
2023-12-31 12:55:33,246 - ControlNet - INFO - Loading model from cache: control_sd15_canny [fef5e48e]
2023-12-31 12:55:33,450 - ControlNet - INFO - Loading model: coadapter-depth-sd15v1 [93aff3ab]
# step 4 and subsequent generations.
2023-12-31 12:55:47,014 - ControlNet - INFO - Loading model: control_sd15_canny [fef5e48e]
2023-12-31 12:55:50,920 - ControlNet - INFO - Loading model: coadapter-depth-sd15v1 [93aff3ab]
2023-12-31 12:56:03,842 - ControlNet - INFO - Loading model: control_sd15_canny [fef5e48e]
2023-12-31 12:56:07,614 - ControlNet - INFO - Loading model: coadapter-depth-sd15v1 [93aff3ab]
Here is the full log:
I wrote about and fixed this issue in my example version. ControlNet loads models and does not remove them from VRAM and RAM. 32-bit models take up a lot of space in RAM and VRAM. This problem is fixed if models are removed from RAM and VRAM after the image creation process, if 16-bit models are used rather than 32-bit models, and if a button is added to clear RAM and VRAM (I did this in my example version).
@huchenlei thank you for looking into this.
I've updated controlnet to latest main version - I'm able to reproduce the issue. A1111 is latest master branch too.
Here is the summary of what I do:
1. generate base image without ControlNet.
2. add ControlNet unit 0, with "canny" setup, generate image once, then second time - everything is fine here.
3. add ControlNet unit 1, with "depth" setup, generate image once.
4. generate second image without any changes.
Step 4 here is where extra loading is happening.
I presume that adding second unit somehow impacts caching mechanisms.
If I filter the log to only the "loading model" lines, it is suspicious that after the first usage of "canny" the model is loaded "from cache", but after adding the second unit, "loading from cache" is no longer logged:
# step 2 first time
2023-12-31 12:54:54,121 - ControlNet - INFO - Loading model: control_sd15_canny [fef5e48e]
# step 2 second time
2023-12-31 12:55:06,385 - ControlNet - INFO - Loading model from cache: control_sd15_canny [fef5e48e]
# step 3
2023-12-31 12:55:33,246 - ControlNet - INFO - Loading model from cache: control_sd15_canny [fef5e48e]
2023-12-31 12:55:33,450 - ControlNet - INFO - Loading model: coadapter-depth-sd15v1 [93aff3ab]
# step 4 and subsequent generations.
2023-12-31 12:55:47,014 - ControlNet - INFO - Loading model: control_sd15_canny [fef5e48e]
2023-12-31 12:55:50,920 - ControlNet - INFO - Loading model: coadapter-depth-sd15v1 [93aff3ab]
2023-12-31 12:56:03,842 - ControlNet - INFO - Loading model: control_sd15_canny [fef5e48e]
2023-12-31 12:56:07,614 - ControlNet - INFO - Loading model: coadapter-depth-sd15v1 [93aff3ab]
Here is the full log:
Thanks for the log message. I think the log here is normal, as by default control_net_model_cache_size is set to 1. When you load the depth model, the canny model will be ejected from the cache. Can you try setting control_net_model_cache_size to 2 and see if it makes any difference? @nchaly
shared.opts.add_option("control_net_model_cache_size", shared.OptionInfo(
1, "Model cache size (requires restart)", gr.Slider, {"minimum": 1, "maximum": 10, "step": 1}, section=section))
Yep, that helps, now both load from cache.
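The eviction @huchenlei describes is standard LRU behavior: with a cache size of 1, alternating between two models evicts on every generation, so both reload from disk each time. A toy sketch of the mechanism (illustrative only, not the extension's actual cache code):

```python
from collections import OrderedDict


class ModelCache:
    """Minimal LRU cache mimicking the ControlNet model cache log lines."""

    def __init__(self, max_size=1):
        self.max_size = max_size
        self._cache = OrderedDict()

    def get(self, name, loader):
        if name in self._cache:
            self._cache.move_to_end(name)        # mark as recently used
            print(f"Loading model from cache: {name}")
        else:
            print(f"Loading model: {name}")
            self._cache[name] = loader()
            if len(self._cache) > self.max_size:
                self._cache.popitem(last=False)  # evict least recently used
        return self._cache[name]


cache = ModelCache(max_size=1)
cache.get("canny", dict)   # Loading model: canny
cache.get("canny", dict)   # Loading model from cache: canny
cache.get("depth", dict)   # Loading model: depth (canny evicted)
cache.get("canny", dict)   # Loading model: canny (cache miss again)
```

Bumping `max_size` to 2 lets both models stay resident, which matches the "now both load from cache" observation.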
Bumping this up. I've been using FaceID and noticed a terrible memory leak when using a lot of photos in multi-input. If I use a few, the RAM seems to bounce back just fine. However, when I use a ton and continually run new batches, I see my RAM tank and eventually resort to using my swapfile. Once both RAM and swapfile are used up, my computer hard-locks completely. Without the swapfile, several generations will hard-lock my PC. I've been using webui for over a year, and only experienced this issue when running ControlNet, specifically FaceID with multi-input.
Is there an existing issue for this?
What happened?
When trying to use the extension without any model cache switched on, after 10 tries the SSH connection to my EC2 instance fails. This always happens during the build_controlnet_model function, because the output "Loading model {model}" gets printed but the "loading state_dict" line does not.
Steps to reproduce the problem
What should have happened?
The SSH connection should not have been closed. There seems to be some error here. It happens suddenly, without any reproducible number of steps, but it does fail. It doesn't fail for plain AUTOMATIC1111, so this is a ControlNet webui problem.
Commit where the problem happens
webui: [22bcc7be] controlnet: 2270f364e167b9531daf9a8bd1d62cb2dbfa4d00
What browsers do you use to access the UI ?
Google Chrome
Command Line Arguments
Console logs
Usually it is supposed to be:
But while it fails it stops at:
The rest of the log isn't visible and SSH gets disconnected.
Additional information
It happens with the model cache as well, but after many more tries.