ifeelrobbed opened 2 years ago
Memory leak maybe?
From /var/log/syslog:
Oct 16 12:28:53 pop-os kernel: [69600.171513] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-36.scope,task=python3.10,pid=48807,uid=1000
Oct 16 12:28:53 pop-os kernel: [69600.171634] Out of memory: Killed process 48807 (python3.10) total-vm:32734412kB, anon-rss:13482224kB, file-rss:65752kB, shmem-rss:14340kB, UID:1000 pgtables:37700kB oom_score_adj:0
Oct 16 12:28:55 pop-os systemd[1]: session-36.scope: A process of this unit has been killed by the OOM killer.
Watching memory climb as I run it. From restart to crash, with a little of the middle not in the screenshots.
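A rough way to watch the same climb from a second terminal (just a sketch; it assumes launch.py is the entry point, so adjust the pgrep pattern if yours differs):
# refresh the webui process's resident/virtual memory every 5 seconds
watch -n 5 'ps -o pid,rss,vsz,etime,cmd -p "$(pgrep -f launch.py | head -n1)"'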
I had the same problem after an update (https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/2782). Restarting the computer seems to have fixed it. Try restarting!
(I have 32GB of memory and I don't think the amount of memory is the problem; I never hit the limit.)
Same issue here for about two days now. Running natively on Ubuntu. Sometimes the whole PC freezes completely and I have to hard reset. Sometimes it freezes for up to 40 seconds, and when I keep the console as the active window I get the same error output as yours and can restart the webui. 32GB RAM, i5-12600K, RX 6650 XT.
Edit: It has either been fixed or was related to "Radeon Profile" on Linux. No freezes since my last restart without Radeon Profile active. Edit 2: Spoke too soon. The PC crashed again on the run right after the first edit. So it is not Radeon Profile related and not fixed yet.
Unfortunately rebooting didn't seem to change anything.
Did you update Gradio and the other libraries? Recent updates seem to require newer versions. Run
pip install pip-upgrader
and then
pip-upgrade
which will update the Python dependencies from the new requirements.txt.
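A rough sketch of those commands, assuming the default ./venv that webui.sh creates:
# run the upgrade helper from inside the webui's virtual environment
source venv/bin/activate
pip install pip-upgrader
pip-upgrade requirements.txt   # review the suggested bumps before confirming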
Went through those steps. Gradio was already up to date. It did update 3 others: fairscale, timm, transformers.
Still maxed out memory and was killed.
Possibly a memory leak; as a stopgap I had to create dynamic swapfiles of up to 10GB on my system.
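For anyone who needs the same stopgap, roughly what that looks like (assumes root and a filesystem where fallocate works):
# create and enable a 10GB swapfile
fallocate -l 10G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# add "/swapfile none swap sw 0 0" to /etc/fstab to keep it across reboots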
Same here, there is some memory leak, probably introduced around the 14th-16th; older commits don't have that issue.
Memory usage increases right after batch generation starts, stays flat during generation, and increases again when the next batch is started by clicking the generate button.
Yes, same problem for me. It can eat up ~1GB of RAM per generation, which is never returned to the system, so shutting down Stable Diffusion and restarting it to reclaim that RAM becomes a regular necessity.
Running on an RTX 3060 12GB, 32GB RAM, Arch Linux.
Dang. Got excited when I saw the commit "fix bug for latest model merge RAM improvement".
However, I still maxed out memory, swap, and the process was killed after ~6 minutes.
Having the same issue: after some time generating, the process will die with "webui.sh: line 141: (pid) killed"
syslog:
Oct 18 21:43:09 DraxPC kernel: [ 5364.528101] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-688059d0-5100-4a81-983d-15c959b6b48a.scope,task=python3,pid=5445,uid=1000
Oct 18 21:43:09 DraxPC kernel: [ 5364.528182] Out of memory: Killed process 5445 (python3) total-vm:30828808kB, anon-rss:13574820kB, file-rss:70656kB, shmem-rss:14340kB, UID:1000 pgtables:35620kB oom_score_adj:0
Oct 18 21:43:09 DraxPC systemd[1]: user@1000.service: A process of this unit has been killed by the OOM killer.
Oct 18 21:43:09 DraxPC systemd[1163]: vte-spawn-688059d0-5100-4a81-983d-15c959b6b48a.scope: A process of this unit has been killed by the OOM killer.
Ryzen 5600X, 16GB RAM, GTX 1650 4GB VRAM, Linux Mint 21.whatever
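A convenience for anyone checking whether their crash is the same thing: the identical kernel messages can be pulled from the journal instead of /var/log/syslog.
# kernel messages from the current boot, filtered for OOM kills
journalctl -k -b | grep -iE 'out of memory|oom-kill'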
I found the problem: it is Gradio 3.5. The leak starts with commit 4ed99d599640bb86bc793aa3cbed31c6d0bd6957, and downgrading Gradio back to 3.4.1 solves it. I don't know what other changes were made for Gradio 3.5 that might break by downgrading, but it has been working well for me so far.
What do you think @AUTOMATIC1111 can you check it out?
How would one go about downgrading gradio for the time being?
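Not an official answer, but a rough sketch assuming the default ./venv created by webui.sh:
# pin gradio back inside the webui's virtual environment
source venv/bin/activate
pip install gradio==3.4.1
Note that launch.py may pull the newer version back in on the next start, so the approach described later in this thread (also editing requirements.txt and requirements_versions.txt) is probably the more durable route.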
I have this problem when I use --medvram (RAM fills up, then swap, until the system crashes), but not when I don't.
Interesting. I'm using the following arguments:
--medvram --opt-split-attention --force-enable-xformers
lowvram and medvram offload the model parts to the CPU when they are not being used by the GPU, so using them will use more RAM and less VRAM. That is not a leak per se, but you will need more RAM.
Hm, but when I start (and on the first generations) I have quite a lot of free RAM (about 6GB plus 10GB swap). Every image generated takes a little more, and after around 50 images it fills up. If it didn't leak, RAM usage should stay about the same and not build up over time.
@leandrodreamer as I said previously (https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/2858#issuecomment-1283381801), I found it is caused by the Gradio upgrade, and downgrading removes the leak.
I didn't say there is no leak; I said that by using lowvram/medvram you will use more RAM than without it, so the increase in memory due to lowvram/medvram is not a leak, it is supposed to happen.
I didn't identify any leak related to those options.
Oh, got it. What I find strange is that I don't have any leak problems without the --medvram param; I can make hundreds of images no problem (without downgrading gradio). Maybe it's a mix of the new gradio version and that param? Or maybe I have a completely different problem here, idk :b
@leandrodreamer yes, it may be a mix of settings. You can try reverting commit 4ed99d599640bb86bc793aa3cbed31c6d0bd6957 to test whether your problem is the same one I identified or something else.
I just deactivated and deleted venv, reverted to 7d6042b908c064774ee10961309d396eabdc6c4a, which is the last commit before Gradio 3.5, commented out the line in webui.sh that performs git pull
and let it just reinstall everything. Memory usage is steady and I am generating images just fine again.
Alright, after a day of no issues, I performed a git pull, modified requirements.txt and requirements_versions.txt back to gradio==3.4.1, and commented out the git pull line in webui.sh. So far so good. The only change from the latest commit should be the gradio downgrade and memory usage is steady.
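Roughly the same recipe as commands, in case it saves someone some typing (it assumes the pin appears as a line starting with gradio== in both files, so double-check before running the sed):
git pull
sed -i 's/^gradio==.*/gradio==3.4.1/' requirements.txt requirements_versions.txt
# comment out the git pull line in webui.sh by hand, then reinstall inside the venv
source venv/bin/activate
pip install -r requirements_versions.txt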
Haven't had issues since yesterday evening; seems to be fixed.
Still had the same problem, nothing changed after latest git pull. Decided to reinstall from scratch, and lo and behold, no more memory leaks.
Sadly it didn't work for me; I reinstalled everything and the leak persists with the latest master commit.
Ok, so I ran automatic1111 through this docker image: https://github.com/AbdBarho/stable-diffusion-webui-docker
And it had the same problem for me, eating ram. So I went back to compare my previous installation of automatic1111 (I backed it up when I reinstalled) and the only difference was that in webui-user.sh, I had the --medvram parameter
So I edited the docker-compose.yml in the docker image and removed --medvram, and now there are no more leaks. Then I added --medvram to my reinstalled local version and it leaks memory again. So for me, just like leandrodreamer stated in this thread, --medvram is the culprit.
Now, I have 12GB of VRAM, so not being able to use --medvram isn't much of a problem for me, but for those with less VRAM, not being able to use it might be a pain or even make it impossible to run.
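For reference, on a native install the equivalent toggle is just the COMMANDLINE_ARGS line in webui-user.sh; the example below simply drops --medvram, and the remaining flag is only a placeholder:
# webui-user.sh with --medvram removed (other flags are just examples)
export COMMANDLINE_ARGS="--listen"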
Yeah, with my 2060 I have to use --medvram for it to work at all. The only way I've found to prevent the memory leak, regardless of which commit I revert to, is to force Gradio 3.4.1.
Same thing happening to me. Manually downgrading gradio to 3.4.1 via pip seems to fix this problem.
Running in docker on linux, 32gb system ram, rx580 4gb.
Is this an issue in gradio (upstream) or an issue with how this repo uses gradio?
Downgrading gradio apparently fixes the issue, which strongly suggests that the issue is upstream.
did some further testing and this commit to gradio causes the leak: https://github.com/gradio-app/gradio/commit/a36dcb59750b1f4cd7e66d3b39ba0621ee89183b
Edit: I even tested running without --medvram with latest gradio and observed no leak, so the cause is --medvram option combined with https://github.com/gradio-app/gradio/commit/a36dcb59750b1f4cd7e66d3b39ba0621ee89183b or later.
still happening as of d61f0ded24b2b0c69b1b85293d6f71e4de0d1c63
Yeah I was hoping Gradio 3.8 with 17087e306d4f888b67037a528bc4cf161995e1c4 would work, but still have the same issue.
Downgraded to 3.4.1 and am back in business.
Still happening on 828438b. I can only generate like 5 images before it crashes if I'm using --medvram on my AMD card. I'm using COMMANDLINE_ARGS="--listen --precision full --no-half --medvram"
I also need to use "export HSA_OVERRIDE_GFX_VERSION=10.3.0" for pytorch to work (I'm using a 5700XT, 8GB vram)
With medvram I get around 3it/s and the process gets killed, but without it I only get 1.3it/s and my monitors sometimes disconnect lol, so medvram working correctly is super important.
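In case the full launch setup helps anyone reproduce this, roughly what the above amounts to in webui-user.sh (values taken from this comment, not a recommendation):
# RX 5700 XT / ROCm: the HSA override is needed for pytorch to see the card
export HSA_OVERRIDE_GFX_VERSION=10.3.0
export COMMANDLINE_ARGS="--listen --precision full --no-half --medvram"
Then launch with ./webui.sh as usual.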
Yeah, I also solved it with the same solution. Thanks. https://hjlabs.in
gradio to 3.4.1 for ubuntu system downgrade, please
I've been generating a lot of X/Y plots in a single session, and I don't feel it's leaking memory in that use case. Possibly of note is that the selected Y axis is Checkpoint name.
I've not tried switching gradio version yet.
Haven't downgraded gradio yet but been experiencing this as well. 3050 runs okay without it but I don't think I can try training hypernetworks without --medvram so the leak is annoying.
Sorry for the close; lag made me touch the close button.
I downgraded to gradio 1.4.1 and triple-checked that it is still 1.4.1, but memory usage is still maxed out, even after all jobs are completed...
Gradio 3.16.2 still has a RAM leak when using --medvram
This is frustrating because I don't know if it's going to get fixed
Change gradio==[versionnumber] to gradio==3.4.1 in requirements.txt, then run:
pip install -r requirements.txt
Profit.
3.4.1 is incompatible but also extremely laggy.
yup. it lags a lot, but it fixed the problem for me to some extent
3.4.1 is incompatible; I removed --medvram for now and haven't noticed a leak.
[I am posting this in multiple places; it seems to be a common issue] I have had a similar problem and solved it, apparently permanently. Here's what I think is going on: the websockets layer between A1111 and SD is losing a message and hanging while waiting for a response from the other side. It appears to happen when there is a lot of data going back and forth, possibly overrunning a queue someplace. If you think about it, A1111 and SD are shovelling big amounts of image data across the websockets.

And here's how you exacerbate it: tell A1111 to display each image as it's created, then set the "new image display time" down around 200ms. If you do that, it'll start failing pretty predictably, at random. How to fix it: have it display the image every 30 iterations and set the display time to around 10 seconds. Poof, problem gone. [This problem resembles a bug in Sun RPC from back around 1986; plus ça change...]
This problem still exists. Removing --medvram stopped the memory leak when generating images, but switching between checkpoints does seem to do the same thing: after switching, RAM spikes and doesn't go back down.
Can confirm this is the case: with --medvram, sd-webui gradually consumes 31.8GB of memory and gets killed, with the OOM kill showing up in dmesg. I was doing SDXL image generation with refining on. Adding --lowram does not mitigate the issue.
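A quick way for anyone else to confirm that it was the kernel OOM killer (and not a crash elsewhere):
# look for recent OOM kills in the kernel log
sudo dmesg -T | grep -iE 'out of memory|oom-kill' | tail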
Same here. I was doing SDXL with the refiner, and the program quickly gets killed by the OOM killer as it switches between the base and refiner models.
Describe the bug
Consistently hangs after 6-7 minutes since yesterday (10/15). Hopping on the command line, the process is shown as killed. This happens both when starting with webui.sh and with launch.py.

To Reproduce
Steps to reproduce the behavior: 6-7 minutes of activity in the web UI. The UI hangs and eventually the process is killed.

Expected behavior
Not hang?