AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

[Bug]: Linux gets unresponsive after several generations (RAM) #6850

Closed tzwel closed 1 year ago

tzwel commented 1 year ago

Is there an existing issue for this?

What happened?

After several generations, RAM usage skyrockets, making the system unresponsive until a restart.

OS: Manjaro Linux
GPU: RX 6600 XT

Steps to reproduce the problem

  1. Launch webui
  2. Press Generate several times
  3. Watch what happens with the memory usage (see the monitoring sketch after this list), and restart your PC
  4. Repeat
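
One simple way to do the watching in step 3 (just a sketch; any monitor such as htop works too):

# refresh overall RAM and swap usage every second while generating
watch -n 1 free -h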

What should have happened?

The system shouldn't crash.

Commit where the problem happens

e0e80050091ea7f58ae17c69f31d1b5de5e0ae20

What platforms do you use to access the UI?

Linux

What browsers do you use to access the UI?

Mozilla Firefox

Command Line Arguments

--precision full --no-half --medvram

Additional information, context and logs

I wonder if it can even be reproduced. Is this a memory leak? The RAM usage is normal until some random point at which it decides to crash the system. I observed it going up very slowly with each generation, but the increase was negligible, something like 0.5-1%. I would upload a screenshot, but usage is basically at 99% when it happens.

Update: the terminal sometimes just closes instead of the system crashing.

westmancurtis commented 1 year ago

Are you switching models at all?

tzwel commented 1 year ago

Are you switching models at all?

not at all

ArcticBeat05 commented 1 year ago

Yeah I've been having this issue for several weeks now, ever since the new gradio update.

Right now I'm running --share --xformers --opt-channelslast --allow-code --enable-insecure-extension-access --gradio-debug.

I've tried adding and removing --precision full --no-half --no-half-vae --lowram --medvram --opt-split-attention, --allow-code.

It happens on Chrome as well; I'm running it on Google Colab free.

It seems to be fine when running normal-res inference, but after I'm done doing an upres, or after many batch counts, it fails to respond.

Update: before, I had only tried opening gradio in Chrome, but Google Colab was still in Firefox. I've tried running Colab in Microsoft Edge and opening gradio there as well, and the problem seems to have been fixed.

tzwel commented 1 year ago

I have trouble understanding you. Please specify: are you on Linux? Are you using the webui locally on your GPU?

I'm about to try switching the browser, but this seems very odd, because Edge runs on Chromium, just like Chrome.

Dnak-jb commented 1 year ago

I'm also experiencing system hangs. I haven't been watching system resources closely, but I will, and I'll update this post. Ubuntu 22.04, fresh install, using anything-v4 on a 5700 XT. This is my first day experimenting with the A1111 webui so I can't tell how old the issue is, but I can tell you I've hung about 4 times in the last 2 hours.

Edit: after playing around all day, I note my system RAM always goes up after each image generation, culminating in a crash once I ate it all up. I set up a pagefile and that seemed to calm the crashing down at the cost of some hiccups, but I could still crash it if I ate the pagefile up too. The usage never goes back down, however, until I close the terminal/the program. Toggling a bunch of settings all day did nothing to change that. Then I removed the --medvram argument from startup and noticed my system RAM never went up after each consecutive generation. The only other change I made at the same time was stopping images from being automatically saved, so hopefully it's not that and I am wrong. I assume the swapping the program does to keep VRAM usage low is somehow the culprit? Idk, I'm not a nerd.

As far as I can tell, removing the --medvram argument stops the memory leak. But if that doesn't, try disabling automatic image saving.

If you need any more info, just respond to this thread, or however you call someone's attention on this website.

ArcticBeat05 commented 1 year ago

Please specify: are you on Linux? Are you using the webui locally on your GPU?

I'm on Windows 10, but use Firefox. I used Google Colab's GPU/CPU. Once I opened up Google Colab in Chromium (Edge) instead of Firefox, it fixed the problem of the unresponsive "Generate" button. I have no idea why that fixed it; I believe it's Firefox-related. This has been broken for me for several weeks, but I didn't think to say anything.

Now my model won't even load on Firefox. I recommend starting to use Chrome/Edge for now.

tzwel commented 1 year ago

This is very weird. I'm observing more RAM usage over time even when I'm not generating anything.

tzwel commented 1 year ago

I found issue #2858, which seems to reference the same problem as mine. I tried downgrading gradio and it seemed to help a little bit, but the problem returned. Now I have removed the --precision full --no-half parameters from webui-user.sh and it seems to work: the RAM is getting clogged up very slowly (if at all), and it doesn't skyrocket after generations.

I won't close the issue yet, until I confirm it fixes the problem for someone else.

tzwel commented 1 year ago

Adding a swapfile seems to make the issue less annoying

sudo swapoff -a
sudo dd if=/dev/zero of=/swapfile bs=1M count=8192
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

add /swapfile none swap sw 0 0 to /etc/fstab
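
One way to append that entry from the shell (the same step as above, just spelled out as a command):

# persist the swapfile across reboots by adding it to /etc/fstab
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab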

To check whether it worked:

grep SwapTotal /proc/meminfo

lsaa commented 1 year ago

I'm also experiencing this

Done without changing the model or adding any embeds/hypernets/loras to the prompt before first generation: total 17279676K

Generations:
1: 21838172K
2: 23507288K
3: 23553092K
4: 23647104K
5: 23653284K
6: 24031884K
7: 26616812K
8: 29185140K
9: 26011716K
10: 26501696K
11: 29935684K (this one had a NaN VAE exception before it finished)
12: 26752832K
13: 28358212K
14: 28398896K
15: KILLED OOM

It seems to only do this once you click generate. If I do a batch of 15 images it only goes up slightly compared to a single image.
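
For anyone who wants to collect the same kind of numbers, a minimal sketch of one way to log the webui process's memory over time (it assumes the webui is the python process running launch.py; the figures above may have been gathered differently):

# print the webui process's resident memory (RSS, in kB) every 5 seconds
pid=$(pgrep -of launch.py)
while sleep 5; do
    ps -o rss= -p "$pid"
done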

Edit:

Specs:
OS: Artix Linux, kernel 6.1.7-zen1-1-zen
RAM: 16GB (and a 4GB swap)
GPU: RX 6400 XT 4GB
Python: 3.10.8
torch: 1.13.1+rocm5.2
Launch opts: --medvram

claychinasky commented 1 year ago

Is it possible that this memory leak is coming from the hashing code? I haven't looked into it deeply, but I'm on gradio 3.16.2, and after adding --no-hashing (which was added recently) I'm having fewer problems, I think.

lsaa commented 1 year ago

Updated and tried --no-hashing. It seems better but still runs out of memory. The behavior is slightly different this time: it stays at a constant amount of memory between generations, but after a while it starts going up like it used to.

killacan commented 1 year ago

I am having the same issue. Running without --medvram, I am not noticing an increase in used RAM on my system, so it could be the way the system transfers data back and forth between system RAM and VRAM and fails to clear out the RAM as it goes. I am also on Linux and have not tested on Windows.

claychinasky commented 1 year ago

I am having the same issue. Running without --medvram, I am not noticing an increase in used RAM on my system, so it could be the way the system transfers data back and forth between system RAM and VRAM and fails to clear out the RAM as it goes. I am also on Linux and have not tested on Windows.

I'm running without --medvram and have never used that argument, but I'm still having a leak issue. It may be a separate issue as well. I'm on Ubuntu 22.04, gradio 3.4.1 with --no-hashing; these are at least the settings that give me minimal leaking.

lsaa commented 1 year ago

OK, so I've been trying to pinpoint what causes the memory to jump and I found a few things out. I'm on ea9bd9f.

without-no-hashing.txt with-no-hashing.txt — without --no-hashing it's better than it used to be, but it still increases over time. However, after the first OOM kill I decided to try something out: on each generation I swap out a TI embed. It seems like that fills up memory a lot faster. Honestly, it could be unrelated, since I've had it leak without using TI embeds, but having it go up 1.5GB after loading an embed might be a bug.

tzwel commented 1 year ago

I can confirm that the problem lies in --medvram for my case

I'm running without --medvram and have never used that argument, but I'm still having a leak issue. It may be a separate issue as well.

might be

notdelicate commented 1 year ago

I'm having this problem as well. I use --lowvram and I can generate up to 3-4 images before my desktop crashes. In my case I can't run AUTOMATIC1111 without the --lowvram argument, so I can't test whether that's the problem.

lsaa commented 1 year ago

Running without --medvram fixed it for me as well.

Edit: tested it a bit more. It seems to be very stable; however, I can only generate smaller pics due to not having --medvram.

tzwel commented 1 year ago

How do you use your VAEs? I might be onto something: I put them in the VAE directory and I'm not noticing sudden spikes anymore. This could be a coincidence; I'll test it more.

tzwel commented 1 year ago

The RAM usage goes up VERY SLOWLY now. I think I am close to finding the cause, but I will need to verify it.

lsaa commented 1 year ago

How do you use your VAEs? I might be onto something: I put them in the VAE directory and I'm not noticing sudden spikes anymore. This could be a coincidence; I'll test it more.

Tested it, and changing to the VAE folder might have done something, but it still crashed after 78 generations. Without loading in any embeds or loras I used to get around 55 generations, so either this is an outlier or it actually made a difference.

claychinasky commented 1 year ago

I'm starting to think this issue might be related to the graphics driver / pytorch / xformers / kernel; this is Linux and Nvidia after all. (I'm on 525, the latest.) This might be a separate issue too. I have used some other repos to generate, which use pytorch and xformers, and after a while my swap was filled; I had to reset the swap a few times.

lsaa commented 1 year ago

I'm starting to think this issue might be related to the graphics driver / pytorch / xformers / kernel; this is Linux and Nvidia after all. (I'm on 525, the latest.) This might be a separate issue too. I have used some other repos to generate, which use pytorch and xformers, and after a while my swap was filled; I had to reset the swap a few times.

I'm on AMD, but you bring up a good point. I'll try to use something else like ComfyUI to see if it also causes RAM buildup.

tzwel commented 1 year ago

It crashed and logged me out now. I don't know why it does that, but after logging back in, the issue seems to be less annoying.

mcmonkey4eva commented 1 year ago

Details about my own testing of the memory leak on my system

I suspect the root of the issue is models loaded into torch are remaining loaded in system RAM after switching away from them.

I suspect the secret to locating the source of the bug lies in investigating what that XYZ plot script does differently from normal generations. Perhaps it bypasses some stage of processing somewhere?

lsaa commented 1 year ago

I suspect the secret to locating the source of the bug lies in investigating what that XYZ plot script does differently from normal generations. Perhaps it bypasses some stage of processing somewhere?

Yesterday I was testing large batch count generations. I ran out of memory at the very end of a 100 batch count, 2 batch size gen (no XYZ plot), exactly when it should have been generating the txt2img grid; I can confirm that all 200 pics generated successfully and the grid image is not there. On the XYZ plot, the grid is not generated in the usual way and the only image added to the result gallery is the plot itself.

Currently trying to see if I can avoid the leak on my machine by using the XYZ plot. Will probably edit later.

Edit: still have a leak while using only the XYZ plot script to generate images. No embeds/loras/hypernets or any extensions that aren't built-in. The only arg is --medvram.

mcmonkey4eva commented 1 year ago

It's quite likely there are multiple different memory leaks going on, or multiple variants of one root leak.

I note that the leak I narrowed down myself relates to model loading, and the --medvram and --lowvram arguments cause model data to be loaded and unloaded repeatedly while running, which could well be the same root cause but different symptoms.

If the root leak is related to the Torch internal code that transfers data between CPU and GPU, that would perfectly explain why medvram/lowvram seem to make it worse, and also explain why --precision full makes it worse too.

lsaa commented 1 year ago

OK, so I'm also pretty confident it's an issue in the torch backend. I tried this fix out: https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/6722 and it's running perfectly. Note for Arch users: gperftools is built against GLIBCXX 3.4.30 and the Arch repos are behind; get an older version from the archives. gperftools 2.9 works for me.

mcmonkey4eva commented 1 year ago

That thread fixed it!

sudo apt install libgoogle-perftools-dev then add export LD_PRELOAD=libtcmalloc.so in webui-user.sh

I'm now able to repeat my earlier test and memory grows to 33% of available RAM then stops growing.
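
Spelled out, a minimal sketch of the relevant webui-user.sh lines on a Debian/Ubuntu-style system (assuming libgoogle-perftools-dev puts libtcmalloc.so on the default linker path; adjust the name/path for other distros):

# webui-user.sh: preload tcmalloc so allocations from Python/torch go through it
export LD_PRELOAD=libtcmalloc.so

# keep whatever launch arguments you already use, e.g.:
export COMMANDLINE_ARGS="--medvram"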

lsaa commented 1 year ago

it's not really "fixed" as I highly doubt the issue is with malloc itself. This probably needs to be reported to torch

tzwel commented 1 year ago

It seems to have worked (I'm not entirely sure, the RAM still goes up, but slowly). On Manjaro it was:

sudo pacman -S gperftools 
pamac install downgrade
sudo DOWNGRADE_FROM_ALA=1 downgrade gperftools

choose 2.9.1

in the webui directory:

kate webui-user.sh

paste in

export LD_PRELOAD=libtcmalloc.so

tzwel commented 1 year ago

Actually, I'm unsure whether I did it right. @lsaa, mind sharing instructions?

lsaa commented 1 year ago

Actually, I'm unsure whether I did it right. @lsaa, mind sharing instructions?

I didn't want to install the older version over the new one since it's a dependency of other stuff I use, so I just grabbed 2.9.1 from the official archive:

https://archive.archlinux.org/packages/g/gperftools/

I extracted it with tar --use-compress-program=unzstd -xvf ~/Downloads/gperftools-2.9.1-2-x86_64.pkg.tar.zst and just launched it with the env vars HSA_OVERRIDE_GFX_VERSION=10.3.0 LD_PRELOAD=/home/lsaa/mnt/dockerx/tstlibs/usr/lib/libtcmalloc_minimal.so bash webui.sh
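
If anyone else is unsure whether the preload took effect, one way to check is to look for the library in the running process's memory maps (a sketch; the pgrep pattern assumes the webui runs via launch.py, so adjust it to your setup):

# check whether tcmalloc actually got mapped into the webui process
pid=$(pgrep -of launch.py)
grep -q tcmalloc "/proc/$pid/maps" && echo "tcmalloc loaded" || echo "tcmalloc NOT loaded"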

tzwel commented 1 year ago

This seems to work well!

notdelicate commented 1 year ago

That thread fixed it!

sudo apt install libgoogle-perftools-dev then add export LD_PRELOAD=libtcmalloc.so in webui-user.sh

I'm now able to repeat my earlier test and memory grows to 33% of available RAM then stops growing.

Can confirm that this solved the problem. Now I can generate forever, thank you (for writing the instructions).

dathide commented 1 year ago

I didn't want to install the older version over the new one since it's a dependency of other stuff I use, so I just grabbed 2.9.1 from the official archive:

https://archive.archlinux.org/packages/g/gperftools/

I extracted it with tar --use-compress-program=unzstd -xvf ~/Downloads/gperftools-2.9.1-2-x86_64.pkg.tar.zst and just launched it with the env vars HSA_OVERRIDE_GFX_VERSION=10.3.0 LD_PRELOAD=/home/lsaa/mnt/dockerx/tstlibs/usr/lib/libtcmalloc_minimal.so bash webui.sh

Thanks, this worked for me. Instead of putting the env vars in the command, I just added them to my webui-user.sh on a couple of new lines, such as export HSA_OVERRIDE_GFX_VERSION="10.3.0".
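
For reference, a sketch of what those webui-user.sh lines look like (the library path is the one from the comment above; yours will differ depending on where you extracted gperftools):

# webui-user.sh additions for the AMD/ROCm + tcmalloc setup described above
export HSA_OVERRIDE_GFX_VERSION="10.3.0"
export LD_PRELOAD="/home/lsaa/mnt/dockerx/tstlibs/usr/lib/libtcmalloc_minimal.so"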

Dnak-jb commented 1 year ago

That thread fixed it!

sudo apt install libgoogle-perftools-dev then add export LD_PRELOAD=libtcmalloc.so in webui-user.sh

I'm now able to repeat my earlier test and memory grows to 33% of available RAM then stops growing.

This seems to have stopped the leak for me as well. Now I can run with --medvram without constant issues. Whether or not that is considered a "fix" doesn't matter to me, because now I'm not plagued by a time limit when generating.

julsizeliki commented 1 year ago

for saturnclowd with as ControlNet --share --enable-insecure-extension-access --xformers --no-hashing --lowram \ --gradio-queue --skip-version-check --lowram --disable-safe-unpickle --force-enable-xformers

tzwel commented 1 year ago

for saturnclowd with as ControlNet --share --enable-insecure-extension-access --xformers --no-hashing --lowram --gradio-queue --skip-version-check --lowram --disable-safe-unpickle --force-enable-xformers

what do you mean

yrao1000 commented 1 year ago

I'm running into the same issue even after making all the changes. I'm running my webui on an EC2 instance btw.

tzwel commented 1 year ago

The RAM still goes up after the libtcmalloc fix, but more slowly, and it's not that drastic. What could be causing this? It's more of a gradual rise than the sharp spikes I had before.

tzwel commented 1 year ago

Disabling preview images freed me; the RAM no longer goes up.

WooXinyi commented 1 year ago

preview

So how do we finally solve this problem?

WooXinyi commented 1 year ago

That thread fixed it!

sudo apt install libgoogle-perftools-dev then add export LD_PRELOAD=libtcmalloc.so in webui-user.sh

I'm now able to repeat my earlier test and memory grows to 33% of available RAM then stops growing.

Thanks! I fixed it too.

tzwel commented 1 year ago

I'm closing this issue because it's probably solved for everyone here. The issue seemed to be the --medvram argument; removing it fixed the problem for the most part, and if it didn't (or you wanted to keep using the argument), installing libtcmalloc.so fixed the problem entirely for most of us. If it still doesn't work, try turning off image previews.

Instructions for every solution are in this issue

yaslack commented 1 year ago

That thread fixed it!

sudo apt install libgoogle-perftools-dev then add export LD_PRELOAD=libtcmalloc.so in webui-user.sh

I'm now able to repeat my earlier test and memory grows to 33% of available RAM then stops growing.

Worked for me thank you

FoxxMD commented 9 months ago

Just wanted to chime in and say the libtcmalloc method from the earlier comment has fixed my memory leak issue on the master branch as of 2023-11-13 (hash f1862579), using the docker image holaflenain/stable-diffusion with SD.Next. Disabling image previews made no discernible difference (you can keep them on).