AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

Not freeing RAM when changing between checkpoints #2180

Closed FerrahWolfeh closed 1 year ago

FerrahWolfeh commented 2 years ago

Describe the bug When you start the webui with an X checkpoint, it fills the system RAM up to a certain amount. If you then change the checkpoint to Y in the webui, RAM usage increases as if both models were loaded. If you change back to checkpoint X, RAM usage remains unusually high and the system begins to swap violently as soon as you start generating images.

To Reproduce Steps to reproduce the behavior:

  1. Start webui.sh with any model (e.g. Waifu-Diffusion 1.3) and measure the system RAM once startup finishes (see the sketch below for one way to log it)
  2. Use the selector at the top of the page to change to another model (e.g. Stable-Diffusion 1.4)
  3. Change back to the first model and check RAM again.
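
For measuring, a small helper like this can log system RAM at each step (just a sketch; it assumes the psutil package is installed, which is not part of the webui itself):

```python
# Log system RAM usage at each reproduction step.
# Assumes `pip install psutil`; psutil is not part of the webui itself.
import psutil

def log_ram(label: str) -> None:
    mem = psutil.virtual_memory()
    print(f"{label}: used={mem.used / 2**30:.1f} GiB, "
          f"available={mem.available / 2**30:.1f} GiB")

log_ram("after startup")         # step 1
# ...switch checkpoints in the UI...
log_ram("after switching")       # step 2
log_ram("after switching back")  # step 3
```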

Expected behavior As soon as you switch checkpoints, the program should free most of the memory used by the currently loaded model and fill it with the newly selected model. And when you switch back to the first checkpoint, the memory should again be freed and settle at roughly the amount the program used when it was freshly started.
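
In other words, something like this rough sketch (illustration only, not the webui's actual load_model code; `load_new_model` stands in for whatever loads the next checkpoint): drop every reference to the old model before loading the new one, so the garbage collector can return the memory.

```python
import gc
import torch

_current_model = None  # the single place holding the loaded checkpoint

def swap_checkpoint(load_new_model):
    """Illustration only: release the old model before loading the new one."""
    global _current_model
    _current_model = None              # drop the only reference to the old weights
    gc.collect()                       # let Python reclaim host RAM
    if torch.cuda.is_available():
        torch.cuda.empty_cache()       # return cached VRAM to the driver
    _current_model = load_new_model()  # only now allocate the new checkpoint
    return _current_model
```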

Screenshots Here are some screenshots of the memory usage of my system (notice the used and available columns)

Base system usage (only Firefox open with some active YouTube tabs)

Usage right after initialization

Usage after switching to another checkpoint

Usage after switching back to first model


Additional context This is most visible on a system that doesn't have much RAM to begin with (16 GB in my case), and the effects are visible even without generating anything. It gets worse if you start switching checkpoints between generations.

bmaltais commented 2 years ago

This probably explains why loading a new model usually crashes the webui after 7 or 8 model swaps on my system with 16 GB of RAM allocated to WSL2.

CoffeeMomoPad commented 2 years ago

Experiencing the same thing on 16 GB RAM; this did not happen until now.

TechOtakupoi233 commented 2 years ago

When loading a new ckpt, the program starts loading the new ckpt but leaves the old one in VRAM and RAM. I have only 6 GB of VRAM, which can't hold two models at once. It would be nice if the program freed up VRAM and RAM BEFORE loading a new ckpt.

nerdyrodent commented 2 years ago

Same here. Switching models uses more and more RAM. I've tried changing "Checkpoints to cache in RAM", but it appears to make no difference.
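
For context, I'd expect that setting to control a bounded cache roughly like the sketch below (not the webui's actual code); but if other parts of the code still hold references to evicted state dicts, shrinking the cache won't free anything.

```python
from collections import OrderedDict

# Rough sketch of a bounded checkpoint cache, not the webui's actual code.
checkpoints_loaded = OrderedDict()   # checkpoint name -> state dict
CACHE_LIMIT = 1                      # what "Checkpoints to cache in RAM" would set

def cache_checkpoint(name, state_dict):
    checkpoints_loaded[name] = state_dict
    checkpoints_loaded.move_to_end(name)        # mark as most recently used
    while len(checkpoints_loaded) > CACHE_LIMIT:
        # Eviction only frees RAM if nothing else still references the state dict.
        checkpoints_loaded.popitem(last=False)  # drop the least recently used entry
```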

anonymous721 commented 2 years ago

It's causing me a lot of annoyance too. After 15-20 minutes of testing a couple of different models, I'm already over 20 GB of system RAM used.

RandomLegend commented 2 years ago

This is a serious issue for me now. I have 16 GB of RAM and I never had any issues switching between models before.

I ran an old version of the webui perfectly fine, upgraded to the newest git version because of the new features, and now I can't even swap a model once. It just crashes violently.

GeorgiaM-honestly commented 2 years ago

Hello,

I'm trying to replicate this by using, as your screenshots show, 15 GiB of RAM. My setup differs: the bare-metal OS is Gentoo Linux, then I'm using a QEMU VM with Devuan (Debian without systemd), and within that, auto running inside a Docker container. Hopefully that added complexity doesn't skew my testing.

And yes, you are seeing correctly: I don't have swap. I didn't bother because this stuff lives on a host that has 128 GB of RAM and I can just dial in whatever I want to give to the VM.

The formatting here is getting completely hosed, I'm not sure what is going on, sorry about that.

After the initial start and before visiting the UI, which here loads the standard 1.5 model ( v1-5-pruned-emaonly.ckpt | 81761151 ):

GiB:

               total        used        free      shared  buff/cache   available
    Mem:          14           5           4           0           4           8
    Swap:          0           0           0

After visiting the UI and switching the model to the standard v1.4 ( 7460a6fa ):

GiB:

               total        used        free      shared  buff/cache   available
    Mem:          14           8           1           0           4           5
    Swap:          0           0           0

After switching back to the standard 1.5 model ( v1-5-pruned-emaonly.ckpt | 81761151 ):

GiB:

               total        used        free      shared  buff/cache   available
    Mem:          14           8           1           0           4           5
    Swap:          0           0           0

As such, I am not able to replicate this. Please let me know if I missed something, or if you'd like me to try something else! You could also look into zram / compressed RAM on Linux; it is a handy and tunable set of options that compresses the oldest RAM contents (gently at first, more heavily if resources continue to run short), with the goal of delaying when the very slow swap space gets used.

0xdevalias commented 2 years ago

The formatting here is getting completely hosed, I'm not sure what is going on, sorry about that.

@GeorgiaM-honestly Have you wrapped it in triple backticks to make it a code block? (```)


Random thought/musing (not sure if this actually relates to how things are done in the code at all), but is the model ckpt hash used for caching it anywhere (or was it at some point in the past)? I know there are some other issues here (~can't remember the links off the top of my head~ see link below) that were talking about different model ckpts that had the same hash even though they were different files. I'm wondering if switching back and forth between models with that 'hash clash' might somehow be causing this memory leak?
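
For illustration, my understanding (which may be out of date) is that the short hash is computed from only a small slice of the checkpoint file, roughly like the sketch below, which is why two different ckpts can end up with the same 8-character hash:

```python
import hashlib

def short_model_hash(filename: str) -> str:
    """Rough sketch of how a short checkpoint hash can collide: only a small
    64 KiB slice of the file is hashed, so different checkpoints can share
    the same 8-character hash. (My recollection of the approach, not a copy
    of the webui's code.)"""
    with open(filename, "rb") as file:
        file.seek(0x100000)          # skip the first 1 MiB
        chunk = file.read(0x10000)   # hash only the next 64 KiB
    return hashlib.sha256(chunk).hexdigest()[:8]
```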

Edit:

This one:


I ran an old version of the webui perfectly fine, upgraded to the newest git version because of the new features, and now I can't even swap a model once. It just crashes violently.

@RandomLegend Is this swapping between any models at all? Are you able to provide the model hashes for some of the models that cause it to crash? Do they happen to have the same hash as per my theory above by chance?


Also, this is a separate issue, but I saw it linked here, and wanted to backlink to it in case it's relevant:

And this one may also be related:

Changing to an inpainting model calls load_model() and creates a new model, but the previous model is not removed from memory; even calling gc.collect() does not remove the old model from memory.

So if you keep changing from inpainting to non-inpainting or vice versa, the leak keeps increasing.

Originally posted by @jn-jairo in https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/3449#issuecomment-1287999085

The fact that gc.collect() doesn't clear the old model is interesting however. This means that something is keeping a pointer to the old model alive and preventing it from being cleaned up.

Originally posted by @random-thoughtss in https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/3449#issuecomment-1288046296

Just to report the progress I made: it is indeed a reference problem. Some places are keeping a reference to the model, which prevents the garbage collector from freeing the memory.

I am checking it with `ctypes.c_long.from_address(id(shared.sd_model)).value` and there are multiple references.

I am eliminating the references, but there are still some left to find. It will take a while to find everything.

Originally posted by @jn-jairo in https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/3449#issuecomment-1292994606
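
For anyone who wants to poke at this themselves, here's a minimal sketch (run from a console inside the webui process) that combines the ctypes trick quoted above with gc.get_referrers to see what is still holding on to the model:

```python
import ctypes
import gc
import sys

from modules import shared  # webui module that holds the loaded model

obj = shared.sd_model

# Raw CPython refcount (the same trick quoted above); note that
# sys.getrefcount() reports one extra temporary reference of its own.
print("refcount:", ctypes.c_long.from_address(id(obj)).value)
print("getrefcount:", sys.getrefcount(obj))

# List the objects that still reference the model, to track down the leak.
for referrer in gc.get_referrers(obj):
    print(type(referrer), repr(referrer)[:120])
```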

0xdevalias commented 2 years ago

Looking at the 'references timeline' on https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/3449 also pointed me to this PR by @jn-jairo that was merged ~9 days ago:

@GeorgiaM-honestly I wonder if that's why you can't replicate the issues here anymore?

@RandomLegend have you updated to a version of the code that has that fix merged, and if so, are you still seeing issues despite it?

0xdevalias commented 1 year ago

@0xdevalias When I observed and reported this issue I was on the latest code, yes.

However, I just completely wiped the installation, including the venv and the repos, and reinstalled from scratch. That fixed it. I assume it was some incompatibility with old stuff lying around that wasn't cleared by recent commits.

Originally posted by @RandomLegend in https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/2264#issuecomment-1309604724

tzwel commented 1 year ago

How do I downgrade?

clementine-of-whitewind commented 1 year ago

please fix

Coderx7 commented 1 year ago

I'm having the same issue on the latest commit. I never had this issue before and it just popped up out of nowhere! I'm on Ubuntu 22.04 with 32 GB of RAM (and no swap), and:

Python revision: 3.9.7 (default, Sep 16 2021, 13:09:58) 
[GCC 7.5.0]
Dreambooth revision: 9f4d931a319056c537d24669cb950d146d1537b0
SD-WebUI revision: 68f336bd994bed5442ad95bad6b6ad5564a5409a

Checking Dreambooth requirements...
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.16rc425 installed.
[+] torch version 1.13.1+cu117 installed.
[+] torchvision version 0.14.1+cu117 installed.

Side note: I did install google-perftools and then removed it, thinking it might have something to do with this. Nothing changed.

catboxanon commented 1 year ago

The dev branch and upcoming 1.6.0 may have resolved this with the rework in https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/b235022c615a7384f73c05fe240d8f4a28d103d4. I'm going to leave this open for the time being, but for those who would like to test it earlier, you can switch to the dev branch to do so.

Avsynthe commented 1 year ago

Hey all. I'm having this issue too. I'm using 1.6.0 and it never releases RAM. The more I generate, the higher it goes.

The server went down today and I couldn't figure out why the last snapshot of the system showed 99% of 64 GB of memory used. I realised SD was just compounding away. This happens no matter what model I use, with VAE models increasing it quicker for obvious reasons. Switching models makes no difference; it just continues on.

I've had to limit SD to 20 GB of RAM, so it will eventually crash when it hits that limit.

Wynneve commented 1 year ago

@Avsynthe Hello there! I've been having the same issue for an entire day now, and it seems like I've found a “solution”. I've tried switching some settings in the webui, changing the CUDA toolkit version in my PATH, changing the CUDA version of PyTorch, updating to the “dev” branch of the webui, etc. Nothing worked.

Then I realized that I had updated PyTorch before this problem appeared, so I tried downgrading to PyTorch 2.0.1. And it worked! No more memory leak; now it properly offloads the weights from RAM to VRAM and vice versa each generation.

For your convenience, here is the command for installing the previous version of PyTorch: `pip3 install torch==2.0.1 torchvision --index-url https://download.pytorch.org/whl/cu118`. As I remember, I deleted the existing install before reinstalling, so if it refuses to downgrade, you can manually remove it before executing the command: `pip3 uninstall torch torchvision`.

Seems like it's more an issue with the new PyTorch itself, something related to moving tensors between devices.
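
If anyone wants to sanity-check that theory on their own install, a rough standalone test like this (assumes psutil is installed and a CUDA GPU is available) just moves the same weights back and forth between devices and watches whether the process RSS keeps climbing:

```python
import psutil
import torch

# Rough standalone check: does repeatedly moving weights between devices
# grow process memory on your torch version? Assumes psutil and a CUDA GPU.
process = psutil.Process()
weights = torch.randn(512, 1024, 1024)    # ~2 GiB of float32 "model weights"

for step in range(10):
    weights = weights.to("cuda")          # offload to VRAM
    weights = weights.to("cpu")           # move back to system RAM
    torch.cuda.empty_cache()
    rss_gib = process.memory_info().rss / 2**30
    print(f"step {step}: RSS = {rss_gib:.2f} GiB")
```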

If you aren't using Torch 2.1.0, well, my sincere apologies for not helping you :(

DanielXu123 commented 7 months ago

@Avsynthe Same thing on Linux; it climbed to 100 GB of RAM. Are there any possible solutions?

DanielXu123 commented 7 months ago

@Wynneve I downgraded torch from 2.1.0 to 2.0.1, but now it says my xformers cannot be activated correctly. Could you please check what your xformers version is?
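
For reference, this is how I'm checking the installed versions on my side (recent xformers builds expose __version__; older ones may not, in which case `pip show xformers` works too):

```python
# Print the torch / torchvision / xformers versions installed in the venv.
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    import xformers  # may fail if it was built against a different torch
    print("xformers:", getattr(xformers, "__version__", "unknown"))
except ImportError as err:
    print("xformers not importable:", err)
```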