AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: SD WebUI memleak when switching models #14329

Open DistantThunder opened 9 months ago

DistantThunder commented 9 months ago

What happened?

SD WebUI regularly gets OoM-killed on system RAM when using different models, even when they are supposedly already loaded in VRAM. It appears multiple copies of the models end up being copied to system RAM.

Steps to reproduce the problem

  1. Run SDUI
  2. Go to settings and enable 2 or more models to be kept in VRAM simultaneously
  3. Load a model, and use it.
  4. Load a second model and use it.
  5. Re-load the first model and use it.
  6. Repeat... Over these loads, even though RAM consumption during image generation itself remains under control, every model switch consumes several additional GB of system RAM that are never released.
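The growth described in step 6 can be confirmed from outside the UI by sampling the process's resident set size between model switches. A minimal sketch (Linux-only, reading /proc; a large bytes allocation stands in for a checkpoint load here, and in practice you would pass the webui process's PID instead of the default):

```python
import os

def rss_mib(pid: int = os.getpid()) -> float:
    """Return the resident set size of `pid` in MiB (Linux /proc interface)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024  # value is reported in kB
    return 0.0

before = rss_mib()
blob = b"x" * (50 * 1024 * 1024)  # stand-in for a ~50 MiB checkpoint load
after = rss_mib()
print(f"RSS grew by ~{after - before:.0f} MiB")
```

Sampling this around each model switch makes the leak visible as a monotonically growing RSS even while VRAM usage stays flat.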

What should have happened?

Ideally, I believe SDUI should avoid keeping the same model in VRAM and RAM simultaneously, and it should respect a configured maximum for the total amount of model data kept in VRAM and RAM combined.
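The second point amounts to a size-bounded, LRU-evicting cache of model copies: when admitting a new copy would exceed the budget, the least recently used entry is freed rather than left resident. The sketch below is purely illustrative (class and method names are hypothetical, not webui's actual `sd_models` code) and tracks sizes in GB:

```python
from collections import OrderedDict

class BoundedModelCache:
    """Evict least-recently-used entries so the total size stays under a budget."""

    def __init__(self, budget_gb: float):
        self.budget = budget_gb
        self.entries = OrderedDict()  # name -> size_gb, oldest first

    def put(self, name: str, size_gb: float):
        if name in self.entries:
            self.entries.move_to_end(name)  # refresh recency, no new copy
            return
        # Free older copies before admitting the new one
        while self.entries and sum(self.entries.values()) + size_gb > self.budget:
            evicted, _ = self.entries.popitem(last=False)
            print(f"evicting {evicted}")  # real code would free the RAM copy here
        self.entries[name] = size_gb

cache = BoundedModelCache(budget_gb=8)
cache.put("juggernaut_xl", 6.5)
cache.put("anything_v5", 4.0)    # evicts juggernaut_xl
cache.put("juggernaut_xl", 6.5)  # evicts anything_v5
print(list(cache.entries))       # ['juggernaut_xl']
```

With a budget smaller than two checkpoints, each switch evicts the other model's RAM copy instead of accumulating it, which is the behavior the report asks for.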

What browsers do you use to access the UI?

Mozilla Firefox

Sysinfo

Console logs

* [console_log_sdui_bug.log](https://github.com/AUTOMATIC1111/stable-diffusion-webui/files/13692630/console_log_sdui_bug.log)

Additional information

Running in a container with ROCm 5.7:

version: v1.7.0-2-g007d64b6  •  python: 3.10.12  •  torch: 2.2.0.dev20231120+rocm5.7  

w-e-w commented 9 months ago

I wasn't able to reproduce the issue with a 3090 on Windows; this is possibly an AMD ROCm issue.

Also, I cannot find commit 007d64b66eb8eb95c49bea779fcee1274fe6a2b7, so I'm not sure what version you are using.

DistantThunder commented 9 months ago

Ah yes sorry, I'm using a custom local branch for Docker build. I'm in fact on commit cd45635f537083f7bede39b5cb196d27b5cf2307.

test.webm

w-e-w commented 9 months ago

Is this a new issue in 1.7 or an old issue from 1.6?

DistantThunder commented 9 months ago

I believe the issue was present on 1.6 as well, but 1.6 also had other issues, so it's hard to say reliably.

Aamir3d commented 9 months ago

This issue wasn't present in 1.6; I'm seeing it in 1.7 (local install on Windows with an Nvidia 3060). Checkpoints start filling up all available system RAM. A couple of other users on Reddit are seeing it as well.

Edit - this is intermittent, it happens sometimes. I loaded and unloaded some checkpoints and it didn't happen just now, so unable to provide a log trace.

Edit 2 - Able to reproduce this. Steps:

  1. Load SDXL checkpoint (Juggernaut XL) - generate an image
  2. Load 1.5 / SDXL checkpoint (Anything) - Generate an image
  3. Repeat Steps 1/2 - the application starts filling up all available system RAM each time a checkpoint is loaded/unloaded.

FyzzLive commented 9 months ago

@Aamir3d The install docs were updated; it may be what you're looking for. It was doing the same thing for me, though not when switching models: at each generation it would keep the currently used model in RAM and just keep building up more and more over time, so it would look similar to what you're describing.

If you're on Windows using the auto-install, maybe try pulling the repo again and updating it, or figure out how to get the tcmalloc4 lib on Windows.
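On Linux, webui.sh tries to LD_PRELOAD a tcmalloc build at startup. Whether one is visible to the dynamic linker can be checked with Python's standard library (the two sonames below are the common package names; nothing here is webui-specific):

```python
from ctypes.util import find_library

# find_library consults the dynamic linker (ldconfig on Linux) and
# returns the library's soname if it resolves, else None
results = {name: find_library(name) for name in ("tcmalloc", "tcmalloc_minimal")}
for name, path in results.items():
    print(f"{name}: {path or 'not found'}")
```

If both come back as not found, the preload is skipped and allocation falls back to glibc malloc, which is the configuration the updated install docs address.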

If you're in a container or on Linux, follow these new install instructions:


```shell
# Debian-based:
sudo apt install wget git python3 python3-venv libgl1 libglib2.0-0
# Red Hat-based:
sudo dnf install wget git python3 gperftools-libs libglvnd-glx
# openSUSE-based:
sudo zypper install wget git python3 libtcmalloc4 libglvnd
# Arch-based:
sudo pacman -S wget git python3
```

Aamir3d commented 9 months ago

Thank you @FyzzLive for your comment. I'm on Windows. The problem is that this is intermittent: it happened once, didn't happen again, then it happened yesterday. I wish there were a way to consistently track what's going on here. Restarting or even reinstalling the WebUI isn't a big deal, but when you've got a few extensions installed and configurations set, it becomes problematic.

retouchvolkov commented 9 months ago

The same problem is observed on Debian 12, GeForce 3060, Intel processor.

ZhiYing-Yang commented 8 months ago

I've encountered the same issue. The sd-webui is running in Docker. Switching between LoRA and ControlNet models leads to a memory leak. This issue exists in both versions 1.6 and 1.7. Is there any solution for this?

Nuck-TH commented 8 months ago

I have the same issue as OP on 1.7, with both the release pytorch-rocm build for 5.6 and the latest nightly for 5.7, on Devuan Ceres (Debian Sid) with the latest updates.