[Bug]: Latest updates break model merging

BlackWyvern commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues and checked the recent builds/commits

What happened?

As of the updates two days ago, I am completely unable to merge models. Before, I could run off and test stuff all day without an issue. Moreover, since it appears to be giving me an OOM error, I've been monitoring my VRAM and finding that the memory usage doesn't even move. It's as if it doesn't allocate anything at all. Looking at the error, it seems to be a CPU/RAM issue? But RAM usage doesn't seem to move either.

Merging... 18%|████████████████▎ | 331/1831 [00:01<00:08, 175.52it/s]
Error loading/saving model file:
Traceback (most recent call last):
  File "F:\Downloads\Stable Diffusion\A1 SD\stable-diffusion-webui\modules\ui.py", line 1731, in modelmerger
    results = modules.extras.run_modelmerger(*args)
  File "F:\Downloads\Stable Diffusion\A1 SD\stable-diffusion-webui\modules\extras.py", line 321, in run_modelmerger
    theta_0[key] = theta_func2(a, b, multiplier)
  File "F:\Downloads\Stable Diffusion\A1 SD\stable-diffusion-webui\modules\extras.py", line 256, in weighted_sum
    return ((1 - alpha) * theta0) + (alpha * theta1)
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 4718592 bytes.

[Attached image: Taskmgr_B1DUUwQHSI]

I'm running a 3080 on W10 and have had no issues until these last couple days of commits.

Steps to reproduce the problem

1. Go to checkpoint merger
2. Line up your models any way you please (saving as ckpt or safetensor doesn't seem to make a difference, nor does saving it as f16 or f32)
3. Press run
4. Don't profit

What should have happened?

Received bacon.

Commit where the problem happens

874b975bf8438b2b5ee6d8540d63b2e2da6b8dbd

What platforms do you use to access UI ?

Windows

What browsers do you use to access the UI ?

Mozilla Firefox

Command Line Arguments

--xformers --opt-split-attention --medvram --vae-path "models/Stable-diffusion/Vae Weights/MSESharpening.vae.pt"

Additional information, context and logs

No response

mezotaken commented 1 year ago

1) Why are you monitoring GPU memory when it's your RAM being drained? 2) How much RAM do you have? 3) Which two models are you trying to merge? I can barely merge hassanblend with something normal, but merging pfg with another pfg-based model is impossible for me because of its 7 GB size.

OK, so you checked RAM usage too, but it wouldn't show on the monitor because the memory was never allocated. It tried to allocate the space required to load the second model, and when that turned out to be too much, the exception was thrown. If usage stayed flat from the very beginning to the end, that's very suspicious: at least one model had to fit.

It is totally possible that something was broken, but first we need to make sure that what you're merging now was possible before, and that your available RAM is enough to perform the merge.

BlackWyvern commented 1 year ago

I've got 32G of RAM, of which usually barely over 11 is in use. I've been trying to merge things like AnalogDiffusion (~2G), ArtofMTG (~2G), Anything3 (~2G), SD1.5 (~4G), and FurryE15/18 (~4G), all of which I've been able to merge successfully since merging was a thing.
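For a rough sanity check on those numbers, here is a back-of-the-envelope sketch. It assumes (this is not confirmed from the webui code) that both input checkpoints stay in RAM during the merge and that the merged result is about as large as the larger input. The function name `estimate_peak_gib` is made up for illustration.

```python
def estimate_peak_gib(model_a_gib: float, model_b_gib: float,
                      already_in_use_gib: float = 0.0) -> float:
    """Rough upper estimate of RAM needed during a two-model merge, in GiB.

    Assumption: both inputs are resident at once, plus one merged copy
    roughly the size of the larger input. Temporaries are ignored.
    """
    merged_gib = max(model_a_gib, model_b_gib)
    return already_in_use_gib + model_a_gib + model_b_gib + merged_gib


# Figures quoted in this thread: two ~4G models, ~11G already in use.
print(estimate_peak_gib(4.0, 4.0, 11.0))  # 23.0
```

Under those assumptions the largest merge listed here would peak at around 23G, which is below 32G of physical RAM.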

mezotaken commented 1 year ago

Yep, definitely something wrong then. 32G should be more than enough. I can't reproduce it yet, sadly. As I said, I merged hassanblend (~6 GB) with sd1-5 (~4 GB) and it was fine on the latest commit. Can you send your RAM usage monitor readings? Or just tell me whether they're completely flat through the entire process, from start to exception? Oh, by the way, I just noticed it happens in the middle of the process. That's weird.

BlackWyvern commented 1 year ago

Merging... 53%|████████████████████████████████████████████████ | 966/1831 [00:10<00:09, 88.45it/s]
Error loading/saving model file:
Traceback (most recent call last):
  File "F:\Downloads\Stable Diffusion\A1 SD\stable-diffusion-webui\modules\ui.py", line 1758, in modelmerger
    results = modules.extras.run_modelmerger(*args)
  File "F:\Downloads\Stable Diffusion\A1 SD\stable-diffusion-webui\modules\extras.py", line 321, in run_modelmerger
    theta_0[key] = theta_func2(a, b, multiplier)
  File "F:\Downloads\Stable Diffusion\A1 SD\stable-diffusion-webui\modules\extras.py", line 256, in weighted_sum
    return ((1 - alpha) * theta0) + (alpha * theta1)
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 26214400 bytes.

[Attached image: Taskmgr_aOy028JfFz]

It initially merged two models correctly, but after running a couple of txt2img and img2img generations, it immediately went back to failing.
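To put the sizes in these error messages in perspective, here is a minimal sketch that pulls the byte count out of a `DefaultCPUAllocator` message and converts it to MiB. The helper name `failed_allocation_mib` is made up for illustration; the two message strings are copied from the tracebacks in this thread.

```python
import re

# Matches the byte count in the allocator message quoted in this thread.
_ALLOC_RE = re.compile(r"you tried to allocate (\d+) bytes")


def failed_allocation_mib(message: str) -> float:
    """Return the size of the failed allocation in MiB."""
    match = _ALLOC_RE.search(message)
    if match is None:
        raise ValueError("no allocation size found in message")
    return int(match.group(1)) / (1024 * 1024)


first = "DefaultCPUAllocator: not enough memory: you tried to allocate 4718592 bytes."
second = "DefaultCPUAllocator: not enough memory: you tried to allocate 26214400 bytes."

print(failed_allocation_mib(first))   # 4.5
print(failed_allocation_mib(second))  # 25.0
```

Both failed requests are only a few MiB each, so the number in the message is the size of the one allocation that failed, not the total memory the merge needed.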

BlackWyvern commented 1 year ago

Error loading/saving model file:
Traceback (most recent call last):
  File "F:\Downloads\Stable Diffusion\A1 SD\stable-diffusion-webui\modules\ui.py", line 1689, in modelmerger
    results = modules.extras.run_modelmerger(*args)
  File "F:\Downloads\Stable Diffusion\A1 SD\stable-diffusion-webui\modules\extras.py", line 379, in run_modelmerger
    safetensors.torch.save_file(theta_0, output_modelname, metadata={"format": "pt"})
  File "F:\Downloads\Stable Diffusion\A1 SD\stable-diffusion-webui\venv\lib\site-packages\safetensors\torch.py", line 71, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "F:\Downloads\Stable Diffusion\A1 SD\stable-diffusion-webui\venv\lib\site-packages\safetensors\torch.py", line 236, in _flatten
    return {
  File "F:\Downloads\Stable Diffusion\A1 SD\stable-diffusion-webui\venv\lib\site-packages\safetensors\torch.py", line 240, in <dictcomp>
    "data": _tobytes(v, k),
  File "F:\Downloads\Stable Diffusion\A1 SD\stable-diffusion-webui\venv\lib\site-packages\safetensors\torch.py", line 210, in _tobytes
    return data.tobytes()
MemoryError

Getting this new error on 45a8b758a7bcb144242aee710dfcd1aedcf30b7f

If it merges, which is still a big if, it then gives this error and halts. It only seems to apply to safetensors, though. When it feels like merging, it can save as a checkpoint fine.

I've done a full clean reinstall of my NVIDIA drivers, rebuilt the repo, and run both memtest and the Windows memory check.

BlackWyvern commented 1 year ago

A tentative fix so far: apparently my pagefile was to blame. Windows did that thing where it makes a pagefile and promptly forgets to use it. I deleted it and refreshed the settings, and so far I've been able to merge several models in both ckpt and safetensors format.

Will keep an eye on it for the moment.

possibleimprobable commented 1 year ago

@BlackWyvern I'm having the same issue. Where is the pagefile so I can delete it, and does that refresh the settings, or how else would I do that afterward?

BlackWyvern commented 1 year ago

(I'm on W10, so YMMV.) On the Start menu, right-click This PC, scroll down to Advanced system settings, then Advanced tab, Performance, Advanced, Virtual memory.

Set everything to none, restart, go back in, set it to system managed, or give it a few gigs manually, restart again.

It's a Windows thing and doesn't affect Auto.
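To check whether the page file is contributing after the restart, one option is to compare the system commit limit with physical RAM. This is a minimal sketch that calls the Win32 `GlobalMemoryStatusEx` function through `ctypes`. The function name `read_commit_status` and the dictionary keys are made up for illustration, and it returns `None` on anything other than Windows.

```python
import ctypes
import sys


class MEMORYSTATUSEX(ctypes.Structure):
    """Mirror of the Win32 MEMORYSTATUSEX structure."""
    _fields_ = [
        ("dwLength", ctypes.c_uint32),
        ("dwMemoryLoad", ctypes.c_uint32),
        ("ullTotalPhys", ctypes.c_uint64),
        ("ullAvailPhys", ctypes.c_uint64),
        ("ullTotalPageFile", ctypes.c_uint64),   # commit limit
        ("ullAvailPageFile", ctypes.c_uint64),   # commit still available
        ("ullTotalVirtual", ctypes.c_uint64),
        ("ullAvailVirtual", ctypes.c_uint64),
        ("ullAvailExtendedVirtual", ctypes.c_uint64),
    ]


def read_commit_status():
    """Return physical RAM and commit figures in bytes, or None off Windows."""
    if sys.platform != "win32":
        return None
    status = MEMORYSTATUSEX()
    status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
    if not ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status)):
        raise ctypes.WinError()
    return {
        "total_phys": status.ullTotalPhys,
        "avail_phys": status.ullAvailPhys,
        "commit_limit": status.ullTotalPageFile,
        "commit_avail": status.ullAvailPageFile,
    }


result = read_commit_status()
if result is None:
    print("Not running on Windows; nothing to report.")
else:
    gib = 1024 ** 3
    for name, value in result.items():
        print(f"{name}: {value / gib:.1f} GiB")
```

If `commit_limit` is roughly equal to `total_phys`, the page file is probably adding little or nothing. A commit limit noticeably above physical RAM suggests the page file is active.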

Bizori commented 1 year ago

I'm having the same issue too.

D4di69 commented 1 year ago

same issue

SilverRider76 commented 1 year ago

As of today (I've upgraded) I'm getting the same error. Until yesterday it worked perfectly!

AUTOMATIC1111 / stable-diffusion-webui