Panchovix / stable-diffusion-webui-reForge

GNU Affero General Public License v3.0

Lora High Vram Load #51

Closed Lesteriax closed 3 months ago

Lesteriax commented 3 months ago

Checklist

What happened?

I noticed after the last git pull that when I load any lora, my GPU memory usage doubles compared to a couple of days ago. I usually keep two models loaded in VRAM and used to generate normally with no issues, but after the last git pull I started getting out-of-memory errors.

I will provide three pictures below to show how the GPU memory behaves:

1- A fresh start: one model loaded, no loras used. Here I can generate, and GPU memory usage automatically reverts back to normal after cleanup.

2- After generating with one lora added. Notice that GPU memory usage has doubled and did not revert back after the generation.

3- After I removed the lora, GPU memory usage dropped back down significantly.
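For reference, a rough way to capture these GPU memory numbers while reproducing the issue (a minimal sketch, assuming a Windows shell and that nvidia-smi is on the PATH; the log file name is arbitrary):

@echo off
rem Hypothetical helper, not part of reForge: append GPU memory usage to a CSV every 2 seconds.
rem Run it in a separate terminal before, during, and after generating with and without the lora.
:loop
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv,noheader >> vram_log.csv
timeout /t 2 /nobreak > nul
goto loop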

Steps to reproduce the problem

.

What should have happened?

.

What browsers do you use to access the UI?

No response

Sysinfo

.

Console logs

model_type EPS
UNet ADM Dimension 2816
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 48.3s (prepare environment: 11.5s, import torch: 3.4s, import gradio: 0.9s, setup paths: 1.0s, initialize shared: 0.2s, other imports: 0.6s, load scripts: 3.0s, create ui: 2.6s, gradio launch: 24.5s, add APIs: 0.5s).
Using pytorch attention in VAE
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
Using pytorch attention in VAE
extra {'cond_stage_model.clip_l.text_projection', 'cond_stage_model.clip_l.logit_scale', 'cond_stage_model.clip_g.transformer.text_model.embeddings.position_ids'}
loaded straight to GPU
To load target model SDXL
Begin to load 1 model
Moving model(s) has taken 0.04 seconds
To load target model SDXLClipModel
Begin to load 1 model
Moving model(s) has taken 0.61 seconds
Model loaded in 32.4s (load weights from disk: 0.6s, forge load real models: 30.8s, calculate empty prompt: 0.8s).
[LORA] Loaded D:\ai-webuis\stable-diffusion-webui-reForge\models\Lora\SDXL\3-Keep\add-detail-xl.safetensors for SDXL-UNet with 722 keys at weight 1.0 (skipped 0 keys)
[LORA] Loaded D:\ai-webuis\stable-diffusion-webui-reForge\models\Lora\SDXL\3-Keep\add-detail-xl.safetensors for SDXL-CLIP with 264 keys at weight 1.0 (skipped 0 keys)
To load target model SDXLClipModel
Begin to load 1 model
Reuse 1 loaded models
Moving model(s) has taken 0.48 seconds
To load target model SDXL
Begin to load 1 model
Reuse 1 loaded models
Moving model(s) has taken 0.84 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:05<00:00,  3.67it/s]
To load target model AutoencoderKL█████████████████████████████████████████████████████| 20/20 [00:04<00:00,  3.83it/s]
Begin to load 1 model
Moving model(s) has taken 0.06 seconds
Total progress: 100%|██████████████████████████████████████████████████████████████████| 20/20 [00:05<00:00,  3.43it/s]
To load target model SDXLClipModel█████████████████████████████████████████████████████| 20/20 [00:05<00:00,  3.83it/s]
Begin to load 1 model
Reuse 1 loaded models
Moving model(s) has taken 0.01 seconds
To load target model SDXL
Begin to load 1 model
Reuse 1 loaded models
Moving model(s) has taken 0.04 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:05<00:00,  3.82it/s]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 20/20 [00:05<00:00,  3.49it/s]
[LORA] Loaded D:\ai-webuis\stable-diffusion-webui-reForge\models\Lora\SDXL\3-Keep\add-detail-xl.safetensors for SDXL-UNet with 722 keys at weight 1.0 (skipped 0 keys)
[LORA] Loaded D:\ai-webuis\stable-diffusion-webui-reForge\models\Lora\SDXL\3-Keep\add-detail-xl.safetensors for SDXL-CLIP with 264 keys at weight 1.0 (skipped 0 keys)
To load target model SDXLClipModel
Begin to load 1 model
Reuse 1 loaded models
Moving model(s) has taken 0.38 seconds
To load target model SDXL
Begin to load 1 model
Reuse 1 loaded models
Moving model(s) has taken 0.38 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:05<00:00,  3.85it/s]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 20/20 [00:05<00:00,  3.52it/s]
To load target model SDXLClipModel█████████████████████████████████████████████████████| 20/20 [00:05<00:00,  3.84it/s]
Begin to load 1 model
Reuse 1 loaded models
Moving model(s) has taken 0.01 seconds
To load target model SDXL
Begin to load 1 model
Reuse 1 loaded models
Moving model(s) has taken 0.04 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:05<00:00,  3.85it/s]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 20/20 [00:05<00:00,  3.49it/s]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 20/20 [00:05<00:00,  3.85it/s]

Additional information

No response

Panchovix commented 3 months ago

Hi there, thanks for the report. Are you on the main or dev_upstream branch? Can you send me the commit where the high VRAM load didn't happen?

Lesteriax commented 3 months ago

Hey @Panchovix, after extensive git checkouts without being able to reproduce it, I figured out the issue was on my end with the command-line arguments.

The issue happened when using these arguments: --always-gpu --disable-nan-check --disable-xformers --attention-pytorch

The arguments I used previously, which worked perfectly: --xformers --always-gpu --disable-nan-check --cuda-malloc --cuda-stream --pin-shared-memory

Not sure if this information helps, but I was able to reproduce the issue by switching between these two sets of arguments.
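For clarity, the two configurations in webui-user.bat form (a sketch; only the COMMANDLINE_ARGS line is shown, the rest of the file is assumed unchanged):

rem Arguments that triggered the doubled VRAM load:
rem set COMMANDLINE_ARGS=--always-gpu --disable-nan-check --disable-xformers --attention-pytorch
rem Arguments that previously worked without the issue:
set COMMANDLINE_ARGS=--xformers --always-gpu --disable-nan-check --cuda-malloc --cuda-stream --pin-shared-memory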

Thank you

Panchovix commented 3 months ago

Maybe something related to SDPA or another cross-attention optimization is causing the issue, but glad you could figure it out. I'll keep it noted in case it happens again.

Thanks for the update!

mweldon commented 3 months ago

I have this issue and was about to post a new issue about it. When I use a Pony model and a Lora, VRAM usage shoots up and everything slows way down. I tried the exact same parameters and models on the latest Auto1111 and this does not happen. I can only reproduce it with Pony models, which seems... strange? It's 100% reproducible, though.

Fails with https://civitai.com/models/458760/bemypony and any Lora. Works fine without a Lora, or with https://civitai.com/models/299933?modelVersionId=638622 with or without a Lora.

Positive: score_9, score_8_up, score_7_up, score_6_up, a medium closeup color portrait photo of mwlexi wearing a bra on a greek island
Negative: score_6, score_5, score_4, worst quality, low quality, ugly, deformed
Steps: 30, CFG: 8, DPM++ 2M Automatic, 896x1152
No fancy stuff turned on.

Cmd line: set COMMANDLINE_ARGS=--api --xformers --always-gpu --disable-nan-check --cuda-stream --pin-shared-memory --cuda-malloc

This is with the dev-upstream branch.

Panchovix commented 3 months ago

@mweldon

It is interesting that it happens only on Pony models. What GPU do you have? If it has 12GB of VRAM, I think --pin-shared-memory and --always-gpu do more harm than good in this case, since they use a lot more VRAM to avoid moving the model around (and A1111 doesn't have equivalent args for this).

Wondering, do you get that issue on the main branch? dev_upstream also has somewhat different model management, which comes from the comfy upstream changes.
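In webui-user.bat terms, the suggestion above is roughly the following (a sketch, not a confirmed fix; which of the two flags matters most here is an assumption to test):

rem Current flags from the comment above:
rem set COMMANDLINE_ARGS=--api --xformers --always-gpu --disable-nan-check --cuda-stream --pin-shared-memory --cuda-malloc
rem Variant without the flags that keep everything pinned in VRAM:
set COMMANDLINE_ARGS=--api --xformers --disable-nan-check --cuda-stream --cuda-malloc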

mweldon commented 3 months ago

Removing --always-gpu fixes it. Thanks.

Also, I noticed that Comfy has the same issue, so I wonder if there's some command-line flag I need to change there too.