Panchovix / stable-diffusion-webui-reForge

GNU Affero General Public License v3.0

[Bug]: Poor SDXL/hi-resolution performance since fd78b06 #74

Closed StylusEcho closed 1 month ago

StylusEcho commented 1 month ago

What happened?

Since the above commit on dev_upstream, generation at high resolutions (e.g. 2048x2048) with SDXL is significantly slower, and regular VAE decoding runs out of memory and has to fall back on tiled decoding.

When I try the commit right before that one, these issues are gone.

I looked at the console, and one of the issues seems to be that AutoencoderKL now uses double the memory it used to, unless I'm misreading it. For the hires fix denoising pass at 2048x2048, iteration speed went from 1.75s/it to 3.19s/it, and the overall speed of the job went from 1.83s/it to 5.89s/it! Holy moley...
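A quick sanity check on the numbers from the console logs below (values taken verbatim from the `[Memory Management] Model Memory` lines of the two commits):

```shell
# Ratio of the AutoencoderKL "Model Memory" values logged by the two commits;
# exactly 2x is what you'd expect if the VAE weights went from fp16 to fp32.
awk 'BEGIN { printf "%.4f\n", 319.11416244506836 / 159.55708122253418 }'
# prints 2.0000
```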

Steps to reproduce the problem

  1. Checkout dev_upstream or commit fd78b06bf17a92fa3c96e17994c442b96409f77a
  2. Start reForge
  3. Run a test generation with SDXL and Hires fix (final resolution 2048x2048)
  4. Shut down reForge in console but leave browser page open
  5. Checkout 5a53f0364b301671e783c32535ae048391b48b4b
  6. Start reForge again and wait for it to become ready
  7. Open the previous browser page and hit generate again
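The steps above, sketched as shell commands (assuming a Linux-style setup; on Windows, launch via webui-user.bat instead of webui.sh):

```shell
# Compare the suspect commit against the one just before it.
git checkout fd78b06bf17a92fa3c96e17994c442b96409f77a   # commit with the regression
./webui.sh    # run a 2048x2048 SDXL + Hires fix generation, note the s/it

git checkout 5a53f0364b301671e783c32535ae048391b48b4b   # commit right before it
./webui.sh    # repeat the same generation and compare speed / VRAM behavior
```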

What should have happened?

Iteration speed and VRAM usage should have improved, not gotten worse.

What browsers do you use to access the UI?

Mozilla Firefox, Microsoft Edge

Sysinfo

sysinfo.txt

Console logs

---- Commit fd78b06bf17a92fa3c96e17994c442b96409f77a ----

[Memory Management] Current Free GPU Memory (MB) =  8971.06884765625
[Memory Management] Model Memory (MB) =  4897.086494445801
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  3049.982353210449
Moving model(s) has taken 2.08 seconds
100%|██████████| 25/25 [00:11<00:00,  2.19it/s]
To load target model AutoencoderKL
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  8970.58349609375
[Memory Management] Model Memory (MB) =  319.11416244506836
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  7627.469333648682
Moving model(s) has taken 1.38 seconds
Cleanup minimal inference memory.
tiled upscale: 100%|██████████| 25/25 [00:04<00:00,  5.81it/s]
To load target model AutoencoderKL
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  8824.82568359375
[Memory Management] Model Memory (MB) =  319.11416244506836
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  7481.711521148682
Moving model(s) has taken 0.07 seconds
To load target model SDXL
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  8918.8251953125
[Memory Management] Model Memory (MB) =  4897.086494445801
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  2997.738700866699
Moving model(s) has taken 1.31 seconds
100%|██████████| 25/25 [01:19<00:00,  3.19s/it]
To load target model AutoencoderKL
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  8915.8251953125
[Memory Management] Model Memory (MB) =  319.11416244506836
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  7572.711032867432
Moving model(s) has taken 1.85 seconds
WARNING:root:Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.
postprocess_batch

0: 640x640 1 face, 6.4ms
Speed: 6.0ms preprocess, 6.4ms inference, 61.8ms postprocess per image at shape (1, 3, 640, 640)
WARNING:root:Sampler Scheduler autocorrection: "Euler a" -> "Euler a", "None" -> "Automatic"
Cleanup minimal inference memory.
tiled upscale: 100%|██████████| 9/9 [00:01<00:00,  6.41it/s]
To load target model AutoencoderKL
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  8923.84228515625
[Memory Management] Model Memory (MB) =  319.11416244506836
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  7580.728122711182
Moving model(s) has taken 0.06 seconds
To load target model SDXLClipModel
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  8959.091796875
[Memory Management] Model Memory (MB) =  2144.3546981811523
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  5790.737098693848
Moving model(s) has taken 0.52 seconds
To load target model SDXL
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  8930.9541015625
[Memory Management] Model Memory (MB) =  4897.086494445801
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  3009.867607116699
Moving model(s) has taken 1.49 seconds
100%|██████████| 13/13 [00:05<00:00,  2.45it/s]
To load target model AutoencoderKL
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  8956.46875
[Memory Management] Model Memory (MB) =  319.11416244506836
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  7613.354587554932
Moving model(s) has taken 1.26 seconds
Total progress: 100%|██████████| 50/50 [04:54<00:00,  5.89s/it]

---- Commit 5a53f0364b301671e783c32535ae048391b48b4b ----

[Memory Management] Current Free GPU Memory (MB) =  8971.06884765625
[Memory Management] Model Memory (MB) =  4897.086494445801
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  3049.982353210449
Moving model(s) has taken 2.03 seconds
100%|██████████| 25/25 [00:09<00:00,  2.63it/s]
To load target model AutoencoderKL
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  8968.58349609375
[Memory Management] Model Memory (MB) =  159.55708122253418
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  7785.026414871216
Moving model(s) has taken 1.33 seconds
Cleanup minimal inference memory.
tiled upscale: 100%|██████████| 25/25 [00:04<00:00,  5.75it/s]
To load target model AutoencoderKL
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  8864.82568359375
[Memory Management] Model Memory (MB) =  159.55708122253418
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  7681.268602371216
Moving model(s) has taken 0.05 seconds
To load target model SDXL
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  8911.3251953125
[Memory Management] Model Memory (MB) =  4897.086494445801
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  2990.238700866699
Moving model(s) has taken 1.22 seconds
100%|██████████| 25/25 [00:43<00:00,  1.75s/it]
To load target model AutoencoderKL
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  8907.8251953125
[Memory Management] Model Memory (MB) =  159.55708122253418
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  7724.268114089966
Moving model(s) has taken 1.56 seconds
postprocess_batch

0: 640x640 1 face, 6.5ms
Speed: 6.6ms preprocess, 6.5ms inference, 58.9ms postprocess per image at shape (1, 3, 640, 640)
WARNING:root:Sampler Scheduler autocorrection: "Euler a" -> "Euler a", "None" -> "Automatic"
Cleanup minimal inference memory.
tiled upscale: 100%|██████████| 9/9 [00:01<00:00,  5.94it/s]
To load target model AutoencoderKL
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  8927.1533203125
[Memory Management] Model Memory (MB) =  159.55708122253418
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  7743.596239089966
Moving model(s) has taken 0.05 seconds
To load target model SDXLClipModel
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  8950.52783203125
[Memory Management] Model Memory (MB) =  2144.3546981811523
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  5782.173133850098
Moving model(s) has taken 0.51 seconds
To load target model SDXL
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  8922.39013671875
[Memory Management] Model Memory (MB) =  4897.086494445801
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  3001.303642272949
Moving model(s) has taken 1.66 seconds
100%|██████████| 13/13 [00:04<00:00,  2.72it/s]
To load target model AutoencoderKL
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  8947.65478515625
[Memory Management] Model Memory (MB) =  159.55708122253418
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  7764.097703933716
Moving model(s) has taken 1.29 seconds
Total progress: 100%|██████████| 50/50 [01:31<00:00,  1.83s/it]

Additional information

The test generation had ADetailer turned on. I have recently enabled the --cuda-stream and --pin-shared-memory flags, but the issue was already happening prior to that.

Panchovix commented 1 month ago

Hi there, thanks for the report. Did you try running with --vae-in-fp16 or --vae-in-bf16? If neither is specified, the VAE now defaults to float32 (for more compatibility with other devices). On an NVIDIA GPU (you seem to have an RTX 3080), either flag should work.

Can you try with either of these flags?
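For reference, one way to pass the suggested flags on launch (on Windows you would instead add them to COMMANDLINE_ARGS in webui-user.bat):

```shell
# Launch reForge with the VAE forced to fp16 instead of the new fp32 default.
./webui.sh --vae-in-fp16
# or, on cards with bf16 support:
./webui.sh --vae-in-bf16
```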

rcas45 commented 1 month ago

Can confirm: https://github.com/Panchovix/stable-diffusion-webui-reForge/commit/fd78b06bf17a92fa3c96e17994c442b96409f77a is ~20% slower than https://github.com/Panchovix/stable-diffusion-webui-reForge/commit/5a53f0364b301671e783c32535ae048391b48b4b with my default settings. The --vae-in-fp16 and --vae-in-bf16 flags help a little with the tiling at the end, but generation speed still suffers.

Panchovix commented 1 month ago

Okay, I pushed a new commit reverting a lot of that update:

https://github.com/Panchovix/stable-diffusion-webui-reForge/commit/6632a9dc6f734677e30a24929d7f5a7057687996

Can you check whether performance is back to expected now?

rcas45 commented 1 month ago

I did try checking out https://github.com/Panchovix/stable-diffusion-webui-reForge/commit/6632a9dc6f734677e30a24929d7f5a7057687996, but got some errors in the console (with and without any flags):

Version: f0.0.20.1dev-v1.10.0RC-latest-828-g6632a9dc
Commit hash: 6632a9dc6f734677e30a24929d7f5a7057687996

Traceback (most recent call last):
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\launch.py", line 51, in <module>
    main()
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\launch.py", line 47, in main
    start()
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\modules\launch_utils.py", line 542, in start
    import webui
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\webui.py", line 19, in <module>
    initialize.imports()
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\modules\initialize.py", line 53, in imports
    from modules import processing, gradio_extensons, ui  # noqa: F401
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\modules\processing.py", line 18, in <module>
    import modules.sd_hijack
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\modules\sd_hijack.py", line 5, in <module>
    from modules import devices, sd_hijack_optimizations, shared, script_callbacks, errors, sd_unet, patches
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\modules\sd_hijack_optimizations.py", line 13, in <module>
    from modules.hypernetworks import hypernetwork
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\modules\hypernetworks\hypernetwork.py", line 8, in <module>
    import modules.textual_inversion.dataset
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\modules\textual_inversion\dataset.py", line 12, in <module>
    from modules import devices, shared, images
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\modules\images.py", line 24, in <module>
    from modules import sd_samplers, shared, script_callbacks, errors
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\modules\sd_samplers.py", line 5, in <module>
    from modules import sd_samplers_kdiffusion, sd_samplers_timesteps, sd_samplers_lcm, shared, sd_samplers_common, sd_schedulers
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\modules\sd_samplers_kdiffusion.py", line 4, in <module>
    from modules import sd_samplers_common, sd_samplers_extra, sd_samplers_cfg_denoiser, sd_schedulers
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\modules\sd_samplers_common.py", line 6, in <module>
    from modules import devices, images, sd_vae_approx, sd_samplers, sd_vae_taesd, shared, sd_models
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\modules\sd_models.py", line 19, in <module>
    from modules_forge import forge_loader
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\modules_forge\forge_loader.py", line 5, in <module>
    from ldm_patched.modules import model_detection
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\ldm_patched\modules\model_detection.py", line 5, in <module>
    import ldm_patched.modules.supported_models
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\ldm_patched\modules\supported_models.py", line 5, in <module>
    from . import model_base
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\ldm_patched\modules\model_base.py", line 6, in <module>
    from ldm_patched.ldm.modules.diffusionmodules.openaimodel import UNetModel, Timestep
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\ldm_patched\ldm\modules\diffusionmodules\openaimodel.py", line 23, in <module>
    from ..attention import SpatialTransformer, SpatialVideoTransformer, default
  File "C:\Users\cordm\Desktop\stable-diffusion-webui-reForge\ldm_patched\ldm\modules\attention.py", line 30, in <module>
    FORCE_UPCAST_ATTENTION_DTYPE = model_management.force_upcast_attention_dtype()
AttributeError: module 'ldm_patched.modules.model_management' has no attribute 'force_upcast_attention_dtype'

Panchovix commented 1 month ago

Okay, give me some minutes. I will probably revert that last commit and the rest of the update that reduced performance.

Panchovix commented 1 month ago

Okay, I have pushed a commit that restores the model management and sd code to basically the state it was in before the problematic update:

https://github.com/Panchovix/stable-diffusion-webui-reForge/commit/e13096344c4b361163f6a85694103a667e791c87

Can you guys check if it's ok now?

rcas45 commented 1 month ago

https://github.com/Panchovix/stable-diffusion-webui-reForge/commit/e13096344c4b361163f6a85694103a667e791c87 works for me now as intended

Panchovix commented 1 month ago

Perfect. Closing the issue now. Even so, I will have to figure out how to apply some sd.py changes from comfy upstream to this branch, since they will be needed to support new models.

Panchovix commented 1 month ago

Hi there guys, I have tried again to update sd.py (to maybe support newer models).

Can you check whether you get bad performance on SDXL/Hi-res with this one? Not in my case, but maybe my workflow is different.

https://github.com/Panchovix/stable-diffusion-webui-reForge/commit/2403d91983bd4b22ad7cb2b721a35bcb159a4328
https://github.com/Panchovix/stable-diffusion-webui-reForge/commit/ae432b18deb61f3b97270eb6a53cacaab95e3d3c

StylusEcho commented 1 month ago

Seems fine on my end.

Panchovix commented 1 month ago

Perfect, thanks for the confirmation.

So it is plausible that the performance degradation actually came from changes to how the VAE is managed in comfy upstream (which aren't in reForge).