comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
https://www.comfy.org/

Slowness with LoRA and ControlNet (any ControlNet model) for Flux #4757

Open axel578 opened 1 week ago

axel578 commented 1 week ago

Expected Behavior

No excessive VRAM usage, and no extreme slowness with ControlNet.

Actual Behavior

Technical details: latest version of ComfyUI, RTX 3090, PyTorch 2.1, CUDA 12.1, Windows.

I currently use ComfyUI in production and this is really blocking: using multiple LoRAs of rank above 32 on top of Flux is extremely VRAM-hungry, and applying any ControlNet with ControlNetApplyAdvanced (or even the SD3/HunyuanDiT node) is extremely slow.

ComfyUI is currently not stable with my configuration (Windows is not a choice on my end).

In case it matters, using GGUF doesn't help at all, since generation is 1.8 times slower and ControlNet support does not work for all models.

Steps to Reproduce

Technical details: latest version of ComfyUI, RTX 3090, PyTorch 2.1, CUDA 12.1, Windows.

Just use any ControlNet, or stack several high-rank LoRAs.

Debug Logs

None

Other

None

comfyanonymous commented 1 week ago

Update your PyTorch to at least 2.3 and your NVIDIA drivers to the latest.
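
If you want to double-check what you are actually running, something like this prints the relevant versions (a minimal sketch using standard PyTorch calls; run it in the same Python environment that launches ComfyUI):

```python
# Minimal environment check: reports the PyTorch build, the CUDA version it was
# built against, and the GPU that PyTorch sees. Run inside ComfyUI's environment.
import torch

print("PyTorch:", torch.__version__)        # should be 2.3.x or newer
print("CUDA (build):", torch.version.cuda)  # CUDA toolkit the wheel was built with
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("GPU:", props.name)
    print("VRAM:", round(props.total_memory / 1024**3, 1), "GiB")
else:
    print("CUDA not available - check the driver install")
```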

axel578 commented 1 week ago

Update your PyTorch to at least 2.3 and your NVIDIA drivers to the latest.

I updated to 2.3.1 and the latest driver, and the exact same issue occurs: very high VRAM usage (8 GB for a rank-64 LoRA) and extremely slow generation with ControlNet.
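
For scale, the LoRA weights themselves should be nowhere near 8 GB. A rough back-of-the-envelope sketch (the layer width and layer count below are assumptions roughly in Flux's ballpark, not measured values):

```python
# Back-of-the-envelope size of a rank-64 LoRA (assumed numbers, not measured):
# each adapted linear layer of shape (d_out, d_in) adds rank * (d_in + d_out)
# parameters for its A and B matrices.
rank = 64
d_in = d_out = 3072        # assumed layer width, roughly Flux-sized
num_layers = 300           # assumed count of adapted linear layers
bytes_per_param = 2        # fp16 storage

total_params = rank * (d_in + d_out) * num_layers
print(f"~{total_params * bytes_per_param / 1024**2:.0f} MiB")  # ~225 MiB
```

Even with generous assumptions that is a few hundred MiB, which suggests the extra gigabytes come from how the weights are patched at load time rather than from the adapter weights themselves.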

comfyanonymous commented 1 week ago

Are you sure? Try downloading the latest standalone package from the readme.

vivek-kumar-poddar commented 1 week ago

I downloaded the standalone package and updated everything, and now it's 34.99 s/it... previously it was 1.24 it/s. My system has a 4070 Ti, 64 GB DDR5 RAM, and a Core i7-14700K. How do I revert to the previous version?

I also noticed that loading more than one LoRA file increases the generation time by 10 to 15 seconds per iteration.

JunesiPhone commented 1 week ago

2080 Ti 11 GB, Windows. I made sure PyTorch and the NVIDIA driver were updated, reinstalled from the readme, and got the same result as my updated install.

I've had this issue for a little while now. I wish I knew which update changed it, but I didn't keep track (the Comfy version number isn't in plain sight). If I had to guess, I'd say it was within the last three updates: it didn't come with this last one and didn't come with the one before it, so it was before that. Same story as the rest: I had no problems generating images with Flux and a LoRA, but now adding one LoRA kills it. Roughly 14 minutes for a single image. It does work, just very, very slowly.

I looked at the issues and saw this was reported 4 or 5 days ago, so I've just been patient. As a dev myself I recognize this "Are you sure?", so I'm just letting you know there are others who have followed your steps and this odd issue persists.

op7418 commented 1 week ago

I'm experiencing the same issue. The official ControlNet workflow runs fine with some VRAM to spare. However, as soon as I add an 18M LoRA to the workflow, the VRAM usage immediately explodes.

Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 22.47 GiB
Requested : 72.00 MiB
Device limit : 23.99 GiB
Free (according to CUDA) : 0 bytes
PyTorch limit (set by user-supplied memory fraction) : 17179869184.00 GiB
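
For anyone trying to make sense of those numbers, PyTorch exposes the same counters at runtime; a minimal sketch using standard torch.cuda calls (device 0 assumed):

```python
# Print the memory counters that appear in the OOM message, using standard
# torch.cuda calls (device 0 assumed).
import torch

free, total = torch.cuda.mem_get_info(0)    # free/total VRAM as the driver reports it
allocated = torch.cuda.memory_allocated(0)  # bytes currently held by live tensors
reserved = torch.cuda.memory_reserved(0)    # bytes held by PyTorch's caching allocator

gib = 1024 ** 3
print(f"free {free / gib:.2f} / total {total / gib:.2f} GiB")
print(f"allocated {allocated / gib:.2f} GiB, reserved {reserved / gib:.2f} GiB")
```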

comfyanonymous commented 1 week ago

Can you check if things have improved on the latest commit?

sabutay commented 1 week ago

I have the same problem: I can't use two LoRAs at the same time, and it slows down a lot on a 4070 Ti Super. With one LoRA it is also slower than normal. I'm using flux.dev16.

axel578 commented 1 week ago

Can you check if things have improved on the latest commit?

[screenshot attached]

It's the exact same issue on the latest commit you did for the fp8 LoRA. I used your 2.0 version; the exact same issue, maybe even a little slower.

ltdrdata commented 1 week ago

The issue you're experiencing is related to shared memory. The best solution is to configure your GPU driver so it does not fall back to shared system memory when VRAM runs out. If that isn't possible, you should use the --disable-smart-memory option to minimize VRAM usage. The next option to consider is the --reserve-memory option.
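
If it helps, a minimal sketch of trying the first flag (this assumes you start ComfyUI by running main.py with the same Python environment; check `python main.py --help` for the exact flag names in your build, including the reserve-memory option mentioned above):

```python
# Minimal sketch (not the official launcher): start ComfyUI with the
# memory-saving flag discussed above. Assumes main.py is in the current
# directory and this interpreter is the environment ComfyUI normally uses.
import subprocess
import sys

subprocess.run([sys.executable, "main.py", "--disable-smart-memory"])
```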

Stoobs commented 1 week ago

I was having horrendous slowdown issues with the previous portable release, sometimes multiple minutes per iteration, which made batch running impossible. However, updating to the latest release, v0.2.2, with its updated PyTorch (cu124), has me back down to 2.6 s/it.

7950X, 64 GB DDR5, RTX 3080 10 GB.

Might fix others' issues too?