comfyanonymous / ComfyUI_bitsandbytes_NF4

Much slower with flux1-dev-bnb-nf4-v2 #35

Open · Danamir opened 1 month ago

Danamir commented 1 month ago

lllyasviel uploaded a v2 version of the Flux NF4 checkpoint, with the differences explained here.

On Forge the performance and outputs are the same with both versions, but with this custom node the performance is far worse with v2, i.e. 2 to 3 times slower.

I don't know if the node has to be updated for the "chunk 64 norm now stored in full precision float32" (sic.). I don't see a big difference in outputs, so for now I'm sticking with v1.

I'm running ComfyUI with the --cuda-malloc option only.
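For reference, my understanding of what that v2 change means in bitsandbytes terms, as a minimal sketch (assuming the standard `quantize_4bit` API and a CUDA device; this is not the node's actual loading code):

```python
import torch
import bitsandbytes.functional as F

w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# v1-style NF4: the per-block (blocksize 64) scales are themselves
# 8-bit quantized ("compressed statistics" / nested quantization).
_, state_v1 = F.quantize_4bit(w, blocksize=64, quant_type="nf4",
                              compress_statistics=True)

# v2-style NF4: the chunk-64 norms (absmax) stay in full-precision float32.
_, state_v2 = F.quantize_4bit(w, blocksize=64, quant_type="nf4",
                              compress_statistics=False)

print(state_v1.nested, state_v1.absmax.dtype)  # True  torch.uint8
print(state_v2.nested, state_v2.absmax.dtype)  # False torch.float32
```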

ycyy commented 1 month ago

lllyasviel, maybe the code needs to be changed to support the v2 version.

RachidAR commented 1 month ago

Change your code to this one in the ComfyUI_bitsandbytes_NF4 node (lines 113-114): [screenshot of the suggested change]
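For anyone who can't see the image: the gist, as a hypothetical reconstruction only (not a verbatim copy of the screenshot), is to only touch the nested `state2` when it actually exists, since v2 keeps `absmax` in float32 and has no nested state. Something along these lines, using the bitsandbytes `QuantState` fields:

```python
from bitsandbytes.functional import QuantState

def copy_quant_state(state: QuantState, device) -> QuantState:
    """Illustrative helper: move a QuantState's tensors to `device`."""
    if state is None:
        return None
    # v2 checkpoints store the chunk-64 absmax in float32 (state.nested is
    # False), so only copy the nested state2 when it is present.
    state2 = None
    if state.nested and state.state2 is not None:
        state2 = QuantState(
            absmax=state.state2.absmax.to(device),
            shape=state.state2.shape,
            code=state.state2.code.to(device),
            blocksize=state.state2.blocksize,
            quant_type=state.state2.quant_type,
            dtype=state.state2.dtype,
        )
    return QuantState(
        absmax=state.absmax.to(device),
        shape=state.shape,
        code=state.code.to(device),
        blocksize=state.blocksize,
        quant_type=state.quant_type,
        dtype=state.dtype,
        offset=state.offset.to(device) if state.nested else None,
        state2=state2,
    )
```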

Silanda commented 1 month ago

Speed's identical for me. Could it be that the extra size is pushing your VRAM usage ever so slightly over the limit?

Danamir commented 1 month ago

> Change your code to this one in the ComfyUI_bitsandbytes_NF4 node (lines 113-114): [screenshot of the suggested change]

This did not work for me, still slower with the v2 version, but at least it did not break the v1 version. 😅 Thanks for trying!

Danamir commented 1 month ago

> Speed's identical for me. Could it be that the extra size is pushing your VRAM usage ever so slightly over the limit?

The VRAM is already maxed out with v1 on my system, relying on the Nvidia drivers' system-memory fallback. I checked my usage monitor, and it reports roughly the same VRAM and RAM usage for both versions.
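For anyone wanting to check the spillover theory, a quick sketch (plain PyTorch, nothing node-specific): compare what the driver reports against what torch has reserved, since with the system-memory fallback enabled, allocations "succeed" while silently paging to system RAM.

```python
import torch

free, total = torch.cuda.mem_get_info()    # bytes, reported by the driver
allocated = torch.cuda.memory_allocated()  # bytes PyTorch has handed out
reserved = torch.cuda.memory_reserved()    # bytes in PyTorch's cache

print(f"driver: {(total - free) / 2**30:.2f} / {total / 2**30:.2f} GiB in use")
print(f"torch : {allocated / 2**30:.2f} GiB allocated, "
      f"{reserved / 2**30:.2f} GiB reserved")
# If reserved approaches total, the next allocation likely spills into
# shared (system) memory under the fallback, which tanks s/it.
```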

JorgeR81 commented 1 month ago

Sebastian Kamph is working with the v2 model without noticeable issues, but he has a 4090.

How to Upscale with FLUX. Comfy Workflow https://www.youtube.com/watch?v=j-9m2hOcyiU

tchesket commented 1 month ago

This is also happening for me. The VRAM does not appear to be maxed out. Interestingly, the very first time I generate using the v2 model I get normal speeds (~2.x s/it); any subsequent generations run much slower (~5 s/it), even if I unload and reload the model. The v1 model does not exhibit this behavior. I have tried the code change mentioned above and it doesn't seem to make a difference either way.

tchesket commented 1 month ago

Actually, I think it may be a VRAM issue. The VRAM monitor shows only 96% used, but Task Manager shows 8.1 GB including shared GPU memory. It's kind of funny but also kind of sad: if I generate at 1016x1016 rather than 1024x1024, it goes ~3x faster (see the back-of-the-envelope check below). I must just BARELY be over; I wonder if it's possible to reduce my VRAM usage a tiny bit more. I imagine this will be an issue for a lot of people with 8 GB GPUs (RTX 3070 here).
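Back-of-the-envelope check on why 8 pixels per side can matter (assuming Flux's ~8x VAE downsampling; just arithmetic, not measured):

```python
# Flux denoises a roughly (H/8) x (W/8) latent, so activation memory
# scales with the latent area, not the pixel area.
for side in (1024, 1016):
    latent_side = side // 8
    print(f"{side}x{side} -> {latent_side}x{latent_side} "
          f"({latent_side ** 2} latent positions)")

# 127*127 / 128*128 ~= 0.984: only ~1.6% less memory, yet ~3x faster.
# That only makes sense if 1024x1024 lands just over the 8 GB limit and
# 1016x1016 lands just under it.
```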

RandomGitUser321 commented 1 month ago

> Actually, I think it may be a VRAM issue. The VRAM monitor shows only 96% used, but Task Manager shows 8.1 GB including shared GPU memory. It's kind of funny but also kind of sad: if I generate at 1016x1016 rather than 1024x1024, it goes ~3x faster. I must just BARELY be over; I wonder if it's possible to reduce my VRAM usage a tiny bit more. I imagine this will be an issue for a lot of people with 8 GB GPUs (RTX 3070 here).

Any time I start up ComfyUI and try to run the model after it hasn't been opened in a long while (a true cold boot with nothing left in the Windows memory cache, as opposed to the models sometimes still being cached an hour after you close Comfy if you haven't done much on your PC in the meantime), the first run is always slowed to something like 15 seconds per iteration. I just immediately cancel it, unload the models via ComfyUI Manager, and rerun (script below if you want to automate that). 2080 with 8 GB VRAM here, and I know it's working when I see 2.5 s/it, and that's at 1024x1024.
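If you don't want to click through the Manager every time, newer ComfyUI builds expose a `/free` endpoint that does the same unload. A sketch, assuming the default local server on 127.0.0.1:8188 and that your build includes the route:

```python
import json
import urllib.request

# Ask ComfyUI to unload models and free cached memory, mimicking the
# "unload models" button in ComfyUI-Manager.
req = urllib.request.Request(
    "http://127.0.0.1:8188/free",
    data=json.dumps({"unload_models": True, "free_memory": True}).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```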

newxhy commented 1 month ago

i9-13900HX, 16 GB RAM, 4060 8 GB GPU. In ComfyUI, v2 takes 8 minutes and v1 takes 1.5 minutes; in Forge, v2 takes 1.5 minutes. Therefore, it should be an issue with the ComfyUI_bitsandbytes_NF4 plugin, which may require lllyasviel to modify the code.

martjay commented 1 month ago

Perhaps you need to update Forge, ComfyUI, and all extensions to the latest versions. In my tests the speed was unchanged in both Forge and ComfyUI.

Danamir commented 1 month ago

> Perhaps you need to update Forge, ComfyUI, and all extensions to the latest versions. In my tests the speed was unchanged in both Forge and ComfyUI.

I forgot to mention it in the first post, but of course I updated ComfyUI, the custom node, and Forge before posting this issue, and I have kept updating since to check whether anything changes.

Neuromaked commented 1 month ago

I confirm. 3060 Ti 8 GB, 32 GB RAM, latest ComfyUI:

| res (square) | 920 | 1024 | 1200 |
| --- | --- | --- | --- |
| v1 [s/it] | 2.24 | 2.55 | 5.15 |
| v2 [s/it] | 4.54 | 6.01 | 24.5 |

(v2 gets slower in subsequent generations.) In Forge, both versions are equally fast. I also tried the Salto and Bitte models; they work as fast as v1. Same tests with fp8/fp16 and GGUF models (t5xxl fp16), resolutions square except the last:

| res | 920 | 1024 | 1200 | 2560x1440 |
| --- | --- | --- | --- | --- |
| dev fp8/fp16 [s/it] | 15 | 15.5 | 15.5 | 26 |
| dev gguf Q8 [s/it] | 4.6 | 4.63 | 6.12 | 15 |

NF4 models at 2560x1440 give memory errors or run incredibly long.

martjay commented 3 weeks ago

After updating ComfyUI to the latest version, my nodes haven't changed, but generation has become very slow, just like tchesket described. Oh my god, I'm so frustrated. I don't know what happened, and I just can't find a way to improve it.

[screenshots: Snipaste_2024-08-22_10-28-07, Snipaste_2024-08-22_10-27-48]

martjay commented 3 weeks ago

[screenshot: Snipaste_2024-08-22_11-02-59]

OMG, help! Please!

Danamir commented 3 weeks ago

Updated ComfyUI this morning; I still get great performance with the v1 model. Could you try it instead of v2?