comfyanonymous / ComfyUI_bitsandbytes_NF4


20 times slower rather than speedup #9

Open PierreHoule opened 1 month ago

PierreHoule commented 1 month ago

I have an RTX 2060 Super (8 GB VRAM) and 64 GB RAM. Using this NF4 checkpoint loader with either the Schnell or Dev NF4 variant, image generation time increases 20-fold rather than decreasing. Might my video card be incompatible?

Edit: the problem was solved after I removed SplitSigmas from my workflow. It now runs 4x faster than fp8. The only remaining issue is that I run out of VRAM with batches of more than one 1536x1024 image. With fp8, I can easily do batches of three such images, so this wipes out much of the speed gain.

boricuapab commented 1 month ago

I didn't see any inference speed gain on my RTX 2060 Super either; I get the same speed as with the fp8 dev model.

TripleHeadedMonkey commented 1 month ago

If my assessment of what this is is correct, it supports multiple levels of precision, so you have to manually limit the floating-point precision it is allowed to use to get optimum performance; by default it maxes out at the highest precision (FP32).

Previous versions of the model were themselves limited in their precision, so the manual adjustment was not required.
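
For anyone curious what NF4 storage actually does under the hood, here is a minimal sketch using bitsandbytes directly (assuming bitsandbytes is installed with CUDA support; the shapes are made up and this is not the node's actual loader code). The point is that the weight lives in 4 bits but gets dequantized back to fp16/bf16 before each matmul, so NF4 mainly saves memory and transfer time rather than arithmetic:

```python
# Sketch of NF4 weight storage with bitsandbytes; shapes are illustrative.
import torch
import bitsandbytes.functional as F

w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")  # fake weight

# Pack the weight into 4-bit NF4 blocks plus per-block scaling statistics.
w_nf4, quant_state = F.quantize_4bit(w, quant_type="nf4")
print(w_nf4.numel() * w_nf4.element_size())  # ~8 MB stored vs ~32 MB in fp16

# At compute time the weight is dequantized back to the compute dtype before
# the matmul, so the savings are in memory/bandwidth, not in the math itself.
w_back = F.dequantize_4bit(w_nf4, quant_state)
print((w - w_back).abs().mean())  # small quantization error
```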

Maelstrom2014 commented 1 month ago

Same here: no speedup, and no VRAM freed up either.

boricuapab commented 1 month ago

Actually, the speed gain does kick in after the first run.

It takes about 7 minutes on a first run, but on subsequent runs my generation time goes down from 7 minutes to 1 minute.

[attachment: bnbNF4 RTX 2060 Super timing screenshot]

marhensa commented 1 month ago

> Actually, the speed gain does kick in after the first run.
>
> It takes about 7 minutes on a first run, but on subsequent runs my generation time goes down from 7 minutes to 1 minute.

For me, the problem arises when the prompt changes: it switches to LOWVRAM mode.

The 1st initial load is okay to be slow; it should be, since it's loading the models etc.

The 2nd generation, with the same prompt but a different seed, is very fast, much faster than FP8.

The 3rd generation, with a changed prompt, takes longer, and for me it switches to LOWVRAM.

I also commented here on what could be a related issue (or maybe my PC simply can't handle it: RTX 3060 12GB, 32 GB RAM).
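
In case it helps others debug the LOWVRAM fallback, here is a quick, hypothetical way to watch free VRAM around a generation using plain torch (the exact threshold ComfyUI uses is internal to its model manager, so this only shows how close the card is to running out):

```python
# Report used/total VRAM before and after a generation.
import torch

def report_vram(tag: str) -> None:
    free, total = torch.cuda.mem_get_info()  # bytes, current device
    used = total - free
    print(f"{tag}: {used / 2**30:.2f} GiB used / {total / 2**30:.2f} GiB total")

report_vram("before generation")
# ... queue the prompt here ...
report_vram("after generation")
```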

PierreHoule commented 1 month ago

> I didn't see any inference speed gain on my RTX 2060 Super either; I get the same speed as with the fp8 dev model.

I only see a ~4x speed increase with images up to 1152x1152 produced one at a time. Any resolution higher than that slows things down dramatically. Also, with the fp8 flux models, I can do batches of two or three 1536x1024 images. With NF4, I get out-of-memory errors when attempting batches of three 1024x1024 pictures, and a huge slowdown with batches of two. There seems to be something wrong with memory management when the RTX 2060 Super is used.

WainWong commented 1 month ago

When I use the single-file FP8 version, generating a 1024x1024 image takes about 14 GB of VRAM, with a peak of 31 GB of RAM; when I use the NF4 version, it takes about 12.7 GB of VRAM, with a peak of about 16 GB of RAM. Both run at about the same speed, and the reduction in VRAM usage isn't as large as I expected.
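
For comparing numbers like these, one way to measure torch's own peak allocation is below (a sketch, assuming the run is driven from a Python session with torch available; nvidia-smi-style readings also include the CUDA context and allocator fragmentation, so they will read higher):

```python
# Measure torch's peak VRAM allocation across one generation.
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one 1024x1024 generation here ...
peak = torch.cuda.max_memory_allocated() / 2**30
print(f"peak VRAM allocated by torch: {peak:.1f} GiB")
```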