city96 / ComfyUI-GGUF

GGUF Quantization support for native ComfyUI models
Apache License 2.0

Running with 12GB RAM (not VRAM)? #14

GXcells opened this issue 1 month ago

GXcells commented 1 month ago

Is there a way to run these models with 12 GB RAM? With fp8 models it works, but with GGUF models it always fails.

city96 commented 1 month ago

You mean CPU inference? I'll mark this as a duplicate of #10 and track them as one feature request.

GXcells commented 1 month ago

No, not CPU: GPU inference, but with low system RAM (I have 16GB VRAM so that is fine, but the RAM is problematic).

city96 commented 1 month ago

Ah right, I see what you mean. Once the reload bug is fixed it should shrink the model's memory footprint, but the T5 text encoder will still take up a lot of RAM, so until we can get GGUF quants of that working it probably won't fit. https://github.com/city96/ComfyUI-GGUF/issues/5

I assume the regular ComfyUI flux model doesn't work for you either without OOMing or swapping, right?

GXcells commented 1 month ago

Actually, it seems the model does not load into my VRAM but loads into my RAM instead. Not sure why.

GXcells commented 1 month ago

It happens during the gguf_sd_loader function. It seems the model is first loaded into RAM and then transferred to VRAM? Is there a way to load it directly into VRAM?

city96 commented 1 month ago

That's expected. We could technically load into VRAM directly, but there's no easy way to check how much free VRAM we have/need, since we don't know the size of the checkpoint until we load it (at least I don't think that's easy to check? gguf can use mmap, so it might be possible to lazy-load them as well.)
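For reference, the checkpoint's size can at least be estimated from the GGUF header without reading the tensor data, since the `gguf` Python package's GGUFReader exposes per-tensor byte counts. A rough sketch of such a pre-check (not what this repo actually does; `fits_in_vram` and the headroom value are hypothetical):

```python
# Rough sketch of a VRAM pre-check, assuming the `gguf` Python package
# (pip install gguf) and PyTorch with CUDA. Not the loader this repo uses.
import torch
from gguf import GGUFReader

def fits_in_vram(path: str, headroom_bytes: int = 1 << 30) -> bool:
    reader = GGUFReader(path)  # memory-maps the file; parsing the header is cheap
    # Sum the on-disk size of every tensor listed in the header.
    needed = sum(t.n_bytes for t in reader.tensors)
    free, _total = torch.cuda.mem_get_info()  # free / total VRAM in bytes
    return needed + headroom_bytes <= free
```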

city96 commented 1 month ago

@GXcells Okay so I think I got numpy's mmap to play ball and load it directly onto the GPU, or at least reduce memory pressure by a lot. Could you git pull and retest?
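For anyone curious, the general idea is that the GGUF reader hands back numpy arrays that are views into the memory-mapped file, so each tensor can be wrapped and sent to the GPU without ever materialising the whole checkpoint in system RAM. A minimal sketch of that pattern (assuming the `gguf` package; this is not the exact code from the commit, and the real loader presumably keeps quantized tensors packed and dequantizes them on the fly):

```python
# Minimal sketch of mmap -> GPU loading, assuming the `gguf` package and
# PyTorch. Quantized tensors are moved as raw packed data here.
import torch
from gguf import GGUFReader

def gguf_to_cuda_state_dict(path: str, device: str = "cuda") -> dict:
    reader = GGUFReader(path)  # mmap-backed; tensor.data are views into the file
    state_dict = {}
    for tensor in reader.tensors:
        # torch.from_numpy wraps the mmap'd array without copying it into RAM
        # (it warns that the array is read-only); .to(device) then streams it
        # to VRAM one tensor at a time, keeping peak system RAM low.
        state_dict[tensor.name] = torch.from_numpy(tensor.data).to(device)
    return state_dict
```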

Meshwa428 commented 1 month ago

Okay, so I have 12 GB of RAM (not VRAM). Will I be able to run this model? Has anyone tried it?

city96 commented 1 month ago

Minimum seems to be 13 GB of system RAM even with the crappy FP8 T5 model :(

Maybe if you close everything else and add some swap it could manage.

GXcells commented 1 month ago

> @GXcells Okay so I think I got numpy's mmap to play ball and load it directly onto the GPU, or at least reduce memory pressure by a lot. Could you git pull and retest?

Working like a charm, but now I am stuck on the T4 GPU, which throws a CUDA OOM when running the function sd.load_diffusion_model_state_dict, even with a 4-bit quantized model, in a version of Comfy that runs without the UI.

city96 commented 1 month ago

You're not using the CUDA override node with a single GPU, right?

GXcells commented 1 month ago

I am using the "totoro4" branch from https://github.com/camenduru/ComfyUI to run it in a Jupyter notebook.

city96 commented 1 month ago

Hmm, not familiar with that. A T4 should have plenty of VRAM for that unless there's a memory leak or a bug in whatever cuda/numpy/etc. versions you have? Or maybe that repository changes the way the function works and tries to move the state dict we return onto cuda? Not too sure. I might look into it, although there are other, more pressing things that need adding/fixing/optimizing for now.

GXcells commented 1 month ago

> pressing

Yes, it's a bit weird, because I could run the 4-bit quantized models on an RTX 3050 with 4GB VRAM thanks to your node in ComfyUI. I'll post an issue on the Camenduru GitHub, but I also believe they will probably implement your node soon. I will stick with fp8 models for the time being on the T4 GPU. Thanks a lot for your super support :)