comfyanonymous / ComfyUI_bitsandbytes_NF4


Needs some kind of way to unload/offload after it's done sampling #4

Closed RandomGitUser321 closed 3 months ago

RandomGitUser321 commented 3 months ago

As far as I know, a BnB-quantized model anchors itself in VRAM and can't easily be moved back to system memory. I have 8 GB of VRAM, and even after sampling the VRAM stays mostly full, which keeps forcing the VAE decode into tiled mode because there isn't enough memory left.

It doesn't seem to free the VRAM even after I delete the CheckpointLoaderNF4 workflow and build a new one with, say, an SDXL model; the VRAM stays full. If I'm not mistaken, the whole object that's anchored in VRAM has to be deleted, though maybe a copy could be kept in system memory so it reloads quickly next time (RAM -> VRAM instead of from the drive)?

Ratinod commented 3 months ago

And if you change the prompt text, the "out of memory" error will definitely appear.

RandomGitUser321 commented 3 months ago

And if you change the prompt text, the "out of memory" error will definitely appear.

Yeah, that's because it would normally offload the model to system memory, shuffle the T5 back into VRAM to encode the new prompt, then offload the T5, reload the model, and sample. But since BnB keeps the model anchored, you'll probably OOM.
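
To make that shuffle concrete, here's a tiny runnable sketch of the ping-pong ComfyUI is trying to do, using throwaway nn.Linear stand-ins instead of the real Flux and T5 models (names and sizes are made up for illustration):

import torch
import torch.nn as nn

# Throwaway stand-ins for the real models; the real ones are just much bigger.
unet = nn.Linear(4096, 4096)
t5 = nn.Linear(4096, 4096)

if torch.cuda.is_available():
    unet.to("cuda")        # model sitting in VRAM after the previous sampling
    # --- prompt text changes ---
    unet.to("cpu")         # 1. offload the diffusion model to system memory
    t5.to("cuda")          # 2. bring the text encoder into VRAM
    cond = t5(torch.randn(77, 4096, device="cuda"))  # 3. encode the new prompt
    t5.to("cpu")           # 4. evict the text encoder again
    unet.to("cuda")        # 5. reload the diffusion model and sample
    # With the NF4 checkpoint, step 1 reportedly doesn't actually release the
    # quantized weights, so step 5 finds the VRAM still occupied and OOMs.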

marhensa commented 3 months ago

And if you change the prompt text, the "out of memory" error will definitely appear.

Yeah, that's because it would normally offload the model to system memory, shuffle the T5 back into VRAM to encode the new prompt, then offload the T5, reload the model, and sample. But since BnB keeps the model anchored, you'll probably OOM.

I have 12 GB of VRAM, and the model (flux1-schnell-bnb-nf4.safetensors) is 11.2 GB, so it barely fits (I'm making sure my system is very barebones and clean).

The first generation runs fast from VRAM, but on the second prompt ComfyUI switches to lowvram mode and starts shuffling parts of the models to RAM, which makes it very slow.

Is there any way to make it stay in VRAM?

Or, if the problem is the T5 CLIP, is there any way to fix that?

comfyanonymous commented 3 months ago

This should be fixed with the latest commit.

Ratinod commented 3 months ago

This should be fixed with the latest commit.

I confirm. Now there is no OOM. Thanks.

Ulexer commented 3 months ago

This should be fixed with the latest commit.

The fix doesn't work for me. After the first generation it runs the text encoder on the CPU.

got prompt
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
Using xformers attention in VAE
Using xformers attention in VAE
Requested to load FluxClipModel_
Loading 1 new model
C:\AI\ComfyUI_windows_portable\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
Requested to load Flux
Loading 1 new model
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.62s/it]
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 14.66 seconds
got prompt
loaded in lowvram mode 3991.193339538574
loaded completely 6049.751877021789 5859.856831550598
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.57s/it]
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 33.91 seconds
marhensa commented 3 months ago

This should be fixed with the latest commit.

The fix doesn't work for me. After the first generation it runs the text encoder on the CPU.


As long as the prompt doesn't change, the next generation (new seed) doesn't touch the T5 CLIP at all, which makes it about 2x faster than a generation right after a prompt change.

For me, and possibly for you as well, changing the prompt brings the T5 CLIP back in, slows everything down, and pushes ComfyUI into lowvram mode.

I don't know why, but that's what I've been observing.

marhensa commented 3 months ago

[screenshot]

Is there any explanation for this, and how can I prevent it?

I'm using Schnell NF4 on an RTX 3060 with 12 GB of VRAM and 32 GB of system RAM.

marhensa commented 3 months ago

I found a workaround that prevents the slowdown after changing the prompt: click Unload Models in ComfyUI Manager (a scriptable version of the same unload is sketched at the end of this comment).

[screenshot]

  1. Change the prompt.
  2. Go to the Manager and click Unload Models.
  3. Queue the new prompt.
  4. Profit! It's much faster this way.

    Loading the models again is not that slow; it's faster than waiting for the CLIP pass in lowvram mode.

[screenshot]

It's as fast as it should be: changing the prompt without unloading the models takes about 45 seconds per image, roughly 2.5x slower, while unloading the models only adds 2-3 seconds to the total generation time.

RTX 3060 12GB, RAM 32GB.
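
If clicking the button every time gets tedious, here is a minimal sketch of triggering the same unload over HTTP. It assumes the running ComfyUI exposes the /free route (which is what the Manager's Unload Models button appears to call); adjust host and port to your own launch arguments.

import json
import urllib.request

def unload_models(host: str = "127.0.0.1", port: int = 8188) -> None:
    """Ask the ComfyUI server to unload models and free VRAM via POST /free."""
    payload = json.dumps({"unload_models": True, "free_memory": True}).encode("utf-8")
    req = urllib.request.Request(
        f"http://{host}:{port}/free",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req).close()

# Example: call this right before queueing a prompt whose text changed,
# e.g. unload_models(port=8189) if ComfyUI was started with --port 8189.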

comfyanonymous commented 3 months ago

Can I see your exact workflow?

marhensa commented 3 months ago

Can I see your exact workflow?

Sure, here it is. Thank you @comfyanonymous for your help and attention.

Workflow: Workflow-NF4-Schnell.json

The problem is shown in this video: https://youtu.be/2JaADaPbHOI

  1. Normal run without changing the prompt (quick)
  2. Changing the prompt (slow, because it drops into lowvram mode)
  3. Changing the prompt with Unload Models first (quick + 3 seconds)

comfyanonymous commented 3 months ago

And are both the node and ComfyUI updated to the latest version?

marhensa commented 3 months ago

And are both the node and ComfyUI updated to the latest version?

[screenshot]

ComfyUI: e9589d6d9246d1ce5a810be1507ead39fff50e04 (17 hours ago); this node: f1935bd901860d4c1401dde5106f4c9543735ce8 (5 hours ago)

I see there's an update in ComfyUI: "Support loading directly to vram with CLIPLoader node."

I'll check it and give an update.

boricuapab commented 3 months ago

I also notice a bit of a speed gain in the total prompt time after clicking the clear model button.

[screenshot: bbnf4modelclearing]

marhensa commented 3 months ago

And are both the node and ComfyUI updated to the latest version?

I see there's an update in ComfyUI: "Support loading directly to vram with CLIPLoader node."

I'll check it and give an update.

I updated both ComfyUI and the node to the latest git pull and still have this problem:

26 seconds: initial load, first generation
16 seconds: 2nd generation without changing the prompt
47 seconds: 3rd generation with a prompt change
16 seconds: 4th generation without changing the prompt
20 seconds: 5th generation with a prompt change + Unload Models first

In video: https://youtu.be/nmjhOKDp6VY

RandomGitUser321 commented 3 months ago

What Windows version and CPU are you guys using? This could be that annoying Windows 11 scheduler issue where it sometimes runs stuff on e-cores.

marhensa commented 3 months ago

What Windows version and CPU are you guys using? This could be that annoying Windows 11 scheduler issue where it sometimes runs stuff on e-cores.

I'm using an older Ryzen 5 3600, RTX 3060 12GB, 32 GB RAM, on an always-updated Windows 11, currently 23H2 Build 22631.3958.

Python 3.10.6, a manual (non-portable) ComfyUI install in a virtualenv, torch 2.2.2+cu121. Launch command: .\venv\Scripts\python.exe -s main.py --disable-xformers --listen --port 8189

List of packages installed inside the virtualenv: pip list.txt

Okay, I'll go look into that scheduler issue. If all else fails, I can try running it in Ubuntu under WSL2.

RandomGitUser321 commented 3 months ago

Yeah, I don't know much about AMD CPUs, but if they have some kind of equivalent to p-cores and e-cores, it could be a similar thing. Just throwing it out there as an idea; it may or may not be relevant.

Ulexer commented 3 months ago

I found a workaround that prevents the slowdown after changing the prompt: click Unload Models in ComfyUI Manager.

I found a slightly more convenient workaround: a modified version of the node from https://github.com/LarryJane491/ComfyUI-ModelUnloader that automatically unloads the models after the image is generated.

[Screenshot 2024-08-12 150401]

from comfy import model_management


class ModelUnloader:
    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "image": ("IMAGE",),
            },
            "optional": {},
        }

    RETURN_TYPES = ("IMAGE",)
    RETURN_NAMES = ("image_output",)

    FUNCTION = "unload_model"

    CATEGORY = "loaders"

    def unload_model(self, image):
        # Pop every currently loaded model off ComfyUI's tracking list and
        # unload it from VRAM. Iterating backwards keeps the indices valid
        # while the list shrinks.
        loaded_models = model_management.current_loaded_models
        unloaded_model = False
        for i in range(len(loaded_models) - 1, -1, -1):
            m = loaded_models.pop(i)
            m.model_unload()
            del m
            unloaded_model = True
        if unloaded_model:
            # Ask PyTorch to release its cached VRAM back to the driver.
            model_management.soft_empty_cache()
        # Pass the image straight through so the node can sit after VAE Decode.
        return (image,)


NODE_CLASS_MAPPINGS = {
    "Model unloader": ModelUnloader,
}
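
A note on wiring, for anyone copying this: the node just passes the IMAGE through, so it presumably sits between VAE Decode and Save Image, and the unload happens as a side effect once the decoded image reaches it. Saving the snippet as its own .py file under ComfyUI/custom_nodes should also work, since ComfyUI picks up any module there that defines NODE_CLASS_MAPPINGS.
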
marhensa commented 3 months ago

I found a workaround that prevents the slowdown after changing the prompt: click Unload Models in ComfyUI Manager.

I found a slightly more convenient workaround: a modified version of the node from https://github.com/LarryJane491/ComfyUI-ModelUnloader that automatically unloads the models after the image is generated.

@Ulexer the node won't connect to anything? How can I use it?

Edit: oh, I see, I need to paste your code into that node (edit modelunload.py).

NICE! It solved my problem, really spot on!

Now I can use Flux Schnell NF4 in ComfyUI without any issues, even though the Wildcard node changes the prompt on every generation.

Without the unload node you edited, the prompt change slows the process down significantly by pushing it into lowvram mode.

Thank you! You should open a pull request for that node, though. Your edit made it work, so thank you @Ulexer!

[screenshots]

MrUSBEN commented 3 months ago

I found a workaround that prevents the slowdown after changing the prompt: click Unload Models in ComfyUI Manager.

I found a slightly more convenient workaround: a modified version of the node from https://github.com/LarryJane491/ComfyUI-ModelUnloader that automatically unloads the models after the image is generated.


Your node didn't work for me, so I used the other unloader, but the idea worked, so thank you.

comfyanonymous commented 3 months ago

If you still have issues with the latest ComfyUI, can you run it with --verbose and give me the full log?

marhensa commented 3 months ago

@comfyanonymous Okay, I updated ComfyUI to the latest version and added the --verbose argument.

Here's the test:

With the unload method (from the Manager), even when changing the prompt, the log shows:

Unloading AutoencodingEngine
Unloading FluxClipModel
Unloading Flux
got prompt
...
...
Prompt executed in 19.31 seconds  <-- no lowvram, even with a prompt change

For me it's 46 vs 19 seconds, so I'd choose to unload the models even if it's not convenient.

Here's the workflow: Workflow-NF4-Schnell.json
Here's the verbose log: verbose-logs-20240813-comfyui.txt

comfyanonymous commented 3 months ago

Thanks, it should actually be fixed now if you update ComfyUI.

marhensa commented 3 months ago

Thanks, it should actually be fixed now if you update ComfyUI.

Woah, that was fast. You're a genius, many thanks!

Generating images with a prompt change now takes only 21 seconds ("Prompt executed in 21.11 seconds"), and I can see log lines about unloading the models when the prompt changes.

It's even better than the workaround above (the unload node): without a prompt change there's no need to unload, and the speed stays high ("Prompt executed in 15.83 seconds").

Thank you very much @comfyanonymous, you are the best!