1038lab / ComfyUI-OmniGen

ComfyUI-OmniGen - A ComfyUI custom node implementation of OmniGen, a powerful text-to-image generation and editing model.

Saving disk and download time (plus VRAM) #19

Open set-soft opened 2 days ago

set-soft commented 2 days ago

I manually downloaded the model from here:

https://huggingface.co/silveroxides/OmniGen-V1/tree/main

I renamed the FP8 file to model.safetensors and got it working.

The FP8 model is just 3.7 GB.

1038lab commented 2 days ago

How is the FP8 version? Does it run faster?

set-soft commented 1 day ago

Not sure if it's faster, but it saves disk space and download time.

You should also investigate https://github.com/newgenai79/OmniGen/ which also keeps the model in FP8 in VRAM, enabling its use on 8 GiB boards.

I tried your addon in combination with the OmniGen code from https://github.com/chflame163/ComfyUI_OmniGen_Wrapper. That code applies some quantization and was fine on a 12 GiB board. For this I used the FP8 file, simply because my internet connection is slow and I didn't want to wait for hours (I already had the FP8 downloaded). So the FP8 file (on disk) works with your addon, but using chflame163's copy of the OmniGen code.

I also verified that newgenai79's code (which is for the original demo, not a ComfyUI addon) works perfectly with the FP8 file and uses only 55% of my VRAM.
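
If you want to verify what the checkpoint actually stores on disk, something like this works (just a sketch; the path is only an example of where the file might live):

```python
# Sketch: count the on-disk dtypes stored in the downloaded checkpoint.
from collections import Counter
from safetensors import safe_open

path = "ComfyUI/models/OmniGen/model.safetensors"  # hypothetical location

dtypes = Counter()
with safe_open(path, framework="pt", device="cpu") as f:
    for key in f.keys():
        dtypes[str(f.get_tensor(key).dtype)] += 1

print(dtypes)  # the FP8 file should report a torch.float8_* dtype for most tensors
```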

1038lab commented 1 day ago

Updated, try the new version.

set-soft commented 11 hours ago

Hi @1038lab! I'm afraid it doesn't work; it doesn't even start the inference.

For some reason you unconditionally load everything into VRAM, first the VAE (330 MB) and then the model, and at that point not even 12 GB is enough. The call "pipe = pipe.to(device)" fails, the model isn't loaded, and the VRAM is left with 10926 MB from the failed load. Then, at the beginning of the pipeline, you call "self.model.to(dtype)", which fails on top of the previous failure. Your strategy is for boards with 16 GB or more.

This is with "memory_management" set to "Memory Priority". The only thing "Memory Priority" does is request "offload_model", which doesn't help much and makes things really slow. When I tested it on an older version it didn't help at all, and it moved layers using just one CPU core; I'm not sure if the code still does this.
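
As a minimal improvement over the unconditional "pipe.to(device)", the move could at least be guarded by a free-VRAM check. This is only illustrative (the helper name, the 10% headroom, and falling back to the CPU are all assumptions, not what any of the linked repos does):

```python
# Illustrative only: move the model to the GPU only if its weights fit in the
# currently free VRAM, instead of an unconditional .to(device).
import torch

def move_if_it_fits(module: torch.nn.Module, device: str = "cuda") -> torch.nn.Module:
    if device.startswith("cuda") and torch.cuda.is_available():
        free_bytes, _total = torch.cuda.mem_get_info()
        needed = sum(p.numel() * p.element_size() for p in module.parameters())
        needed += sum(b.numel() * b.element_size() for b in module.buffers())
        if needed * 1.1 > free_bytes:   # keep some headroom for activations
            return module               # stay on the CPU and rely on offloading
    return module.to(device)
```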

The main problem I see here is the strategy of downloading the upstream code, which doesn't implement a good memory strategy. You must incorporate it into the repository and patch it to handle memory properly.

Also: loading the FP8 file won't solve the memory issues. PyTorch loads it using the current default dtype, so the model gets expanded once loaded; it's only small on disk. To get quantization working you must patch the nn.Linear layers.
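
Roughly the idea (a simplistic sketch only; real FP8 quantization also needs scaling factors, which this ignores, and it is not the exact approach used by the repos above):

```python
# Sketch of the idea: keep the Linear weights resident in FP8 and cast them
# up per call, so the model does not expand to fp16/fp32 at load time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FP8Linear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        # Store the weight as float8 so the resident footprint stays small.
        self.register_buffer("weight", linear.weight.detach().to(torch.float8_e4m3fn))
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize only for this matmul; the temporary copy is freed right after.
        return F.linear(x, self.weight.to(x.dtype), self.bias)

def patch_linears(model: nn.Module) -> None:
    """Recursively replace nn.Linear submodules with the FP8 wrapper."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, FP8Linear(child))
        else:
            patch_linears(child)
```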

BTW: please don't use print, use logging.debug. With print the messages go only to the console and the GUI can't capture them.
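
Something along these lines (logger name and helper are just placeholders):

```python
# Sketch: a module-level logger instead of print(); handlers attached by the
# GUI or by ComfyUI itself can then capture the messages.
import logging

logger = logging.getLogger("ComfyUI-OmniGen")  # name is just a suggestion

def report_load(model_path: str, device: str) -> None:
    # Was: print(f"Loading {model_path} on {device}")
    logger.debug("Loading %s on %s", model_path, device)
```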

1038lab commented 7 hours ago

Applying quantization is a good approach. I'll make an effort to update it when I have the time.