comfyanonymous / ComfyUI_bitsandbytes_NF4

GNU Affero General Public License v3.0

Allow loading unquantized models and BNB dtype selection #21

Open blepping opened 3 months ago

blepping commented 3 months ago

mostly just putting this here for example/discussion purposes.

these changes seem to work, but i really don't know if the implementation is correct. this pull allows loading unquantized models and selecting the BNB quantization dtype.

i also looked at implementing BNB's int8 but it seems like the matmul for int8 doesn't support the operations Comfy wants to use.
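for context, the 4-bit linear path looks roughly like this; the dtype-selection helper below is just a sketch of the idea, not necessarily the exact code in this pull:

```py
import torch
import bitsandbytes as bnb

# roughly the existing 4-bit forward in this node: matmul_4bit needs the
# Params4bit weight's quant_state to dequantize on the fly
def functional_linear_4bits(x, weight, bias):
    out = bnb.matmul_4bit(x, weight.t(), bias=bias, quant_state=weight.quant_state)
    return out.to(x.dtype)

# sketch of the dtype-selection idea: wrap an unquantized weight in Params4bit
# with the chosen quant_type ("nf4" or "fp4"); the actual 4-bit packing happens
# lazily when the parameter is moved to a CUDA device
def wrap_linear_weight(weight: torch.Tensor, quant_type: str = "nf4"):
    if weight.dtype not in (torch.float16, torch.bfloat16, torch.float32):
        # BNB blockwise quantization only supports 16/32-bit floats
        weight = weight.to(torch.float16)
    return bnb.nn.Params4bit(weight, requires_grad=False, quant_type=quant_type)
```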

bananasss00 commented 3 months ago

It works with SDXL models, but when I add LoRA, I get the following error:

TypeError: All input tensors need to be on the same GPU, but found some tensors to not be on a GPU: [(torch.Size([1, 204800]), device(type='cuda', index=0)), (torch.Size([6400]), device(type='cpu')), (torch.Size([320, 1280]), device(type='cuda', index=0))]

Can you fix this?

and sometimes, with lora too:

File "V:\comfyu_py311\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl return forward_call(*args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "V:\comfyu_py311\python_embeded\Lib\site-packages\torch\nn\modules\container.py", line 217, in forward input = module(input) ^^^^^^^^^^^^^ File "V:\comfyu_py311\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "V:\comfyu_py311\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl return forward_call(args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "V:\comfyu_py311\ComfyUI\custom_nodes\ComfyUI_bitsandbytes_NF4-fork__init.py", line 171, in forward return functional_linear_4bits(x, self.weight, self.bias) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "V:\comfyu_py311\ComfyUI\custom_nodes\ComfyUI_bitsandbytes_NF4-fork\init__.py", line 12, in functional_linear_4bits out = bnb.matmul_4bit(x, weight.t(), bias=bias, quant_state=weight.quant_state) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "V:\comfyu_py311\python_embeded\Lib\site-packages\bitsandbytes\autograd_functions.py", line 566, in matmul_4bit assert quant_state is not None ^^^^^^^^^^^^^^^^^^^^^^^ AssertionError

blepping commented 3 months ago

@bananasss00

It works with SDXL models, but when I add LoRA, I get the following error:

even if you got past that point it still wouldn't work. i actually spent some time trying to figure out how to get LoRAs working but wasn't successful. the Comfy LoRA loader logic is pretty complicated: it ends up trying to add the non-quantized LoRA weights to the quantized model, or something like that.

however, you can use the CLIP part of LoRAs without an issue if you also load the model normally for the prompt-encoding part. i also don't recommend using the CLIP output from the BNB nodes: there's really no reason to use quantized CLIP (at least with SDXL and SD1x). in other words: use a normal model loader and its CLIP output to encode the prompt to CONDITIONING; you can apply whatever LoRAs you want, but only the CLIP part will have an effect. then use the MODEL output from the BNB loader node for sampling. hope that makes sense.
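concretely, the wiring looks something like this (standard ComfyUI node names; the NF4 loader's node name is whatever this repo registers it as):

- CheckpointLoaderSimple → MODEL + CLIP → LoraLoader → use its CLIP output → CLIPTextEncode → CONDITIONING
- NF4/BNB checkpoint loader → MODEL → KSampler (with the CONDITIONING from above)

only the LoRA's CLIP weights matter in this setup; its UNet weights get applied to the normal loader's model, which is never used for sampling.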

bananasss00 commented 3 months ago

@blepping I'm encountering an error when running ComfyUI with fp8 parameters.

Traceback:

```py
Blockwise quantization only supports 16/32-bit floats, but got torch.float8_e4m3fn
Traceback (most recent call last):
  File "V:\comfyu_py311\ComfyUI\execution.py", line 152, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
  File "V:\comfyu_py311\ComfyUI\execution.py", line 82, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
  File "V:\comfyu_py311\ComfyUI\execution.py", line 75, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
  File "V:\comfyu_py311\ComfyUI\nodes.py", line 1382, in sample
    return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
  File "V:\comfyu_py311\ComfyUI\nodes.py", line 1352, in common_ksampler
    samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
  File "V:\comfyu_py311\ComfyUI\comfy\sample.py", line 43, in sample
    samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
  File "V:\comfyu_py311\ComfyUI\comfy\samplers.py", line 829, in sample
    return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
  File "V:\comfyu_py311\ComfyUI\comfy\samplers.py", line 729, in sample
    return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
  File "V:\comfyu_py311\ComfyUI\comfy\samplers.py", line 706, in sample
    self.inner_model, self.conds, self.loaded_models = comfy.sampler_helpers.prepare_sampling(self.model_patcher, noise.shape, self.conds)
  File "V:\comfyu_py311\ComfyUI\comfy\sampler_helpers.py", line 66, in prepare_sampling
    comfy.model_management.load_models_gpu([model] + models, memory_required=memory_required, minimum_memory_required=minimum_memory_required)
  File "V:\comfyu_py311\ComfyUI\comfy\model_management.py", line 527, in load_models_gpu
    cur_loaded_model = loaded_model.model_load(lowvram_model_memory, force_patch_weights=force_patch_weights)
  File "V:\comfyu_py311\ComfyUI\comfy\model_management.py", line 325, in model_load
    raise e
  File "V:\comfyu_py311\ComfyUI\comfy\model_management.py", line 321, in model_load
    self.real_model = self.model.patch_model(device_to=patch_model_to, patch_weights=load_weights)
  File "V:\comfyu_py311\ComfyUI\comfy\model_patcher.py", line 352, in patch_model
    self.model.to(device_to)
  File "V:\comfyu_py311\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1174, in to
    return self._apply(convert)
  File "V:\comfyu_py311\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 780, in _apply
    module._apply(fn)
  File "V:\comfyu_py311\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 780, in _apply
    module._apply(fn)
  File "V:\comfyu_py311\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 780, in _apply
    module._apply(fn)
  File "V:\comfyu_py311\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 805, in _apply
    param_applied = fn(param)
  File "V:\comfyu_py311\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1160, in convert
    return t.to(
  File "V:\comfyu_py311\ComfyUI\custom_nodes\ComfyUI_bitsandbytes_NF4-fork\__init__.py", line 77, in to
    return self._quantize(device)
  File "V:\comfyu_py311\python_embeded\Lib\site-packages\bitsandbytes\nn\modules.py", line 297, in _quantize
    w_4bit, quant_state = bnb.functional.quantize_4bit(
  File "V:\comfyu_py311\python_embeded\Lib\site-packages\bitsandbytes\functional.py", line 1238, in quantize_4bit
    raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.float8_e4m3fn
```

I fixed it by https://gist.github.com/bananasss00/bdfc67da5e1642e4e94796574d5826a0#file-__init__-py-L77

but the model loads slowly the first time. Is there a better fix for fp8 mode?
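for reference, that workaround amounts to roughly this (a sketch assuming the fix just upcasts the fp8 weight before quantization; the class name is illustrative and the gist's actual code may differ):

```py
import torch
import bitsandbytes as bnb

class UpcastingParams4bit(bnb.nn.Params4bit):
    # sketch: override `to` (the spot the traceback points at) so fp8 weights
    # are upcast before bitsandbytes' blockwise quantization runs
    def to(self, device, *args, **kwargs):
        if self.data.dtype == torch.float8_e4m3fn:
            # quantize_4bit only accepts 16/32-bit floats; note the precision
            # already lost going to fp8 is NOT recovered by this upcast
            self.data = self.data.to(torch.float16)
        return self._quantize(device)
```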

blepping commented 3 months ago

I'm encountering an error when running ComfyUI with fp8 parameters.

i don't think doing that makes any sense. you need to pick one type of quantization.

likely what you're doing is loading the model, then casting it down to 8 bit (which is a lossy process), then converting it back up to 16 bit, then quantizing it down to 4 bit. that converts the model multiple times, which is naturally going to be slow. and the information that got thrown away converting to 8 bit cannot be recovered, even if you cast it back to 16 bit later on.
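a quick way to see that the loss is permanent (requires a torch build with float8_e4m3fn support):

```py
import torch

w = torch.randn(8, dtype=torch.float16)
w8 = w.to(torch.float8_e4m3fn)   # lossy cast down to 8 bit
w16 = w8.to(torch.float16)       # casting back up does not restore precision
print((w - w16).abs().max())     # nonzero: the rounding error is baked in
```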

it's basically like if you had a PNG file, then you JPEG compressed it with medium compression, then you converted it back to PNG, then you JPEG compressed it with high compression - it's going to be both slower and worse quality than if you had just taken the original high-quality source and used high JPEG compression on it.

bananasss00 commented 3 months ago

@blepping I have already compressed all my SD15 and SDXL models to FP8 for space efficiency, so the quality will not degrade further in my case.

ComfyUI is running continuously in FP8 mode. Therefore, when there is a need to load Flux or another model in NF4/FP4, there should be an option to simply add a node without having to restart ComfyUI.

blepping commented 3 months ago

I have already compressed all my SD15 and SDXL models to FP8 for space efficiency, so the quality will not degrade further in my case.

you haven't avoided the issue i mentioned, you just did the first step of it in the past. by converting an FP8 model to NF4 or whatever, you are getting hit with quality loss multiple times instead of just once.

ComfyUI is running continuously in FP8 mode.

like i said, i don't recommend doing that if you're going to be using the BNB quantization too.

Therefore, when there is a need to load Flux or another model in NF4/FP4, there should be an option to simply add a node without having to restart ComfyUI.

that would be ideal, but sadly it's beyond my capabilities. i don't even know how this stuff interacts with Comfy's internal casting/quantization if you have FP8 mode enabled. you could try creating an issue here or maybe in the main Comfy repo asking for that feature, though i'm not sure how good the chances are of it actually being implemented.

bananasss00 commented 3 months ago

https://gist.github.com/bananasss00/301d666f20370412a571570ec304527c

changes, maybe you want to add them too: [screenshot]

blepping commented 3 months ago

changes, maybe you want to add them too

looks useful, though i'd probably take a slightly different approach to implementing it. i don't think there's much chance of this pull getting merged though, so probably the only way to make it available would be to publish it as a different node.

nux1111 commented 2 months ago

@bananasss00 any idea how to solve this? I get it when I try to load a LoRA:

    File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\bitsandbytes\functional.py", line 432, in is_on_gpu
      raise TypeError(
    TypeError: All input tensors need to be on the same GPU, but found some tensors to not be on a GPU: [(torch.Size([28311552, 1]), device(type='cuda', index=0)), (torch.Size([1, 3072]), device(type='cuda', index=0)), (torch.Size([1, 18432]), device(type='cuda', index=0)), (torch.Size([884736]), device(type='cpu')), (torch.Size([16]), device(type='cpu'))]

blepping commented 2 months ago

any idea how to solve this? I get it when i try to load a lora

you can't use LoRAs with NF4. i recommend switching to GGUF if possible (which does allow using LoRAs): https://github.com/city96/ComfyUI-GGUF

nux1111 commented 2 months ago

@blepping Thanks for the response. I thought Forge had solved this and that the code @bananasss00 posted does that too.

blepping commented 2 months ago

Thanks for the response. I thought Forge had solved this and that the code bananasss00 posted does that too.

i think bananasss' changes just added some improvements to the node (like letting you choose which parts to load); they didn't add core functionality.

there's nothing that makes NF4 and LoRAs inherently incompatible, so Forge may well have implemented it. that won't help you here though, unless someone wants to try to port the changes, and there isn't much enthusiasm for that since this repo is basically abandoned/deprecated.

BTW, i contributed changes to ComfyUI-GGUF to support SD1x and SDXL models as well as stable-diffusion.cpp quantized models. so even though the documentation says "don't use with SD1/SDXL", that's out of date. just mentioning it in case that's what was holding you back from switching.

nux1111 commented 2 months ago

@blepping Huge! thanks for your contributions, I will check sd1/sdxl quants, didn't know it was possible.

nux1111 commented 2 months ago

could you post some sd1/sdxl gguf model download links if you have any?

blepping commented 2 months ago

could you post some sd1/sdxl gguf model download links if you have any?

sorry, i don't have any. it's a pretty recent thing. you'd need to convert them yourself. the easiest way is probably with https://github.com/leejet/stable-diffusion.cpp
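for example, stable-diffusion.cpp has a convert mode that can quantize a checkpoint straight to GGUF, something like `sd -M convert -m model.safetensors -o model-q8_0.gguf --type q8_0` (flags from memory of its README, so double-check the current docs for the exact syntax and supported quant types).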