comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
https://www.comfy.org/
GNU General Public License v3.0

FP8 doesn't work on GTX 1650 #2253

Closed FNSpd closed 9 months ago

FNSpd commented 10 months ago

When I try to generate an image using FP8, I'm getting this error:

```
Loading 1 new model
loading in lowvram mode 256.0
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Requested to load BaseModel
Requested to load ControlNet
Loading 2 new models
WARNING:accelerate.big_modeling:You shouldn't move a model when it is dispatched on multiple devices.
loading in lowvram mode 256.0
ERROR:root:!!! Exception during processing !!!
ERROR:root:Traceback (most recent call last):
  File "D:\ComfyUI\execution.py", line 153, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
  File "D:\ComfyUI\execution.py", line 83, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
  File "D:\ComfyUI\execution.py", line 76, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
  File "D:\ComfyUI\custom_nodes\ComfyUI-Impact-Pack\modules\impact\segs_nodes.py", line 104, in doit
    segs, cnet_pil_list = SEGSDetailer.do_detail(image, segs, guide_size, guide_size_for, max_size, seed, steps, cfg, sampler_name,
  File "D:\ComfyUI\custom_nodes\ComfyUI-Impact-Pack\modules\impact\segs_nodes.py", line 81, in do_detail
    enhanced_pil, cnet_pil = core.enhance_detail(cropped_image, model, clip, vae, guide_size, guide_size_for, max_size,
  File "D:\ComfyUI\custom_nodes\ComfyUI-Impact-Pack\modules\impact\core.py", line 269, in enhance_detail
    refined_latent = ksampler_wrapper(model, seed, steps, cfg, sampler_name, scheduler, positive, negative,
  File "D:\ComfyUI\custom_nodes\ComfyUI-Impact-Pack\modules\impact\core.py", line 72, in ksampler_wrapper
    nodes.KSampler().sample(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
  File "D:\ComfyUI\nodes.py", line 1299, in sample
    return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
  File "D:\ComfyUI\nodes.py", line 1269, in common_ksampler
    samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
  File "D:\ComfyUI\custom_nodes\ComfyUI-Impact-Pack\modules\impact\sample_error_enhancer.py", line 9, in informative_sample
    return original_sample(*args, **kwargs)
  File "D:\ComfyUI\comfy\sample.py", line 93, in sample
    real_model, positive_copy, negative_copy, noise_mask, models = prepare_sampling(model, noise.shape, positive, negative, noise_mask)
  File "D:\ComfyUI\comfy\sample.py", line 86, in prepare_sampling
    comfy.model_management.load_models_gpu([model] + models, model.memory_required([noise_shape[0] * 2] + list(noise_shape[1:])) + inference_memory)
  File "D:\ComfyUI\comfy\model_management.py", line 410, in load_models_gpu
    cur_loaded_model = loaded_model.model_load(lowvram_model_memory)
  File "D:\ComfyUI\comfy\model_management.py", line 297, in model_load
    device_map = accelerate.infer_auto_device_map(self.real_model, max_memory={0: "{}MiB".format(lowvram_model_memory // (1024 * 1024)), "cpu": "16GiB"})
  File "C:\Users\username\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\utils\modeling.py", line 978, in infer_auto_device_map
    module_sizes = compute_module_sizes(model, dtype=dtype, special_dtypes=special_dtypes)
  File "C:\Users\username\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\utils\modeling.py", line 616, in compute_module_sizes
    size = tensor.numel() * dtype_byte_size(tensor.dtype)
  File "C:\Users\username\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\utils\modeling.py", line 115, in dtype_byte_size
    raise ValueError(f"dtype is not a valid dtype: {dtype}.")
ValueError: dtype is not a valid dtype: torch.float8_e4m3fn
```
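The failure above is in accelerate's dtype_byte_size helper, which (as the traceback shows) did not recognize the float8 dtypes at the time. Below is a minimal sketch of what a float8-aware byte-size helper could look like; it is not the actual accelerate source, and the float8 dtypes require PyTorch >= 2.1:

```python
# Minimal sketch of a float8-aware byte-size helper (not the accelerate source).
# Requires PyTorch >= 2.1, which introduced the torch.float8_* dtypes.
import torch

def dtype_byte_size(dtype: torch.dtype) -> float:
    """Bytes per element for a dtype, counting fp8 formats as 1 byte."""
    if dtype == torch.bool:
        return 1 / 8
    if dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
        return 1
    info = torch.finfo(dtype) if dtype.is_floating_point else torch.iinfo(dtype)
    return info.bits / 8

t = torch.zeros(10, dtype=torch.float8_e4m3fn)
print(t.numel() * dtype_byte_size(t.dtype))  # 10 elements * 1 byte each -> 10.0
```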

FNSpd commented 10 months ago

So, I guess the UNet doesn't have a manual cast yet, since --fp8_e4m3fn-text-enc works fine with the latest commit.
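"Manual cast" here refers to keeping weights in fp8 purely as a storage format and casting them up to a supported compute dtype right before each operation, since pre-Ada cards like the GTX 1650 have no native fp8 math. A rough, hypothetical illustration of the idea (not ComfyUI's actual implementation; hypothetical class name, requires PyTorch >= 2.1):

```python
# Hypothetical sketch of manual casting: fp8 storage, higher-precision compute.
# Not ComfyUI's actual code.
import torch

class ManualCastLinear(torch.nn.Linear):
    compute_dtype = torch.float32  # on a GPU this would typically be fp16/bf16

    def forward(self, x):
        # Cast the fp8 weights up to the compute dtype on the fly.
        w = self.weight.to(self.compute_dtype)
        b = self.bias.to(self.compute_dtype) if self.bias is not None else None
        return torch.nn.functional.linear(x.to(self.compute_dtype), w, b)

layer = ManualCastLinear(320, 320)
layer.weight.data = layer.weight.data.to(torch.float8_e4m3fn)  # weights stored in fp8
out = layer(torch.randn(1, 320))
print(layer.weight.dtype, out.dtype)  # torch.float8_e4m3fn torch.float32
```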

NeedsMoar commented 10 months ago

fp8 types weren't supported in hardware until Ada / Hopper, AFAIK. Earlier cards probably have int8, but downcasting to that isn't implemented yet (it can work with SD 2.x models at least; Shark supports it). Even on those cards, using fp8 without TransformerEngine's hardware-assisted logic for picking the right fp8 variant, or deciding to upcast based on input sampling, is kind of iffy, and TransformerEngine doesn't have a Windows build yet.
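For reference, a quick way to check for native fp8 support is to look at the CUDA compute capability; the sketch below assumes PyTorch with CUDA available:

```python
import torch

# Sketch: detect whether the active GPU has native fp8 tensor cores.
# NVIDIA fp8 math starts at compute capability 8.9 (Ada) / 9.0 (Hopper);
# a GTX 1650 is Turing (7.5), so fp8 there can only be a storage format.
def has_native_fp8(device_index: int = 0) -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability(device_index)
    return (major, minor) >= (8, 9)

print(has_native_fp8())
```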

comfyanonymous commented 10 months ago

It should work now. Since you have a 16xx series card, can you give me your speed for the default workflow with default ComfyUI settings vs --fp16-unet vs --fp8_e4m3fn-unet?

FNSpd commented 10 months ago

> It should work now. Since you have a 16xx series card, can you give me your speed for the default workflow with default ComfyUI settings vs --fp16-unet vs --fp8_e4m3fn-unet?

Yeah, it works now. Thanks for your work.

FP8:

```
100%|##################################################################################| 20/20 [00:20<00:00, 1.00s/it]
Requested to load AutoencoderKL
Loading 1 new model
Prompt executed in 28.27 seconds
```

FP16:

```
100%|##################################################################################| 20/20 [00:20<00:00, 1.01s/it]
Requested to load AutoencoderKL
Loading 1 new model
Prompt executed in 24.66 seconds
```

Not that much of a difference in speed for roughly a 2x reduction in memory cost, especially for those of us with 4 GB of VRAM.
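As a rough back-of-the-envelope check on that "~2x" figure, assuming an SD1.5-class UNet of roughly 860M parameters (approximate; the exact count varies by model):

```python
# Back-of-the-envelope UNet weight memory at different storage dtypes,
# assuming ~860M parameters (approximate figure for an SD1.5-class UNet).
params = 860_000_000
for name, bytes_per_param in (("fp32", 4), ("fp16", 2), ("fp8", 1)):
    print(f"{name}: {params * bytes_per_param / 2**30:.2f} GiB")
# fp32: 3.20 GiB, fp16: 1.60 GiB, fp8: 0.80 GiB -- halving the weight memory
# matters on a 4 GB card even if sampling speed barely changes.
```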

FNSpd commented 10 months ago

A little update: the E4M3FN implementation throws errors fairly often (I couldn't find the reason because it seems to happen randomly). The E5M2 implementation works perfectly fine.
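For anyone comparing the two formats: E4M3FN spends more bits on the mantissa (precision) while E5M2 spends more on the exponent (range). The snippet below just prints their numeric limits (requires PyTorch >= 2.1); whether the range difference has anything to do with the random errors here is only a guess.

```python
import torch

# Print the numeric limits of the two fp8 formats (requires PyTorch >= 2.1).
# e4m3fn: 4 exponent / 3 mantissa bits; e5m2: 5 exponent / 2 mantissa bits.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max}, smallest normal={info.tiny}, eps={info.eps}")
```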

comfyanonymous commented 9 months ago

You can open another issue with those E4M3FN errors if they are still a problem.