comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, API and backend with a graph/nodes interface.
https://www.comfy.org/
GNU General Public License v3.0

SDXL generates black images with new --fast arg #4572

bananasss00 opened this issue 3 months ago

bananasss00 commented 3 months ago

Expected Behavior

-

Actual Behavior

[screenshot: the generated image is entirely black]

SD15 and Flux work fine; the problem occurs only with SDXL.

ComfyUI version: https://github.com/comfyanonymous/ComfyUI/commit/bb4416dd5b2d7c2f34dc17e18761dd6b3d8b6ead

Steps to Reproduce

default workflow with SDXL model

Debug Logs

```batch
V:\comfyu_py311>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --fast --fp8_e4m3fn-unet --disable-all-custom-nodes --temp-directory "a:\comfyui-temp" --port 8190
Total VRAM 16376 MB, total RAM 130998 MB
pytorch version: 2.4.0+cu121
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4080 : cudaMallocAsync
Using pytorch cross attention
Setting temp directory to: a:\comfyui-temp\temp
[Prompt Server] web root: V:\comfyu_py311\ComfyUI\web
Adding extra search path checkpoints V:/auto1111-webui/models/Stable-diffusion
Adding extra search path configs V:/auto1111-webui/models/Stable-diffusion
Adding extra search path vae V:/auto1111-webui/models/VAE
Adding extra search path loras V:/auto1111-webui/models/Lora
Adding extra search path loras V:/auto1111-webui/models/LyCORIS
Adding extra search path upscale_models V:/auto1111-webui/models/ESRGAN
Adding extra search path upscale_models V:/auto1111-webui/models/RealESRGAN
Adding extra search path upscale_models V:/auto1111-webui/models/SwinIR
Adding extra search path embeddings V:/auto1111-webui/models/embeddings
Adding extra search path hypernetworks V:/auto1111-webui/models/hypernetworks
Adding extra search path controlnet V:/auto1111-webui/models/ControlNet
V:\comfyu_py311\python_embeded\Lib\site-packages\kornia\feature\lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
Skipping loading of custom nodes
Starting server

To see the GUI go to: http://127.0.0.1:8190
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
V:\comfyu_py311\python_embeded\Lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Requested to load SDXLClipModel
Loading 1 new model
loaded completely 0.0 1560.802734375 True
V:\comfyu_py311\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
  out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
Requested to load SDXL
Loading 1 new model
loaded completely 0.0 2448.5241737365723 True
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:03<00:00,  5.98it/s]
Requested to load AutoencoderKL
Loading 1 new model
loaded completely 0.0 159.55708122253418 True
V:\comfyu_py311\ComfyUI\nodes.py:1498: RuntimeWarning: invalid value encountered in cast
  img = Image.fromarray(np.clip(i, 0, 255).astype(np.uint8))
Prompt executed in 12.89 seconds
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
Requested to load SDXLClipModel
Loading 1 new model
loaded completely 0.0 1560.802734375 True
Requested to load SDXL
Loading 1 new model
loaded completely 0.0 2448.5241737365723 True
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:03<00:00,  6.66it/s]
Requested to load AutoencoderKL
Loading 1 new model
loaded completely 0.0 159.55708122253418 True
Prompt executed in 37.42 seconds
```

Other

No response

ltdrdata commented 3 months ago

Your torch version is cu121. Update it to the cu124 version.
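
For reference, updating the embedded Python's torch to a cu124 build looks roughly like the command below (an illustrative example, not taken from the thread; exact wheel availability depends on your Python version):

```batch
V:\comfyu_py311>.\python_embeded\python.exe -m pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu124
```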

bananasss00 commented 3 months ago

> Your torch version is cu121. Update it to the cu124 version.

Same problem with cu124. The first generation (SD15) is fine; the second one (SDXL) is not.

Debug log:

```batch
V:\comfyu_py311>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --fast --fp8_e4m3fn-unet --disable-all-custom-nodes --temp-directory "a:\comfyui-temp"
Total VRAM 16376 MB, total RAM 130998 MB
pytorch version: 2.4.0+cu124
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4080 : cudaMallocAsync
Using pytorch cross attention
Setting temp directory to: a:\comfyui-temp\temp
[Prompt Server] web root: V:\comfyu_py311\ComfyUI\web
Adding extra search path checkpoints V:/auto1111-webui/models/Stable-diffusion
Adding extra search path configs V:/auto1111-webui/models/Stable-diffusion
Adding extra search path vae V:/auto1111-webui/models/VAE
Adding extra search path loras V:/auto1111-webui/models/Lora
Adding extra search path loras V:/auto1111-webui/models/LyCORIS
Adding extra search path upscale_models V:/auto1111-webui/models/ESRGAN
Adding extra search path upscale_models V:/auto1111-webui/models/RealESRGAN
Adding extra search path upscale_models V:/auto1111-webui/models/SwinIR
Adding extra search path embeddings V:/auto1111-webui/models/embeddings
Adding extra search path hypernetworks V:/auto1111-webui/models/hypernetworks
Adding extra search path controlnet V:/auto1111-webui/models/ControlNet
V:\comfyu_py311\python_embeded\Lib\site-packages\kornia\feature\lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
Skipping loading of custom nodes
Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
V:\comfyu_py311\python_embeded\Lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Requested to load SD1ClipModel
Loading 1 new model
loaded completely 0.0 235.84423828125 True
V:\comfyu_py311\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
  out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
Requested to load BaseModel
Loading 1 new model
loaded completely 0.0 819.703067779541 True
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:01<00:00, 17.15it/s]
Requested to load AutoencoderKL
Loading 1 new model
loaded completely 0.0 159.55708122253418 True
Prompt executed in 8.37 seconds
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
Requested to load SDXLClipModel
Loading 1 new model
loaded completely 0.0 1560.802734375 True
Requested to load SDXL
Loading 1 new model
loaded completely 0.0 2448.5241737365723 True
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  8.51it/s]
Requested to load AutoencoderKL
Loading 1 new model
loaded completely 0.0 159.55708122253418 True
V:\comfyu_py311\ComfyUI\nodes.py:1498: RuntimeWarning: invalid value encountered in cast
  img = Image.fromarray(np.clip(i, 0, 255).astype(np.uint8))
Prompt executed in 36.40 seconds
```
bananasss00 commented 3 months ago

I attempted to download the latest ComfyUI from https://github.com/comfyanonymous/ComfyUI/releases/tag/v0.1.2. Afterward, I updated ComfyUI and torch+cu124. However, the issue with the SDXL model persists.

ltdrdata commented 3 months ago

> I attempted to download the latest ComfyUI from https://github.com/comfyanonymous/ComfyUI/releases/tag/v0.1.2. Afterward, I updated ComfyUI and torch+cu124. However, the issue with the SDXL model persists.

The --fp8_e4m3fn-unet option is the problem.

bananasss00 commented 3 months ago

> I attempted to download the latest ComfyUI from https://github.com/comfyanonymous/ComfyUI/releases/tag/v0.1.2. Afterward, I updated ComfyUI and torch+cu124. However, the issue with the SDXL model persists.
>
> The --fp8_e4m3fn-unet option is the problem.

The --fast optimization is specifically designed for fp8_e4m3fn. If I disable it, there will be no optimization.
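
For context, the fast path in question relies on native fp8 matrix multiplies. A minimal sketch of the idea, assuming PyTorch 2.4+ and an fp8-capable (Ada/Hopper) GPU — this illustrates the technique via PyTorch's private torch._scaled_mm entry point, it is not ComfyUI's actual code, and fp8_linear_sketch is an invented name:

```python
import torch

def fp8_linear_sketch(x: torch.Tensor, w8: torch.Tensor) -> torch.Tensor:
    # x: fp16 activations of shape [m, k]; w8: weights already stored as
    # float8_e4m3fn with shape [n, k].
    # torch._scaled_mm is PyTorch's private fp8 GEMM (backed by cuBLASLt).
    scale = torch.tensor(1.0, dtype=torch.float32, device=x.device)
    # Lossy cast of the activations: with unit scales, anything outside
    # e4m3's finite range is at the mercy of this cast.
    x8 = x.to(torch.float8_e4m3fn)
    return torch._scaled_mm(x8, w8.t(), scale_a=scale, scale_b=scale,
                            out_dtype=torch.float16)
```

Without --fast, the fp8-stored weights are instead upcast before each matmul ("manual cast: torch.float16" in the logs above), which is slower but keeps the arithmetic in a much wider dtype.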

ltdrdata commented 3 months ago

> I attempted to download the latest ComfyUI from https://github.com/comfyanonymous/ComfyUI/releases/tag/v0.1.2. Afterward, I updated ComfyUI and torch+cu124. However, the issue with the SDXL model persists.
>
> The --fp8_e4m3fn-unet option is the problem.
>
> The --fast optimization is specifically designed for fp8_e4m3fn. If I disable it, there will be no optimization.

In fact, the background for introducing that option was to help when loading and using FLUX.1 through the Load Diffusion Model node. It seems that this particular CLI option combination had not been tested. I have already reported this issue to comfy.

bananasss00 commented 3 months ago

Understood, thank you. It's strange that SD15 and Flux work normally with this option.

jkrauss82 commented 2 months ago

I found that the Pony base model works fine with --fp8_e4m3fn-unet; however, other SDXL variants do not.

Remember2015 commented 2 months ago

Any updates?

adamjen commented 2 months ago

Any updates??

jetjodh commented 5 days ago

Does the issue persist after using the fixed SDXL VAE model?

jkrauss82 commented 5 days ago

Yes, it persists; the preview image is already black while the KSampler is running. It does not seem to be related to the VAE: VAE encode/decode runs fine with fp8, otherwise Pony would have the same problem. I am using the regular fp16-fixed vanilla SDXL VAE.

adamjen commented 5 days ago

If you remove --fast from your ComfyUI startup arguments, does the issue persist?

jkrauss82 commented 5 days ago

No, but the point of this issue is that SDXL is not working with --fast; without --fast, things work as normal.

Currently, only Pony models seem to work with --fast; older SDXL derivatives like Juggernaut or StarlightXL do not.

jkrauss82 commented 1 day ago

I did some digging, and the problem seems related to this function used in forward_comfy_cast_weights. With only super-limited knowledge of what is going on under the hood, I can only speculate that we are getting some kind of "out of bounds" error, where the range covered by the e4m3 float format is not enough for the values the forward step needs, corrupting the latents.
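
To make the speculation concrete: float8_e4m3fn has no inf encoding and a finite range of roughly ±448, and on recent PyTorch builds a plain cast to it does not saturate, so out-of-range values become NaN and poison everything downstream — which would also fit the `RuntimeWarning: invalid value encountered in cast` seen when the black image is saved. A tiny illustration (a standalone sketch, not ComfyUI code):

```python
import torch

print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0 -- the entire finite range
print(torch.finfo(torch.float8_e5m2).max)    # 57344.0 -- far wider range

x = torch.tensor([1.0, 448.0, 1000.0])
x8 = x.to(torch.float8_e4m3fn)               # plain, non-saturating cast
print(x8.float())                            # tensor([1., 448., nan]) -- overflow becomes NaN
```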

I was hoping to try e5m2, but it is not supported by cuBLAS, as pointed out by knowledgeable people here (thus it is also not eligible for fp8_linear, and ComfyUI won't execute the fast path when this dtype is chosen).

If my speculation is true, we could mitigate the problem by applying some smarter kind of value scaling, or by simply clamping to the representable min/max values whenever the function would otherwise yield values exceeding the e4m3 range, as sketched below.
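
The clamping variant could look roughly like this (a hypothetical mitigation sketch; to_e4m3_saturated is an invented name, and this is untested as a fix for the issue):

```python
import torch

def to_e4m3_saturated(t: torch.Tensor) -> torch.Tensor:
    # Saturate at the finite e4m3 limits (+/-448) instead of letting the
    # cast turn out-of-range values into NaN.
    lim = torch.finfo(torch.float8_e4m3fn).max
    return t.clamp(min=-lim, max=lim).to(torch.float8_e4m3fn)
```

The "smarter scaling" route would be per-tensor amax scaling, as fp8 training stacks generally do: divide by a scale derived from the tensor's maximum absolute value before the cast, then pass that scale to the GEMM's scale arguments so the result is rescaled correctly.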

I would be happy to help develop and test this further, but I would need some pointers on where to look next from someone with deeper knowledge of this area.