AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: Linux: SDXL-based models fail to load, PyTorch error #15566

Open prmbittencourt opened 5 months ago

prmbittencourt commented 5 months ago

Checklist

What happened?

Whenever I select an SDXL model from the dropdown list at the top of the page, including the SDXL base model, it fails to load. The terminal output shows the following error: AttributeError: module 'torch' has no attribute 'float8_e4m3fn'.

Steps to reproduce the problem

  1. Launch the WebUI.
  2. Click the "down" arrow below "Stable Diffusion checkpoint" at the top left of the page.
  3. Select an SDXL model from the dropdown list.
  4. After a few seconds of processing, the error is printed to the terminal output and the selection reverts to the previously selected model.

What should have happened?

The model should load.

What browsers do you use to access the UI?

Mozilla Firefox

Sysinfo

sysinfo-2024-04-18-15-34.json

Console logs

################################################################
Launching launch.py...
################################################################
Python 3.11.8 (main, Feb 12 2024, 14:50:05) [GCC 13.2.1 20230801]
Version: v1.9.0
Commit hash: adadb4e3c7382bf3e4f7519126cd6c70f4f8557b
Launching Web UI with arguments: --skip-torch-cuda-test --upcast-sampling --opt-sub-quad-attention --medvram-sdxl
2024-04-18 12:28:22.419346: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
==============================================================================
You are running torch 2.0.1+rocm5.4.2.
The program is tested to work with torch 2.1.2.
To reinstall the desired version, run with commandline flag --reinstall-torch.
Beware that this will cause a lot of large files to be downloaded, as well as
there are reports of issues with training tab on the latest version.

Use --skip-version-check commandline argument to disable this check.
==============================================================================
*** "Disable all extensions" option was set, will only load built-in extensions ***
Loading weights [fbc31a67aa] from /opt/stable-diffusion-web-ui/models/Stable-diffusion/instruct-pix2pix-00-22000.safetensors
Running on local URL:  http://127.0.0.1:7860
Creating model from config: /opt/stable-diffusion-web-ui/configs/instruct-pix2pix.yaml
LatentDiffusion: Running in eps-prediction mode
Applying attention optimization: sub-quadratic... done.
Model loaded in 2.1s (load weights from disk: 0.5s, create model: 0.2s, apply weights to model: 1.1s, calculate empty prompt: 0.2s).

To create a public link, set `share=True` in `launch()`.
Startup time: 17.6s (import torch: 2.6s, import gradio: 1.1s, setup paths: 10.3s, other imports: 0.4s, load scripts: 0.4s, create ui: 0.4s, gradio launch: 2.2s).
Loading model sd_xl_base_1.0.safetensors [31e35c80fc] (2 out of 2)
Loading weights [31e35c80fc] from /opt/stable-diffusion-web-ui/models/Stable-diffusion/sd_xl_base_1.0.safetensors
Creating model from config: /opt/stable-diffusion-web-ui/repositories/generative-models/configs/inference/sd_xl_base.yaml
changing setting sd_model_checkpoint to sd_xl_base_1.0.safetensors [31e35c80fc]: AttributeError
Traceback (most recent call last):
  File "/opt/stable-diffusion-web-ui/modules/options.py", line 165, in set
    option.onchange()
  File "/opt/stable-diffusion-web-ui/modules/call_queue.py", line 13, in f
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/stable-diffusion-web-ui/modules/initialize_util.py", line 181, in <lambda>
    shared.opts.onchange("sd_model_checkpoint", wrap_queued_call(lambda: sd_models.reload_model_weights()), call=False)
                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/stable-diffusion-web-ui/modules/sd_models.py", line 860, in reload_model_weights
    sd_model = reuse_model_from_already_loaded(sd_model, checkpoint_info, timer)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/stable-diffusion-web-ui/modules/sd_models.py", line 826, in reuse_model_from_already_loaded
    load_model(checkpoint_info)
  File "/opt/stable-diffusion-web-ui/modules/sd_models.py", line 748, in load_model
    load_model_weights(sd_model, checkpoint_info, state_dict, timer)
  File "/opt/stable-diffusion-web-ui/modules/sd_models.py", line 448, in load_model_weights
    module.to(torch.float8_e4m3fn)
              ^^^^^^^^^^^^^^^^^^^
AttributeError: module 'torch' has no attribute 'float8_e4m3fn'

Additional information

SD1.5 models work. Tested on fully up-to-date EndeavourOS.

w-e-w commented 5 months ago

hint

==============================================================================
You are running torch 2.0.1+rocm5.4.2.
The program is tested to work with torch 2.1.2.
To reinstall the desired version, run with commandline flag --reinstall-torch.
Beware that this will cause a lot of large files to be downloaded, as well as
there are reports of issues with training tab on the latest version.

Use --skip-version-check commandline argument to disable this check.
============================================================================
prmbittencourt commented 5 months ago

Hi, thanks for your input. I ran the script with --reinstall-torch and am now on Torch 2.2.2+rocm5.7. Loading the SDXL model works but every time I generate an image, I get the following error:

==========================================================================================s/it]
A tensor with all NaNs was produced in VAE.
Web UI will now convert VAE into 32-bit float and retry.
To disable this behavior, disable the 'Automatically revert VAE to 32-bit floats' setting.
To always start with 32-bit VAE, use --no-half-vae commandline flag.
==========================================================================================

I'm not sure if it's related to the original problem or not.
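The retry behavior that warning describes can be sketched like this (hypothetical `decode` callable, not actual webui code): decode in half precision first, and redo in 32-bit float only when the result contains NaNs:

```python
import math

def decode_with_fp32_fallback(decode, latents):
    """Mirror of the webui warning quoted above: if the half-precision VAE
    decode yields NaNs, convert to 32-bit float and retry.

    `decode(latents, dtype)` is a hypothetical decoder callable that
    returns a flat list of floats.
    """
    image = decode(latents, "float16")
    if any(math.isnan(x) for x in image):
        image = decode(latents, "float32")  # retry at full precision
    return image

# Fake decoder: fp16 overflows to NaN, fp32 succeeds.
def fake_decode(latents, dtype):
    if dtype == "float16":
        return [float("nan")] * len(latents)
    return latents

assert decode_with_fp32_fallback(fake_decode, [0.1, 0.2]) == [0.1, 0.2]
```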

w-e-w commented 5 months ago

That is not an error message.

A message is not automatically an error message; this one is just telling you what's happening. If it were actually an error, you wouldn't be saying "the SDXL model works, but every time I generate an image ...".


What VAE are you using? If you are not using sdxl-vae-fp16-fix (the NaN-fix one), then IIRC it is quite likely to get NaNs in the VAE. Download https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/sdxl.vae.safetensors and place it in your VAE dir, then select it when using SDXL (in Settings / quick settings), or configure XL models to use this VAE via the card icon on the Checkpoints tab.


I never tried fp8 myself so I can't be sure, but my guess is that if fp8 is used for the VAE, it will only increase the chance of NaNs.

prmbittencourt commented 5 months ago

Toggling the fp8 option seems to have fixed it.

kode54 commented 5 months ago

Doesn't SDXL require Python 3.10 and not anything newer?

w-e-w commented 5 months ago

3.10 is what we test webui on; it doesn't necessarily mean that it wouldn't work with other versions.

But if you're using a different version you might run into issues: your Python version may be too new and a package we depend on hasn't been updated for it yet, or on the other hand it may be too old and a package we use is no longer available for it.
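The policy described here (warn on untested versions rather than refuse to run) can be sketched as follows; `python_version_warning` is a hypothetical helper, not actual webui code:

```python
import sys

TESTED_PYTHON = (3, 10)  # the version webui is tested on, per the comment above

def python_version_warning(version=None):
    """Return a warning string when running an untested Python, else None."""
    major, minor = (version or sys.version_info[:2])
    if (major, minor) != TESTED_PYTHON:
        return (f"Python {major}.{minor} is untested; "
                f"{TESTED_PYTHON[0]}.{TESTED_PYTHON[1]} is the tested version.")
    return None

assert python_version_warning((3, 10)) is None
assert "untested" in python_version_warning((3, 11))
```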