WebUI crashes on ROCm with default provisioning script (probably due to bitsandbytes)

icodeforyou-dot-net commented 6 months ago

I'm trying to run this locally with a simple docker compose up, but on ROCm instead of CUDA. Sadly, the webui keeps crashing with the default provisioning script. The error appears to come from bitsandbytes looking for CUDA in vain, which is understandable: to my knowledge, bitsandbytes has no official ROCm support to date.

However, AMD does appear to maintain its own fork of bitsandbytes: https://github.com/ROCm/bitsandbytes/tree/rocm_enabled

So I think there are two questions then.

1) Shouldn't the ROCm images be built with AMD's fork of bitsandbytes?

2) Or should one identify everything that includes bitsandbytes as a dependency and remove it? Or at least document which things don't work on ROCm at the moment? Sadly, I don't know what that would entail; the provisioning scripts are quite sizable.
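
One way to approach the second question is to ask the installed environment which packages declare bitsandbytes as a requirement. The following is a minimal sketch using only the standard library; the helper names `requirement_name` and `declared_dependents` are made up for this example. It only sees dependencies declared in package metadata, so a package that merely imports bitsandbytes when it happens to be installed would not show up.

```python
import re
from importlib.metadata import distributions


def requirement_name(requirement: str) -> str:
    """Return the normalized project name at the start of a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    name = match.group(1) if match else ""
    # Normalize so that "Foo_Bar" and "foo-bar" compare equal.
    return re.sub(r"[-_.]+", "-", name).lower()


def declared_dependents(target: str) -> list[str]:
    """List installed distributions whose metadata requires `target`.

    Requirements gated behind an extra are counted as well.
    """
    target = re.sub(r"[-_.]+", "-", target).lower()
    found = set()
    for dist in distributions():
        for req in dist.requires or []:
            if requirement_name(req) == target:
                name = dist.metadata.get("Name")
                if name:
                    found.add(name)
    return sorted(found)


if __name__ == "__main__":
    print(declared_dependents("bitsandbytes"))
```

Run inside the container's Python environment, this prints a (possibly empty) list of package names.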

Any ideas, comments?

Thanks for any help in advance!

I am running this on NixOS, kernel 6.1.77, AMD Radeon Pro W6800 GPU

This is my docker-compose.yml:

version: "3.8"
# Compose file build variables set in .env
services:
  supervisor:
    platform: linux/amd64
    build:
      context: ./build
      args:
        IMAGE_BASE: ${IMAGE_BASE:-ghcr.io/ai-dock/jupyter-pytorch:2.2.0-py3.10-rocm-5.7-runtime-22.04}
      tags:
        - "ghcr.io/ai-dock/stable-diffusion-webui:${IMAGE_TAG:-jupyter-pytorch-2.2.0-py3.10-rocm-5.7-runtime-22.04}"

    image: ghcr.io/ai-dock/stable-diffusion-webui:${IMAGE_TAG:-jupyter-pytorch-2.2.0-py3.10-rocm-5.7-runtime-22.04}

    devices:
      - "/dev/dri:/dev/dri"
      # For AMD GPU
      - "/dev/kfd:/dev/kfd"

    volumes:
      # Workspace
      - ./workspace:${WORKSPACE:-/workspace/}:rshared
      # You can share /workspace/storage with other non-WEBUI containers. See README
      #- /path/to/common_storage:${WORKSPACE:-/workspace/}storage/:rshared
      # Will echo to root-owned authorized_keys file;
      # Avoids changing local file owner
      - ./config/authorized_keys:/root/.ssh/authorized_keys_mount
      - ./config/provisioning/default.sh:/opt/ai-dock/bin/provisioning.sh

    ports:
        # SSH available on host machine port 2222 to avoid conflict. Change to suit
        - ${SSH_PORT_HOST:-2222}:${SSH_PORT_LOCAL:-22}
        # Caddy port for service portal
        - ${SERVICEPORTAL_PORT_HOST:-1111}:${SERVICEPORTAL_PORT_HOST:-1111}
        # WEBUI web interface
        - ${WEBUI_PORT_HOST:-7860}:${WEBUI_PORT_HOST:-7860}
        # Jupyter server
        - ${JUPYTER_PORT_HOST:-8888}:${JUPYTER_PORT_HOST:-8888}
        # Syncthing
        - ${SYNCTHING_UI_PORT_HOST:-8384}:${SYNCTHING_UI_PORT_HOST:-8384}
        - ${SYNCTHING_TRANSPORT_PORT_HOST:-22999}:${SYNCTHING_TRANSPORT_PORT_HOST:-22999}

    environment:
        # Don't enclose values in quotes
        - DIRECT_ADDRESS=${DIRECT_ADDRESS:-127.0.0.1}
        - DIRECT_ADDRESS_GET_WAN=${DIRECT_ADDRESS_GET_WAN:-false}
        - WORKSPACE=${WORKSPACE:-/workspace}
        - WORKSPACE_SYNC=${WORKSPACE_SYNC:-false}
        - CF_TUNNEL_TOKEN=${CF_TUNNEL_TOKEN:-}
        - CF_QUICK_TUNNELS=${CF_QUICK_TUNNELS:-true}
        - WEB_ENABLE_AUTH=${WEB_ENABLE_AUTH:-true}
        - WEB_USER=${WEB_USER:-user}
        - WEB_PASSWORD=${WEB_PASSWORD:-password}
        - SSH_PORT_HOST=${SSH_PORT_HOST:-2222}
        - SSH_PORT_LOCAL=${SSH_PORT_LOCAL:-22}
        - SERVICEPORTAL_PORT_HOST=${SERVICEPORTAL_PORT_HOST:-1111}
        - SERVICEPORTAL_METRICS_PORT=${SERVICEPORTAL_METRICS_PORT:-21111}
        - WEBUI_BRANCH=${WEBUI_BRANCH:-}
        - WEBUI_FLAGS=${WEBUI_FLAGS:-}
        - WEBUI_PORT_HOST=${WEBUI_PORT_HOST:-7860}
        - WEBUI_PORT_LOCAL=${WEBUI_PORT_LOCAL:-17860}
        - WEBUI_METRICS_PORT=${WEBUI_METRICS_PORT:-27860}
        - JUPYTER_PORT_HOST=${JUPYTER_PORT_HOST:-8888}
        - JUPYTER_METRICS_PORT=${JUPYTER_METRICS_PORT:-28888}
        - SERVERLESS=${SERVERLESS:-false}
        - SYNCTHING_UI_PORT_HOST=${SYNCTHING_UI_PORT_HOST:-8384}
        - SYNCTHING_TRANSPORT_PORT_HOST=${SYNCTHING_TRANSPORT_PORT_HOST:-22999}
        #- PROVISIONING_SCRIPT=${PROVISIONING_SCRIPT:-}

And here is my error message for completeness' sake, in case I am misinterpreting it.

supervisor-1  | ==> /var/log/supervisor/webui.log <==
supervisor-1  | Starting A1111 SD Web UI...
supervisor-1  | Starting A1111 SD Web UI...
supervisor-1  | Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
supervisor-1  | Version: v1.8.0
supervisor-1  | Commit hash: bef51aed032c0aaa5cfd80445bc4cf0d85b408b5
supervisor-1  | 2024-03-19 17:32:20,668 INFO success: webui entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
supervisor-1  | Installing requirements
supervisor-1  |
supervisor-1  | ==> /var/log/supervisor/supervisor.log <==
supervisor-1  | 2024-03-19 17:32:20,668 INFO success: webui entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
supervisor-1  |
supervisor-1  | ==> /var/log/supervisor/webui.log <==
supervisor-1  | False
supervisor-1  | 'CUDASetup' object has no attribute 'cuda_available'
supervisor-1  | no module 'xformers'. Processing without...
supervisor-1  | no module 'xformers'. Processing without...
supervisor-1  | No module 'xformers'. Proceeding without it.
supervisor-1  | If submitting an issue on github, please provide the full startup log for debugging purposes.
supervisor-1  |
supervisor-1  | Initializing Dreambooth
supervisor-1  | Dreambooth revision: 45a12fe5950bf93205b6ef2b7511eb94052a241f
supervisor-1  | Checking xformers...
supervisor-1  | Checking bitsandbytes...
supervisor-1  | Checking bitsandbytes (ALL!)
supervisor-1  | Checking Dreambooth requirements...
supervisor-1  | Installed version of bitsandbytes: 0.43.0
supervisor-1  | [Dreambooth] bitsandbytes v0.43.0 is already installed.
supervisor-1  | Installed version of accelerate: 0.21.0
supervisor-1  | [Dreambooth] accelerate v0.21.0 is already installed.
supervisor-1  | Installed version of dadaptation: 3.2
supervisor-1  | [Dreambooth] dadaptation v3.2 is already installed.
supervisor-1  | Installed version of diffusers: 0.27.1
supervisor-1  | [Dreambooth] diffusers v0.25.0 is already installed.
supervisor-1  | Installed version of discord-webhook: 1.3.0
supervisor-1  | [Dreambooth] discord-webhook v1.3.0 is already installed.
supervisor-1  | Installed version of fastapi: 0.94.0
supervisor-1  | [Dreambooth] fastapi is already installed.
supervisor-1  | Installed version of gitpython: 3.1.32
supervisor-1  | [Dreambooth] gitpython v3.1.40 is not installed.
supervisor-1  | Successfully installed gitpython-3.1.42
supervisor-1  | Installed version of pytorch_optimizer: 2.12.0
supervisor-1  | [Dreambooth] pytorch_optimizer v2.12.0 is already installed.
supervisor-1  | Installed version of Pillow: 9.5.0
supervisor-1  | [Dreambooth] Pillow is already installed.
supervisor-1  | Installed version of tqdm: 4.66.2
supervisor-1  | [Dreambooth] tqdm is already installed.
supervisor-1  | Installed version of tomesd: 0.1.3
supervisor-1  | [Dreambooth] tomesd v0.1.2 is already installed.
supervisor-1  | Installed version of tensorboard: 2.13.0
supervisor-1  | [Dreambooth] tensorboard v2.13.0 is already installed.
supervisor-1  | [+] torch version 2.2.0+rocm5.7 installed.
supervisor-1  | [+] torchvision version 0.17.0+rocm5.7 installed.
supervisor-1  | [+] accelerate version 0.21.0 installed.
supervisor-1  | [+] diffusers version 0.27.1 installed.
supervisor-1  | [+] bitsandbytes version 0.43.0 installed.
supervisor-1  | [!] xformers NOT installed.
supervisor-1  | False
supervisor-1  | 'CUDASetup' object has no attribute 'cuda_available'
supervisor-1  | no module 'xformers'. Processing without...
supervisor-1  | no module 'xformers'. Processing without...
supervisor-1  | No module 'xformers'. Proceeding without it.
supervisor-1  | Installing requirements for Face Editor
supervisor-1  | CUDA None
supervisor-1  | Launching Web UI with arguments: --port 17860
supervisor-1  | Traceback (most recent call last):
supervisor-1  |   File "/workspace/stable-diffusion-webui/launch.py", line 48, in <module>
supervisor-1  |     main()
supervisor-1  |   File "/workspace/stable-diffusion-webui/launch.py", line 44, in main
supervisor-1  |     start()
supervisor-1  |   File "/workspace/stable-diffusion-webui/modules/launch_utils.py", line 465, in start
supervisor-1  |     import webui
supervisor-1  |   File "/workspace/stable-diffusion-webui/webui.py", line 13, in <module>
supervisor-1  |     initialize.imports()
supervisor-1  |   File "/workspace/stable-diffusion-webui/modules/initialize.py", line 26, in imports
supervisor-1  |     from modules import paths, timer, import_hook, errors  # noqa: F401
supervisor-1  |   File "/workspace/stable-diffusion-webui/modules/paths.py", line 60, in <module>
supervisor-1  |     import sgm  # noqa: F401
supervisor-1  |   File "/workspace/stable-diffusion-webui/repositories/generative-models/sgm/__init__.py", line 1, in <module>
supervisor-1  |     from .models import AutoencodingEngine, DiffusionEngine
supervisor-1  |   File "/workspace/stable-diffusion-webui/repositories/generative-models/sgm/models/__init__.py", line 1, in <module>
supervisor-1  |     from .autoencoder import AutoencodingEngine
supervisor-1  |   File "/workspace/stable-diffusion-webui/repositories/generative-models/sgm/models/autoencoder.py", line 12, in <module>
supervisor-1  |     from ..modules.diffusionmodules.model import Decoder, Encoder
supervisor-1  |   File "/workspace/stable-diffusion-webui/repositories/generative-models/sgm/modules/__init__.py", line 1, in <module>
supervisor-1  |     from .encoders.modules import GeneralConditioner
supervisor-1  |   File "/workspace/stable-diffusion-webui/repositories/generative-models/sgm/modules/encoders/modules.py", line 5, in <module>
supervisor-1  |     import kornia
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/kornia/__init__.py", line 11, in <module>
supervisor-1  |     from . import augmentation, color, contrib, core, enhance, feature, io, losses, metrics, morphology, tracking, utils, x
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/kornia/x/__init__.py", line 2, in <module>
supervisor-1  |     from .trainer import Trainer
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/kornia/x/trainer.py", line 11, in <module>
supervisor-1  |     from accelerate import Accelerator
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/accelerate/__init__.py", line 3, in <module>
supervisor-1  |     from .accelerator import Accelerator
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/accelerate/accelerator.py", line 35, in <module>
supervisor-1  |     from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/accelerate/checkpointing.py", line 24, in <module>
supervisor-1  |     from .utils import (
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/accelerate/utils/__init__.py", line 131, in <module>
supervisor-1  |     from .bnb import has_4bit_bnb_layers, load_and_quantize_model
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/accelerate/utils/bnb.py", line 42, in <module>
supervisor-1  |     import bitsandbytes as bnb
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/bitsandbytes/__init__.py", line 6, in <module>
supervisor-1  |     from . import cuda_setup, research, utils
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/bitsandbytes/research/__init__.py", line 2, in <module>
supervisor-1  |     from .autograd._functions import (
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/bitsandbytes/research/autograd/_functions.py", line 8, in <module>
supervisor-1  |     from bitsandbytes.autograd._functions import GlobalOutlierPooler, MatmulLtState
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/bitsandbytes/autograd/__init__.py", line 1, in <module>
supervisor-1  |     from ._functions import get_inverse_transform_indices, undo_layout
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 10, in <module>
supervisor-1  |     import bitsandbytes.functional as F
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/bitsandbytes/functional.py", line 17, in <module>
supervisor-1  |     from .cextension import COMPILED_WITH_CUDA, lib
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 10, in <module>
supervisor-1  |     setup.run_cuda_setup()
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py", line 137, in run_cuda_setup
supervisor-1  |     binary_name, cudart_path, cc, cuda_version_string = evaluate_cuda_setup()
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py", line 367, in evaluate_cuda_setup
supervisor-1  |     cuda_version_string = get_cuda_version()
supervisor-1  |   File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py", line 335, in get_cuda_version
supervisor-1  |     major, minor = map(int, torch.version.cuda.split("."))
supervisor-1  | AttributeError: 'NoneType' object has no attribute 'split'
supervisor-1  | 2024-03-19 17:32:37,349 INFO exited: webui (exit status 1; not expected)
supervisor-1  |
supervisor-1  | ==> /var/log/supervisor/supervisor.log <==
supervisor-1  | 2024-03-19 17:32:37,349 INFO exited: webui (exit status 1; not expected)
supervisor-1  | 2024-03-19 17:32:37,350 INFO spawned: 'webui' with pid 1746
supervisor-1  | 2024-03-19 17:32:37,350 INFO spawned: 'webui' with pid 1746
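
The last frame of the traceback fails because `torch.version.cuda` is `None` on a ROCm build of PyTorch (the log reports torch 2.2.0+rocm5.7), so calling `.split(".")` on it raises the `AttributeError`. The sketch below illustrates the failure mode and a guarded alternative. It is not the bitsandbytes code: `parse_cuda_version` and `describe_backend` are hypothetical helpers, and the `SimpleNamespace` objects are stand-ins for `torch.version` with an illustrative HIP version string.

```python
from types import SimpleNamespace
from typing import Optional, Tuple


def parse_cuda_version(cuda: Optional[str]) -> Optional[Tuple[int, int]]:
    """Guarded variant of the failing line: return None instead of raising."""
    if cuda is None:
        return None
    # Only the first two components are used, e.g. "12.1" -> (12, 1).
    major, minor = map(int, cuda.split(".")[:2])
    return major, minor


def describe_backend(version) -> str:
    """Classify a torch.version-like object by its `cuda` and `hip` attributes."""
    if getattr(version, "cuda", None):
        return "cuda"
    if getattr(version, "hip", None):
        return "rocm"
    return "cpu"


# Stand-ins for torch.version (assumption: values chosen for illustration).
rocm_build = SimpleNamespace(cuda=None, hip="5.7.0")
cuda_build = SimpleNamespace(cuda="12.1", hip=None)

print(describe_backend(rocm_build), parse_cuda_version(rocm_build.cuda))
print(describe_backend(cuda_build), parse_cuda_version(cuda_build.cuda))
```

With a real installation, passing `torch.version` to `describe_backend` should give the same classification.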

robballantyne commented 6 months ago

You may well be right. Unfortunately I don't have access to an AMD card so I'm building blind for the target.

The default provisioning script is just an example so feel free to replace it or run without it.

icodeforyou-dot-net commented 6 months ago

Thanks for getting back to me that quickly!

> Unfortunately I don't have access to an AMD card so I'm building blind for the target.

I understand. Maybe I will have some free time and try getting the AMD fork of bitsandbytes into an image. There is also hope that bitsandbytes itself may add ROCm support in the not too distant future.

> The default provisioning script is just an example

Exploring the other path, I would trim the default provisioning script down to the bare minimum, see if that works, and then add things back from there.

But maybe you have an idea which things include bitsandbytes as a dependency? (I am not very familiar with the Stable Diffusion ecosystem yet.)
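
The traceback in the original report already hints at a partial answer: the import chain that reaches bitsandbytes can be read off the site-packages paths. The sketch below extracts that chain from a few frames copied from the log; `site_packages_chain` is a hypothetical helper written for this illustration.

```python
import re


def site_packages_chain(traceback_text: str) -> list[str]:
    """Return top-level package names under site-packages, in first-seen order."""
    chain = []
    for match in re.finditer(r'site-packages/([^/"]+)/', traceback_text):
        name = match.group(1)
        if name not in chain:
            chain.append(name)
    return chain


# Frames copied from the log above, without the "supervisor-1  |" prefix.
sample = '''
  File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/kornia/__init__.py", line 11, in <module>
  File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/kornia/x/trainer.py", line 11, in <module>
  File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/accelerate/__init__.py", line 3, in <module>
  File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/accelerate/utils/bnb.py", line 42, in <module>
  File "/opt/micromamba/envs/webui/lib/python3.10/site-packages/bitsandbytes/__init__.py", line 6, in <module>
'''

print(" -> ".join(site_packages_chain(sample)))  # kornia -> accelerate -> bitsandbytes
```

In this log the chain starts from the `import kornia` in the sgm module of the generative-models repository, so the crash happens on the web UI's own import path as soon as bitsandbytes is importable, not only inside an extension.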

If I succeed, would you be willing to accept a PR making some changes to the default scripts? And maybe adding a few lines to the README? Provided they don't break things on Nvidia once I am done with them, of course :slightly_smiling_face:

robballantyne commented 5 months ago

I have added a ROCm provisioning script to the config directory. Currently all it does is remove bitsandbytes, but I'll add the AMD version at build time as soon as possible.
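
The ROCm provisioning script itself is not shown in this thread. As a rough illustration of the "remove bitsandbytes" step, the sketch below uninstalls the package with pip when the PyTorch build reports no CUDA version. All function names here are hypothetical, and it assumes it runs with the same Python interpreter as the web UI environment. Removal presumably helps because accelerate appears to import bitsandbytes only when the package is installed; that is an assumption, not something the log confirms.

```python
import subprocess
import sys
from typing import List, Optional


def uninstall_command(package: str, python: str = sys.executable) -> List[str]:
    """Build a pip uninstall command for the given interpreter."""
    return [python, "-m", "pip", "uninstall", "-y", package]


def should_remove_bitsandbytes(cuda_version: Optional[str]) -> bool:
    """True when the torch build reports no CUDA version (e.g. a ROCm build)."""
    return cuda_version is None


def main() -> None:
    # torch is assumed to be installed in the target environment.
    import torch

    if should_remove_bitsandbytes(torch.version.cuda):
        subprocess.run(uninstall_command("bitsandbytes"), check=True)
```

Calling `main()` from inside the web UI environment would perform the removal; on a CUDA build it does nothing.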

Sorry for the delay, but I now have access to AMD hardware, so I can make improvements going forward.

ai-dock / stable-diffusion-webui, issue #14