AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: All versions of A1111 stopped working at the same time. #12590

Open leucome opened 11 months ago

leucome commented 11 months ago

Edit: This was most likely caused by an extension installing/updating bitsandbytes. See ruler88's comment for a short explanation and a possible fix.

What happened?

So I generated images, went to sleep, then woke up and A1111 was not working anymore. While trying to diagnose it, I noticed that every installed version had also stopped working: 1.4, 1.5.1, the dev branch (bd4b4292ef6c2cb0a452b7105485ec06301b7531), and the 1.4 build from the developer of the Restart sampler.

Some versions start but cannot load any SD model, while others just crash at launch with undefined symbol: cudaRuntimeGetVersion.

Things I already tried:

Re-installed versions 1.4, 1.5.1, and the dev branch from scratch.
Re-installed ROCm 5.5 and ROCm 5.6, and installed torch builds for both 5.5 and 5.6.
Reset my entire system to a one-week-old state with Timeshift.
Created the venv from a local installation of Python 3.10.
Created the venv from a miniconda installation of Python 3.10.
Re-installed the miniconda environment from scratch too, just in case.
Set HSA_OVERRIDE_GFX_VERSION='11.0.0'.

I was able to confirm that ROCm with the 7900 XT and PyTorch are definitely working. ComfyUI works fine and Vlad's webui also works fine. It only affects A1111... So it is probably something that all these A1111 versions share, most likely something that can update itself at launch, because it broke by itself during the night without any manual update or reboot.
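
For reference, a quick sanity check (run from inside the webui venv) that the ROCm PyTorch build still sees the GPU looks something like:

python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"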

Seriously I am a bit confused.

Steps to reproduce the problem

Sleep for a couple of hours, then hope it will break itself.

What should have happened?

Ideally, no self-destruction.

Version or Commit where the problem happens

Multiple versions

What Python version are you running on ?

Python 3.10.x

What platforms do you use to access the UI ?

Linux

What device are you running WebUI on?

Other GPUs

Cross attention optimization

Automatic

What browsers do you use to access the UI ?

Mozilla Firefox

Command Line Arguments

--medvram --no-half-vae (also tried with no arguments at all)

List of extensions

I also tried with none.

Console logs

Here are logs with the two most common errors. I have run so many tests already that I also hit other errors, but those are probably not relevant...

------
------log 1.5.1
------
/mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:611: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
Loading weights [7671c36151] from /mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
loading stable diffusion model: SafetensorError
Traceback (most recent call last):
  File "/home/leucome/miniconda3/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "/home/leucome/miniconda3/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/leucome/miniconda3/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/webui.py", line 318, in load_model
    shared.sd_model  # noqa: B018
  File "/mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/modules/shared.py", line 754, in sd_model
    return modules.sd_models.model_data.get_sd_model()
  File "/mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/modules/sd_models.py", line 439, in get_sd_model
    load_model()
  File "/mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/modules/sd_models.py", line 481, in load_model
    state_dict = get_checkpoint_state_dict(checkpoint_info, timer)
  File "/mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/modules/sd_models.py", line 277, in get_checkpoint_state_dict
    res = read_state_dict(checkpoint_info.filename)
  File "/mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/modules/sd_models.py", line 253, in read_state_dict
    pl_sd = safetensors.torch.load_file(checkpoint_file, device=device)
  File "/mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/venv/lib/python3.10/site-packages/safetensors/torch.py", line 259, in load_file
    with safe_open(filename, framework="pt", device=device) as f:
safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer

Stable diffusion model failed to load
Applying attention optimization: Doggettx... done.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 4.7s (launcher: 1.1s, import torch: 1.7s, import gradio: 0.4s, setup paths: 0.4s, other imports: 0.3s, list SD models: 0.1s, load scripts: 0.2s, create ui: 0.3s, gradio launch: 0.2s).

Stable diffusion model failed to load
Loading weights [7671c36151] from /mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
changing setting sd_model_checkpoint to v1-5-pruned-emaonly.safetensors [7671c36151]: SafetensorError
Traceback (most recent call last):
  File "/mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/modules/shared.py", line 633, in set
    self.data_labels[key].onchange()
  File "/mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/modules/call_queue.py", line 14, in f
    res = func(*args, **kwargs)
  File "/mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/webui.py", line 238, in <lambda>
    shared.opts.onchange("sd_model_checkpoint", wrap_queued_call(lambda: modules.sd_models.reload_model_weights()), call=False)
  File "/mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/modules/sd_models.py", line 570, in reload_model_weights
    state_dict = get_checkpoint_state_dict(checkpoint_info, timer)
  File "/mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/modules/sd_models.py", line 277, in get_checkpoint_state_dict
    res = read_state_dict(checkpoint_info.filename)
  File "/mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/modules/sd_models.py", line 253, in read_state_dict
    pl_sd = safetensors.torch.load_file(checkpoint_file, device=device)
  File "/mnt/4TB/Downloads/A1111/stable-diffusion-webui-1.5.1/venv/lib/python3.10/site-packages/safetensors/torch.py", line 259, in load_file
    with safe_open(filename, framework="pt", device=device) as f:
safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer

------
------Log 1.4
------
[leucome@Ryzen-One stable-diffusion-webui]$ ./webui.sh

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye)
################################################################

################################################################
Running on leucome user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
Create and activate python venv
################################################################

################################################################
Launching launch.py...
################################################################
Using TCMalloc: libtcmalloc_minimal.so.4
fatal: No names found, cannot describe anything.
Python 3.10.9 (main, Jan 11 2023, 15:21:40) [GCC 11.2.0]
Version: ## 1.4.0
Commit hash: 8de6d3ff77e841a5fd9d5f1b16bdd22737c8d657
Installing requirements

No module 'xformers'. Proceeding without it.
If submitting an issue on github, please provide the full startup log for debugging purposes.

Initializing Dreambooth
Dreambooth revision: b4053defa6ae018b2ea56ac243aa55063f76fe0e
Successfully installed accelerate-0.21.0 fastapi-0.94.1 gitpython-3.1.32 transformers-4.30.2

Does your project take forever to startup?
Repetitive dependency installation may be the reason.
Automatic1111's base project sets strict requirements on outdated dependencies.
If an extension is using a newer version, the dependency is uninstalled and reinstalled twice every startup.

[!] xformers NOT installed.
[+] torch version 2.1.0.dev20230814+rocm5.5 installed.
[+] torchvision version 0.16.0.dev20230814+rocm5.5 installed.
[+] accelerate version 0.21.0 installed.
[+] diffusers version 0.19.3 installed.
[+] transformers version 4.30.2 installed.
[+] bitsandbytes version 0.35.4 installed.

Launching Web UI with arguments: 
/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:611: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
Traceback (most recent call last):
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1086, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/home/leucome/miniconda3/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 85, in <module>
    from accelerate import __version__ as accelerate_version
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/accelerate/__init__.py", line 3, in <module>
    from .accelerator import Accelerator
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 35, in <module>
    from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/accelerate/checkpointing.py", line 24, in <module>
    from .utils import (
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/accelerate/utils/__init__.py", line 131, in <module>
    from .bnb import has_4bit_bnb_layers, load_and_quantize_model
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/accelerate/utils/bnb.py", line 42, in <module>
    import bitsandbytes as bnb
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/bitsandbytes/__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 5, in <module>
    import bitsandbytes.functional as F
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 13, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 113, in <module>
    lib = CUDASetup.get_instance().lib
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 109, in get_instance
    cls._instance.initialize()
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 59, in initialize
    binary_name, cudart_path, cuda, cc, cuda_version_string = evaluate_cuda_setup()
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py", line 125, in evaluate_cuda_setup
    cuda_version_string = get_cuda_version(cuda, cudart_path)
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py", line 45, in get_cuda_version
    check_cuda_result(cuda, cudart.cudaRuntimeGetVersion(ctypes.byref(version)))
  File "/home/leucome/miniconda3/lib/python3.10/ctypes/__init__.py", line 387, in __getattr__
    func = self.__getitem__(name)
  File "/home/leucome/miniconda3/lib/python3.10/ctypes/__init__.py", line 392, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: python3: undefined symbol: cudaRuntimeGetVersion

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/launch.py", line 38, in <module>
    main()
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/launch.py", line 34, in main
    start()
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/modules/launch_utils.py", line 340, in start
    import webui
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/webui.py", line 28, in <module>
    import pytorch_lightning   # noqa: F401 # pytorch_lightning should be imported after torch, but it re-enables warnings on import so import once to disable them
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/pytorch_lightning/__init__.py", line 35, in <module>
    from pytorch_lightning.callbacks import Callback  # noqa: E402
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/pytorch_lightning/callbacks/__init__.py", line 14, in <module>
    from pytorch_lightning.callbacks.batch_size_finder import BatchSizeFinder
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/pytorch_lightning/callbacks/batch_size_finder.py", line 24, in <module>
    from pytorch_lightning.callbacks.callback import Callback
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/pytorch_lightning/callbacks/callback.py", line 25, in <module>
    from pytorch_lightning.utilities.types import STEP_OUTPUT
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/pytorch_lightning/utilities/types.py", line 27, in <module>
    from torchmetrics import Metric
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/torchmetrics/__init__.py", line 14, in <module>
    from torchmetrics import functional  # noqa: E402
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/torchmetrics/functional/__init__.py", line 121, in <module>
    from torchmetrics.functional.text._deprecated import _bleu_score as bleu_score
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/torchmetrics/functional/text/__init__.py", line 31, in <module>
    from torchmetrics.functional.text.bert import bert_score
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/torchmetrics/functional/text/bert.py", line 25, in <module>
    from torchmetrics.functional.text.helper_embedding_metric import (
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/torchmetrics/functional/text/helper_embedding_metric.py", line 27, in <module>
    from transformers import AutoModelForMaskedLM, AutoTokenizer, PreTrainedModel, PreTrainedTokenizerBase
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1076, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/mnt/4TB/Downloads/A1111/restartsampler/stable-diffusion-webui/venv/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1088, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
python3: undefined symbol: cudaRuntimeGetVersion

--------------------------

Additional information

My Linux distro is Manjaro, in case it is an issue specific to Manjaro. I doubt it, but who knows.

akx commented 10 months ago

I'd look at system logs for the time you were asleep (journalctl) to figure out if e.g. something was automatically updated in the background.
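
For example, something along these lines (the time window is illustrative, and on an Arch-based system like Manjaro the package manager also keeps its own log at /var/log/pacman.log):

# Journal entries for the overnight window (adjust the timestamps)
journalctl --since "2023-08-17 23:00" --until "2023-08-18 09:00" --no-pager | grep -iE 'upgrad|install'
# Arch/Manjaro package manager log
grep -i upgraded /var/log/pacman.log | tail -n 50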

leucome commented 10 months ago

I'd look at system logs for the time you were asleep (journalctl) to figure out if e.g. something was automatically updated in the background.

I could not find anything helpful in the log.
So I tried deleting the pip cache to force a re-download of the software installed by pip, reinstalled every single package on the system with pacman, then tried a new user, and also tried with Python 3.11. So at least I know it is not caused by damaged files or a wrong user setting.

So next I'll format and re-install the OS. The bug report has been up long enough that somebody would have confirmed the same issue if it were caused by an update. Also, A1111 still works on my second computer on the same OS. The only difference is that the other computer has a 6700 XT. There is still a small chance that it is caused by an update that affects only 7000-series GPUs... I'll know for sure after re-installing the OS.

Though I still wonder what, why, and how it can affect every A1111 version but none of the other Stable Diffusion UIs.

leucome commented 10 months ago

Completely re-installing the system finally worked. So it works now, but I'll never know what was wrong.

SamSaffron commented 10 months ago

Just got the same thing when I updated today... but a whole OS reinstall ain't gonna happen; going to need to figure this out...

mateuspestana commented 10 months ago

The same is happening to me. The problem started to appear right after an update of the Dreambooth extension.

SamSaffron commented 10 months ago

I tried the dev branch as well, and it also fails. python3: undefined symbol: cudaRuntimeGetVersion looks like a possible bug in bitsandbytes.

huuck commented 10 months ago

This is still happening randomly. I just rebooted my machine (it was working fine before, no updates, nothing) and it happened again.

huuck commented 10 months ago

Also why is it closed?

leucome commented 10 months ago

Also why is it closed?

I closed it because it looked like a local issue with my system OS/config or something. But since then, other people have had really similar error messages after updating. So maybe it really is an update that breaks something. I guess I'll re-open it.

DaveParr commented 10 months ago

Same issue, Pop!_os

SamSaffron commented 10 months ago

I had an issue with an extension depending on bitsandbytes; removing it fixed boot for me.


huuck commented 10 months ago

Alright, so after closing this and firing up A1111 again, it crashed with another random library error. What fixed it for good was this:

export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libcudart.so
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libcudart.so.11.5.117
export LD_LIBRARY_PATH="/usr/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH"

Seems to be some kind of dependency issue. Get your sorbet together, ML community :/
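
Note that these exports have to be set in the same shell session that launches the UI (or persisted in webui-user.sh); as a sketch, using the path from above (which is specific to this system):

# Export in the same terminal, then launch so the process inherits it
export LD_LIBRARY_PATH="/usr/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH"
./webui.sh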

ruler88 commented 10 months ago

For people trying to decipher all the previous comments who just want a quick fix: in your stable-diffusion-webui directory, run:

pip uninstall bitsandbytes

Then running ./webui.sh should work again.
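
Make sure pip here is the one inside the webui's venv rather than the system pip; a sketch assuming the default venv location:

cd stable-diffusion-webui
source venv/bin/activate
pip uninstall -y bitsandbytes
deactivate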

But it's possible that one of your extensions is installing bitsandbytes. For me, it was the Dreambooth extension. For a better fix, you need to figure out the directory where your CUDA runtime is installed and run: export LD_LIBRARY_PATH="/<cuda dir>:$LD_LIBRARY_PATH"

If you are on a Linux machine, it's likely somewhere under /usr/local or /usr/lib.
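
As a sketch, one way to locate the runtime and point LD_LIBRARY_PATH at it (the exact directory varies per system; /usr/local/cuda/lib64 below is only an example):

# Find where the CUDA runtime library actually lives
find /usr/local /usr/lib -name 'libcudart.so*' 2>/dev/null
# Then prepend that directory, e.g. if it turned up under /usr/local/cuda/lib64:
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"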

leucome commented 10 months ago

Oh yeah, I do use the Dreambooth extension to train LoRAs, so it is pretty certain that I had bitsandbytes installed.

DaveParr commented 10 months ago

Moved over to the dockerised versions. That seems to solve it for now.