AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: RuntimeError: Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check #14617

Open thefreemanever opened 10 months ago

thefreemanever commented 10 months ago


What happened?

I installed A1111 yesterday and it was working fine. Today I tried to launch it again with "bash webui.sh" (which seems to be the way to launch it on Linux/Ubuntu), but the result is:

RuntimeError: Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check

PS: I have two 4090 GPUs:


echo $CUDA_VISIBLE_DEVICES
0,1
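Before reaching for --skip-torch-cuda-test, it may help to confirm whether the torch build inside the webui's venv can see those GPUs at all. A minimal sketch, assuming you run it with the venv's interpreter (e.g. venv/bin/python):

```python
# Minimal CUDA-visibility check; run with the webui venv's interpreter.
import os

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
try:
    import torch
    print("torch version:", torch.__version__)
    print("cuda available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device count:", torch.cuda.device_count())
except ImportError:
    print("torch is not importable with this interpreter")
```

If `cuda available` prints False here, the launcher's RuntimeError is expected, and the problem is the torch build or driver rather than the webui itself.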

Steps to reproduce the problem

try to install/launch on Ubuntu

What should have happened?

App should launch

What browsers do you use to access the UI ?

Mozilla Firefox

Sysinfo

sysinfo-2024-01-11-02-38.json

Console logs

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye), Fedora 34+ and openSUSE Leap 15.4 or newer.
################################################################

################################################################
Running on a user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
Create and activate python venv
################################################################

################################################################
Launching launch.py...
################################################################
Cannot locate TCMalloc (improves CPU memory usage)
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
Version: v1.7.0
Commit hash: cf2772fab0af5573da775e7437e6acdca424f26e
Traceback (most recent call last):
  File "/home/a/ais/stable-diffusion-webui/launch.py", line 48, in <module>
    main()
  File "/home/a/ais/stable-diffusion-webui/launch.py", line 39, in main
    prepare_environment()
  File "/home/a/ais/stable-diffusion-webui/modules/launch_utils.py", line 384, in prepare_environment
    raise RuntimeError(
RuntimeError: Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check

### Additional information

I don't want to disable the GPU, but even if I try "./webui.sh --skip-torch-cuda-test" I get the following result:

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye), Fedora 34+ and openSUSE Leap 15.4 or newer.
################################################################

################################################################
Running on a user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
Create and activate python venv
################################################################

################################################################
Launching launch.py...
################################################################
Cannot locate TCMalloc (improves CPU memory usage)
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
Version: v1.7.0
Commit hash: cf2772fab0af5573da775e7437e6acdca424f26e
Cloning Stable Diffusion into /home/a/ais/stable-diffusion-webui/repositories/stable-diffusion-stability-ai...
Cloning into '/home/a/ais/stable-diffusion-webui/repositories/stable-diffusion-stability-ai'...
remote: Enumerating objects: 580, done.
remote: Counting objects: 100% (310/310), done.
remote: Compressing objects: 100% (92/92), done.
remote: Total 580 (delta 248), reused 218 (delta 218), pack-reused 270
Receiving objects: 100% (580/580), 73.43 MiB | 1.64 MiB/s, done.
Resolving deltas: 100% (280/280), done.
Cloning Stable Diffusion XL into /home/a/ais/stable-diffusion-webui/repositories/generative-models...
Cloning into '/home/a/ais/stable-diffusion-webui/repositories/generative-models'...
remote: Enumerating objects: 860, done.
remote: Counting objects: 100% (503/503), done.
remote: Compressing objects: 100% (233/233), done.
remote: Total 860 (delta 359), reused 309 (delta 266), pack-reused 357
Receiving objects: 100% (860/860), 42.67 MiB | 1.52 MiB/s, done.
Resolving deltas: 100% (435/435), done.
Cloning K-diffusion into /home/a/ais/stable-diffusion-webui/repositories/k-diffusion...
Cloning into '/home/a/ais/stable-diffusion-webui/repositories/k-diffusion'...
remote: Enumerating objects: 1329, done.
remote: Counting objects: 100% (611/611), done.
remote: Compressing objects: 100% (81/81), done.
remote: Total 1329 (delta 568), reused 538 (delta 530), pack-reused 718
Receiving objects: 100% (1329/1329), 239.04 KiB | 941.00 KiB/s, done.
Resolving deltas: 100% (931/931), done.
Cloning CodeFormer into /home/a/ais/stable-diffusion-webui/repositories/CodeFormer...
Cloning into '/home/a/ais/stable-diffusion-webui/repositories/CodeFormer'...
remote: Enumerating objects: 594, done.
remote: Counting objects: 100% (245/245), done.
remote: Compressing objects: 100% (88/88), done.
remote: Total 594 (delta 175), reused 173 (delta 157), pack-reused 349
Receiving objects: 100% (594/594), 17.31 MiB | 1.83 MiB/s, done.
Resolving deltas: 100% (286/286), done.
Cloning BLIP into /home/a/ais/stable-diffusion-webui/repositories/BLIP...
Cloning into '/home/a/ais/stable-diffusion-webui/repositories/BLIP'...
remote: Enumerating objects: 277, done.
remote: Counting objects: 100% (165/165), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 277 (delta 137), reused 136 (delta 135), pack-reused 112
Receiving objects: 100% (277/277), 7.03 MiB | 2.29 MiB/s, done.
Resolving deltas: 100% (152/152), done.
Launching Web UI with arguments: --skip-torch-cuda-test
2024-01-10 18:18:33.606598: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-01-10 18:18:33.627472: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-10 18:18:33.627494: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-10 18:18:33.628352: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-10 18:18:33.632224: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-10 18:18:34.032378: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.: str
Traceback (most recent call last):
  File "/home/a/ais/stable-diffusion-webui/modules/errors.py", line 98, in run
    code()
  File "/home/a/ais/stable-diffusion-webui/modules/devices.py", line 76, in enable_tf32
    device_id = (int(shared.cmd_opts.device_id) if shared.cmd_opts.device_id is not None and shared.cmd_opts.device_id.isdigit() else 0) or torch.cuda.current_device()
  File "/home/a/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 769, in current_device
    _lazy_init()
  File "/home/a/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/a/ais/stable-diffusion-webui/launch.py", line 48, in <module>
    main()
  File "/home/a/ais/stable-diffusion-webui/launch.py", line 44, in main
    start()
  File "/home/a/ais/stable-diffusion-webui/modules/launch_utils.py", line 460, in start
    import webui
  File "/home/a/ais/stable-diffusion-webui/webui.py", line 13, in <module>
    initialize.imports()
  File "/home/a/ais/stable-diffusion-webui/modules/initialize.py", line 34, in imports
    shared_init.initialize()
  File "/home/a/ais/stable-diffusion-webui/modules/shared_init.py", line 17, in initialize
    from modules import options, shared_options
  File "/home/a/ais/stable-diffusion-webui/modules/shared_options.py", line 3, in <module>
    from modules import localization, ui_components, shared_items, shared, interrogate, shared_gradio_themes
  File "/home/a/ais/stable-diffusion-webui/modules/interrogate.py", line 13, in <module>
    from modules import devices, paths, shared, lowvram, modelloader, errors
  File "/home/a/ais/stable-diffusion-webui/modules/devices.py", line 84, in <module>
    errors.run(enable_tf32, "Enabling TF32")
  File "/home/a/ais/stable-diffusion-webui/modules/errors.py", line 100, in run
    display(task, e)
  File "/home/a/ais/stable-diffusion-webui/modules/errors.py", line 68, in display
    te = traceback.TracebackException.from_exception(e)
  File "/usr/lib/python3.10/traceback.py", line 572, in from_exception
    return cls(type(exc), exc, exc.__traceback__, *args, **kwargs)
AttributeError: 'str' object has no attribute '__traceback__'
stansu commented 10 months ago

I have this error too, but restarting the computer fixed it.

oh-bala commented 10 months ago

I tested this on Windows 11 with the cases below, and eventually realized the Python package / virtualenv was not set up properly.

First try:

  1. Use the sd.webui.zip release package and start run.bat => get RuntimeError: Torch is not able to use GPU
  2. Launch cmd.exe and test torch.cuda.is_available() => get False
  3. Check torch.__version__ => ends in +cpu (a CPU-only build)

Second try:

  1. Enter WSL2, install pytorch
  2. test torch.cuda.is_available() => get True
  3. Check torch.__version__ => cu117

Third try:

  1. Install Anaconda
  2. Launch Anaconda prompt and install pytorch and other dependencies
  3. Enter sd.webui directory and execute run.bat => get RuntimeError: Torch is not able to use GPU

Relaunched Anaconda and, this time, tested torch.cuda.is_available() directly => get True

From whichever environment reports CUDA as available, get its Python executable path (import sys; print(sys.executable)) and assign it to the set PYTHON= line in the webui-user.bat file.
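The check described above can be sketched as a small script: run it with each candidate interpreter and use the one where CUDA is available (the "+cpu" note is an assumption about how CPU-only torch wheels tag their version string):

```python
# Print this interpreter's path for webui-user.bat's "set PYTHON=" line,
# plus whether its torch build was compiled with CUDA support.
import sys

print("interpreter:", sys.executable)
try:
    import torch
    print("torch:", torch.__version__)  # a "+cpu" suffix means a CPU-only wheel
    print("cuda available:", torch.cuda.is_available())
except ImportError:
    print("torch is not installed for this interpreter")
```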

josephrocca commented 10 months ago

FWIW, I get this "Torch is not able to use GPU" error for a 4090 machine with:

"Platform": "Linux-6.2.0-39-generic-x86_64-with-glibc2.35",
"nvidia_driver_version": "545.29.06",

But the same docker image works fine on a 4090 machine with:

"Platform": "Linux-5.15.0-91-generic-x86_64-with-glibc2.35",
"nvidia_driver_version": "545.23.08",

Everything else is pretty much the same on both machines.

Not sure which (if any) of those is the cause. Note that I'm still on v1.6, and I'm guessing the above comments are about v1.7.
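To compare the two machines the same way, something like this prints the platform string and driver version side by side (it assumes nvidia-smi is on PATH wherever a driver is installed; --query-gpu=driver_version is a standard nvidia-smi query):

```python
# Print platform string and NVIDIA driver version for easy comparison.
import platform
import shutil
import subprocess

print("platform:", platform.platform())
smi = shutil.which("nvidia-smi")
if smi:
    result = subprocess.run(
        [smi, "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    print("nvidia driver:", result.stdout.strip() or result.stderr.strip())
else:
    print("nvidia-smi not found on PATH")
```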

JamingDE commented 9 months ago

OK, but how do I fix this?

ghost commented 8 months ago

This seems to work for me:

# Download CUDA pin file
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600

# Download CUDA repository package
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda-repo-wsl-ubuntu-12-4-local_12.4.0-1_amd64.deb

# --- optional: rebuild/upgrade the GCC toolchain (skip if yours is recent) ---

# Update package lists
sudo apt update

# Upgrade packages
sudo apt upgrade -y
sudo apt dist-upgrade -y
sudo apt full-upgrade -y

sudo apt install software-properties-common
sudo add-apt-repository ppa:ubuntu-toolchain-r/test

sudo dpkg --configure -a

sudo apt update

sudo apt install gcc-12 g++-12 gcc-13 g++-13 -y

sudo dpkg --configure -a

sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12 --slave /usr/bin/g++ g++ /usr/bin/g++-12
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-13 13 --slave /usr/bin/g++ g++ /usr/bin/g++-13

sudo dpkg --configure -a

sudo update-alternatives --config gcc

sudo dpkg --configure -a

gcc --version

sudo apt install build-essential
sudo apt install libmpfr-dev libgmp3-dev libmpc-dev -y
wget http://ftp.gnu.org/gnu/gcc/gcc-13.2.0/gcc-13.2.0.tar.gz
tar -xf gcc-13.2.0.tar.gz

sudo dpkg --configure -a

cd gcc-13.2.0

sudo dpkg --configure -a

./configure -v --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --prefix=/usr/local/gcc-13.2.0 --enable-checking=release --enable-languages=c,c++ --disable-multilib --program-suffix=-13.2.0

sudo dpkg --configure -a

make -j3
sudo make install

sudo dpkg --configure -a

/usr/local/gcc-13.2.0/bin/gcc-13.2.0 --version

sudo dpkg --configure -a

# Install required packages. The exact version pins below came from an
# Ubuntu 22.04-era system and may need adjusting; apt only accepts exact
# "pkg=version" pins, so the original ">=" constraints are left unpinned
# here (unquoted ">=" would be parsed by the shell as a redirection).
sudo apt install -y \
    libxkbcommon0=1.6.0-1 \
    fakeroot \
    libalgorithm-merge-perl \
    cpp=4:13.2.0-7 \
    cpp-x86-64-linux-gnu=4:13.2.0-7 \
    g++=4:13.2.0-7 \
    g++-13 \
    g++-x86-64-linux-gnu=4:13.2.0-7 \
    gcc=4:13.2.0-7 \
    gcc-13 \
    gcc-x86-64-linux-gnu=4:13.2.0-7 \
    libglvnd0=1.7.0-1 \
    lto-disabled-list \
    openjdk-17-jre-headless=17.0.10+7-1~22.04.1 \
    libatk-wrapper-java-jni

# --- end of optional GCC toolchain section ---

# Install CUDA repository
sudo dpkg -i cuda-repo-wsl-ubuntu-12-4-local_12.4.0-1_amd64.deb

# Copy CUDA keyring
sudo cp /var/cuda-repo-wsl-ubuntu-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/

# Update package lists
sudo apt-get update

# Install CUDA toolkit
sudo apt-get -y install cuda-toolkit-12-4

# Install pycuda (optional)
#sudo pip3 install pycuda
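After the toolkit install, a quick check that nvcc actually landed; the /usr/local/cuda-12.4 path is an assumption based on the default install prefix for this package:

```python
# Locate nvcc either on PATH or in the default CUDA 12.4 install prefix.
import os
import shutil

default = "/usr/local/cuda-12.4/bin/nvcc"
nvcc = shutil.which("nvcc") or (default if os.path.exists(default) else None)
if nvcc:
    print("nvcc found at:", nvcc)
else:
    print("nvcc not found; try adding /usr/local/cuda-12.4/bin to PATH")
```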