Torch is not able to use GPU

atagen commented 1 year ago

when attempting to launch the automatic1111 UI, I'm met with the following:

[x@y:~/Code/etc/nix-stable-diffusion/stable-diffusion-webui]$ LD_PRELOAD="/run/opengl-driver/lib/libcuda.so" HSA_OVERRIDE_GFX_VERSION=10.3.0 python3 launch.py
Python 3.10.7 (main, Sep  5 2022, 13:12:31) [GCC 11.3.0]
Commit hash: 737eb28faca8be2bb996ee0930ec77d1f7ebd939
Traceback (most recent call last):
  File "/home/bolt/Code/etc/nix-stable-diffusion/stable-diffusion-webui/launch.py", line 205, in <module>
    prepare_enviroment()
  File "/home/bolt/Code/etc/nix-stable-diffusion/stable-diffusion-webui/launch.py", line 151, in prepare_enviroment
    run_python("import torch; assert torch.cuda.is_available(), 'Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check'")
  File "/home/bolt/Code/etc/nix-stable-diffusion/stable-diffusion-webui/launch.py", line 57, in run_python
    return run(f'"{python}" -c "{code}"', desc, errdesc)
  File "/home/bolt/Code/etc/nix-stable-diffusion/stable-diffusion-webui/launch.py", line 33, in run
    raise RuntimeError(message)
RuntimeError: Error running command.
Command: "/nix/store/wyhbl43ycqn43d08v5fqj1j6ynf7nz73-python3-3.10.7/bin/python3" -c "import torch; assert torch.cuda.is_available(), 'Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check'"
Error code: 1
stdout: <empty>
stderr: /nix/store/lvywargqhfhnmwhpk73zl2qy8qrbx0ql-python3.10-torch-1.12.1/lib/python3.10/site-packages/torch/cuda/__init__.py:83: UserWarning: HIP initialization: Unexpected error from hipGetDeviceCount(). Did you run some cuda functions before calling NumHipDevices() that might have already set an error? Error 101: hipErrorInvalidDevice (Triggered internally at  ../c10/hip/HIPFunctions.cpp:110.)
  return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError: Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check

nvidia-smi correctly shows my card from within the same shell:

[x@y:~/Code/etc/nix-stable-diffusion/stable-diffusion-webui]$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.53       Driver Version: 525.53       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:07:00.0  On |                  N/A |
|  0%   44C    P5    22W / 170W |    769MiB / 12288MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

skipping the torch CUDA test lets the UI launch, but leads to a plethora of errors while loading or attempting to generate anything. probably the only interesting one is:

/nix/store/lvywargqhfhnmwhpk73zl2qy8qrbx0ql-python3.10-torch-1.12.1/lib/python3.10/site-packages/torch/cuda/__init__.py:83: UserWarning: HIP initialization: Unexpected error from hipGetDeviceCount(). Did you run some cuda functions before calling NumHipDevices() that might have already set an error? Error 101: hipErrorInvalidDevice (Triggered internally at  ../c10/hip/HIPFunctions.cpp:110.)
  return torch._C._cuda_getDeviceCount() > 0
Warning: caught exception 'Unexpected error from hipGetDeviceCount(). Did you run some cuda functions before calling NumHipDevices() that might have already set an error? Error 101: hipErrorInvalidDevice', memory monitor disabled

my system is currently using the beta nvidia drivers (525 instead of 520), but I didn't have any better luck from switching back to stable.

please let me know if there's any further information I can provide/tests to run/etc to help.
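
The `HIP initialization` warning in the output above comes from torch's ROCm (AMD) code path, which suggests the installed torch was built for ROCm rather than CUDA. A minimal sketch for checking this from the same shell, assuming `torch.version.hip` and `torch.version.cuda` are populated as in current PyTorch builds (a version string for the backend the build targets, `None` otherwise). `describe_backend` is a hypothetical helper written for this example, not part of torch or of this flake:

```python
def describe_backend(hip_version, cuda_version):
    """Classify a torch build from the values of torch.version.hip / torch.version.cuda."""
    if hip_version:
        return "rocm"  # built against ROCm/HIP (AMD)
    if cuda_version:
        return "cuda"  # built against CUDA (NVIDIA)
    return "cpu"       # neither backend compiled in


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not importable in this environment")
    else:
        # getattr guards against builds that do not expose torch.version.hip.
        backend = describe_backend(getattr(torch.version, "hip", None), torch.version.cuda)
        print(f"torch {torch.__version__} build backend: {backend}")
        print(f"torch.cuda.is_available(): {torch.cuda.is_available()}")
```

If this reports `rocm` on a machine with an NVIDIA card, the driver is probably not the problem; the environment is simply providing a torch build for the other vendor.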

gbtb commented 1 year ago

Hello, atagen :wave:

Can you try to launch another UI from this flake, InvokeAI, and see if it works? It's the more stable and user-friendly one, imo (and the one I'm using myself). It might give us some information about the origin of your issue.

Also, could you explain why you set additional env variables before invoking launch.py? I'm using an AMD GPU, so I don't have much knowledge about running this stuff with NVIDIA GPUs. Maybe you have links to the instructions you followed?

atagen commented 1 year ago

hello :)

unfortunately, InvokeAI throws the same:

/nix/store/lvywargqhfhnmwhpk73zl2qy8qrbx0ql-python3.10-torch-1.12.1/lib/python3.10/site-packages/torch/cuda/__init__.py:83: UserWarning: HIP initialization: Unexpected error from hipGetDeviceCount(). Did you run some cuda functions before calling NumHipDevices() that might have already set an error? Error 101: hipErrorInvalidDevice (Triggered internally at  ../c10/hip/HIPFunctions.cpp:110.)
  return torch._C._cuda_getDeviceCount() > 0

re: the additional env variables, they're random tidbits I picked up searching around NixOS, SD, CUDA, etc. the former is trying to force Torch to see NixOS's CUDA, and I think the latter one is actually a fix for ROCm? they don't seem to have any bearing on the result whether present or not.
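
To make it easier to see which of those two variables a given shell actually has set, here is a small sketch. The notes in `VARIABLE_NOTES` reflect my understanding of what each variable is for and should be treated as assumptions; `report_gpu_env` is a hypothetical helper, not something the flake or the web UI provides:

```python
import os

# Assumption: illustrative notes covering only the two variables from the launch command.
VARIABLE_NOTES = {
    "LD_PRELOAD": "dynamic loader setting, used here to preload libcuda.so",
    "HSA_OVERRIDE_GFX_VERSION": "ROCm runtime setting for AMD GPUs",
}


def report_gpu_env(environ):
    """Return one human-readable line per known variable, set or not."""
    lines = []
    for name, note in VARIABLE_NOTES.items():
        value = environ.get(name)
        state = f"set to {value!r}" if value is not None else "not set"
        lines.append(f"{name}: {state} [{note}]")
    return lines


if __name__ == "__main__":
    for line in report_gpu_env(os.environ):
        print(line)
```

Neither variable changes which backend torch was compiled against, which would be consistent with them having no effect on the result here.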

atagen commented 1 year ago

just noticed this - seems the nvidia override wasn't getting called after all:

https://github.com/gbtb/nix-stable-diffusion/blob/c9db788451a8a51b4fd64cad2b937c94e9667471/flake.nix#L203

I've corrected this to overlay_nvidia to match the torch overlay that enables CUDA - this appears to be chugging through a bunch of CUDA related stuff (whereas before it would drop me straight into the shell).

gbtb commented 1 year ago

Good eye :eyes: Classic copy-paste error. With the lazy nature of Nix and the lack of functional tests, errors like this can slip through. You can submit a PR if you want, or I'll fix it myself tomorrow.

atagen commented 1 year ago

I'll be happy to submit one, just waiting for the whole thing to compile so I can confirm it works - none of the caches I use have these versions for some reason, perhaps something to do with CUDA being unfree.

atagen commented 1 year ago

success! for sanity's sake, I might make a second change to switch the nvidia torch and torchvision to their binary counterparts too; it looks like this already happens for AMD anyway.

gbtb / nix-stable-diffusion

Torch is not able to use GPU #14