ai-dock / stable-diffusion-webui

AUTOMATIC1111 (A1111) Stable Diffusion Web UI docker images for use in GPU cloud and local environments. Includes AI-Dock base for authentication and improved user experience.

Running Locally : Torch is not able to use GPU #5

Open FyzzLive opened 6 months ago

FyzzLive commented 6 months ago

Pulled the repo, edited the env, added my own provisioning script, and ran docker compose up. Received this error after the build process as it was starting:

    RuntimeError: Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check

zxcjhg commented 6 months ago

just add runtime: nvidia to the docker-compose.yaml

FyzzLive commented 6 months ago

Now getting services.runtime must be a mapping

    #docker-compose.yaml
    runtime: nvidia
    environment:
        # Don't enclose values in quotes
        - DIRECT_ADDRESS=${DIRECT_ADDRESS:-127.0.0.1}
        - DIRECT_ADDRESS_GET_WAN=${DIRECT_ADDRESS_GET_WAN:-false}
        ...
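For what it's worth, that error usually means the runtime key has ended up directly under services: (so Compose reads it as a service named runtime) rather than inside the service definition. A minimal sketch of the intended placement, assuming the compose service is named supervisor:

    # docker-compose.yaml (sketch only; service name assumed)
    services:
      supervisor:
        runtime: nvidia
        environment:
          # Don't enclose values in quotes
          - DIRECT_ADDRESS=${DIRECT_ADDRESS:-127.0.0.1}
          - DIRECT_ADDRESS_GET_WAN=${DIRECT_ADDRESS_GET_WAN:-false}
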
9Dave9 commented 2 months ago

Any update on this?

Running docker compose up from a freshly cloned repo and getting the same issue.

System: Ubuntu 20.04, Nvidia RTX 2070 - seems to spin up everything except for Automatic1111

Error when running compose:


supervisor-1  | ==> /var/log/supervisor/webui.log <==
supervisor-1  | INFO:     Shutting down
supervisor-1  | INFO:     Waiting for application shutdown.
supervisor-1  | INFO:     Application shutdown complete.
supervisor-1  | INFO:     Finished server process [407]
supervisor-1  | Starting A1111 SD Web UI...
supervisor-1  | Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
supervisor-1  | Version: v1.9.3
supervisor-1  | Commit hash: 1c0a0c4c26f78c32095ebc7f8af82f5c04fca8c0
supervisor-1  | 2024-04-23 11:34:28,438 INFO exited: webui (exit status 1; not expected)
supervisor-1  | 
supervisor-1  | ==> /var/log/supervisor/supervisor.log <==
supervisor-1  | 2024-04-23 11:34:28,438 INFO exited: webui (exit status 1; not expected)
supervisor-1  | 
supervisor-1  | ==> /var/log/supervisor/webui.log <==
supervisor-1  | Traceback (most recent call last):
supervisor-1  |   File "/workspace/stable-diffusion-webui/launch.py", line 48, in <module>
supervisor-1  |     main()
supervisor-1  |   File "/workspace/stable-diffusion-webui/launch.py", line 39, in main
supervisor-1  |     prepare_environment()
supervisor-1  |   File "/workspace/stable-diffusion-webui/modules/launch_utils.py", line 386, in prepare_environment
supervisor-1  |     raise RuntimeError(
supervisor-1  | RuntimeError: Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check
supervisor-1  | 2024-04-23 11:34:29,109 INFO spawned: 'webui' with pid 876

Changing the docker-compose.yaml file and adding runtime: nvidia seems to bomb too, as Docker doesn't recognise that runtime name:

    Error response from daemon: unknown or invalid runtime name: nvidia
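That daemon error generally means the nvidia runtime isn't registered with Docker, i.e. the NVIDIA Container Toolkit isn't installed or configured on the host. A rough sketch of the host-side setup on Ubuntu (assuming NVIDIA's apt repository has already been added):

    # On the host, not inside the container
    sudo apt-get install -y nvidia-container-toolkit
    # Register the nvidia runtime with the Docker daemon, then restart it
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker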

=====

This seems to be an issue with running the Auto web UI generally, as folks are talking about it there as well. Proposed solution here: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/1742#issuecomment-1297210099

I tried adding this to the build/COPY_ROOT/opt/ai-dock/bin/update-webui.sh file, before the web UI is started, but it doesn't really do anything.

===

rmmod, lsmod and modprobe aren't packaged with the linux/amd64 platform. Jumped into the running container and there's nothing sitting in /sbin.

So I ran sudo apt install linux-generic -y manually in the container (for testing purposes), but rmmod nvidia-uvm isn't possible because the module is in use at that point.
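For reference, the workaround in that linked comment is a host-level module reload; it can't really be done from inside the container because the nvidia modules belong to the host kernel. Roughly, on the host it amounts to:

    # On the host, not inside the container (sketch of the linked workaround)
    sudo rmmod nvidia_uvm
    sudo modprobe nvidia_uvm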

======

I have downloaded and run a different tag as well: https://github.com/ai-dock/stable-diffusion-webui/pkgs/container/stable-diffusion-webui/205862819?tag=cuda-12.1.1-runtime-22.04-v1.8.0

Exact same behaviour unfortunately.

======

Last-ditch attempt: set AUTO_UPDATE=true in the hope that something would have been fixed in the latest version of AUTOMATIC1111... nope :(

=======

Basically at a dead end unless I want to run this in "CPU mode" .. which I don't.

djsiroky commented 1 month ago

Per this Docker article about giving access to the GPU, I added the following to docker-compose.yaml after getting the same error and things seem to be working now:

    devices:
      - "/dev/dri:/dev/dri"
      # For AMD GPU
      #- "/dev/kfd:/dev/kfd"
+
+   deploy:
+     resources:
+       reservations:
+         devices:
+           - driver: nvidia
+             count: 1
+             capabilities: [gpu]

    volumes:
      # Workspace
      - ./workspace:${WORKSPACE:-/workspace/}:rshared
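For anyone trying this, the deploy block above still relies on the NVIDIA Container Toolkit being present on the host. A quick sanity check that Docker can see the GPU at all, independent of this repo (the CUDA image tag below is just an example), is:

    docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

If that prints the GPU table, recreating the stack with docker compose up --force-recreate should get past the Torch check.
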
Stan-Stani commented 1 month ago

A question I have is why does it work in the cloud but not locally.

Why doesn't it need to have an explicit GPU configuration when in the cloud?

Stan-Stani commented 1 month ago

Per this Docker article about giving access to the GPU, I added the following to docker-compose.yaml after getting the same error and things seem to be working now:

    devices:
      - "/dev/dri:/dev/dri"
      # For AMD GPU
      #- "/dev/kfd:/dev/kfd"
+
+   deploy:
+     resources:
+       reservations:
+         devices:
+           - driver: nvidia
+             count: 1
+             capabilities: [gpu]

    volumes:
      # Workspace
      - ./workspace:${WORKSPACE:-/workspace/}:rshared

The base image's docker-compose.yml is perhaps worth looking at: https://github.com/ai-dock/base-image/blob/main/docker-compose.yaml

robballantyne commented 1 month ago

A question I have is why does it work in the cloud but not locally.

Why doesn't it need to have an explicit GPU configuration when in the cloud?

It really depends on the cloud provider. These images are particularly suited to providers where the user is given a single Docker image to run in a preconfigured Docker environment. In that case all you need to set is the envs and ports - everything else is done for you.

The only downside of this, particularly if you subscribe to the 'docker way' of one process per container, is that there's no service orchestration, so lots of services/apps get bundled into the one image, with supervisord managing them rather than Docker directly.
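For illustration only (the program name and command path here are placeholders, not the repo's actual config), the supervisord side of that looks roughly like one stanza per bundled service:

    ; supervisord.conf sketch - placeholder values
    [program:webui]
    command=/opt/ai-dock/bin/start-webui.sh
    autorestart=true
    stdout_logfile=/var/log/supervisor/webui.log
    redirect_stderr=true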

I will hopefully give more attention to local use soon when I am able to test in full virtual machine environments.