LibrePhotos / librephotos

A self-hosted, open-source photo management service. This is the backend repository.
MIT License

GPU variant has issues recognizing the GPU. #1035

Open scepterus opened 9 months ago

scepterus commented 9 months ago

🐛 Bug Report

📝 Description of issue:

When scanning in the new GPU docker variant, I get the following errors:

/usr/local/lib/python3.10/dist-packages/rest_framework/pagination.py:200: UnorderedObjectListWarning: Pagination may yield inconsistent results with an unordered object_list: <class 'api.models.person.Person'> QuerySet.
  paginator = self.django_paginator_class(queryset, page_size)
  return torch._C._cuda_getDeviceCount() > 0
  File "/usr/local/lib/python3.10/dist-packages/django_q/worker.py", line 88, in worker
    res = f(task["args"], **task["kwargs"])
  File "/code/api/directory_watcher.py", line 411, in face_scan_job
    photo._extract_faces()
  File "/code/api/models/photo.py", line 729, in _extract_faces
    import face_recognition
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/init.py", line 7, in <module>
    from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
    cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)

The full log is also attached: message.txt

🔁 How can we reproduce it:

Run the Docker container on a host with an NVIDIA GPU (a GTX 1050 in my case); the GPU is not recognized in the server stats.


scepterus commented 9 months ago

@derneuere I found the issue. CUDA was not detected correctly because of this part of the compose file:

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [all]

Once I removed that and ran nvidia-smi, CUDA showed the installed version. However, now I face another issue: the CUDA version installed in the Docker image is older than what I have on the host. So I get this:

05:13:14 [Q] ERROR Failed 'api.directory_watcher.face_scan_job' (illinois-asparagus-stream-tennis) - Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 804, reason: forward compatibility was attempted on non supported HW : Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/django_q/worker.py", line 88, in worker
    res = f(*task["args"], **task["kwargs"])
  File "/code/api/directory_watcher.py", line 411, in face_scan_job
    photo._extract_faces()
  File "/code/api/models/photo.py", line 729, in _extract_faces
    import face_recognition
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/__init__.py", line 7, in <module>
    from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
    cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 804, reason: forward compatibility was attemp

You might want to update the guide and remove that deploy section if this is the case for everyone. I will try to update CUDA in the container and see what happens.
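
For anyone hitting the same mismatch, here is a quick way to compare the two sides (a rough sketch; it assumes the backend container is named backend, as in the compose file further down this thread):

# On the host: the "CUDA Version" reported by nvidia-smi is the newest CUDA the driver supports
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Inside the container: the CUDA version the Python stack was built against
docker exec backend python3 -c "import torch; print(torch.version.cuda)"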

scepterus commented 9 months ago

UPDATE: I just found out this needs to be done in the Dockerfile, so we'll need to figure it out for everyone. Maybe add a check for which CUDA version is installed on the host, then populate the version that gets pulled?

scepterus commented 8 months ago

@derneuere Any chance of getting this fixed? I don't want to go back to CPU if a fix is coming soon, but right now this is completely broken.

derneuere commented 8 months ago

dlib is compiled against a specific version of CUDA, in this case CUDA 11.7.1 with cuDNN 8.

It complains that "forward compatibility" was attempted and failed, which usually means the host system has old drivers. So the cause is either an old graphics card or old drivers.

The graphics card can't be the reason, as I develop on a system with a 1050 Ti Max-Q, which works fine.

Please update the driver or change the deploy section. On my system I use:

      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

I can't make dlib compatible with multiple CUDA versions: compiling it at runtime would lead to a half-hour startup time, and replacing it with something more flexible is not doable for me at the moment due to time constraints.
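
If it helps narrow things down, here is a minimal check of whether dlib itself was built with CUDA and can see a device (a sketch; it assumes the container name backend from the compose file):

docker exec backend python3 -c "import dlib; print(dlib.DLIB_USE_CUDA, dlib.cuda.get_num_devices())"

If DLIB_USE_CUDA prints False, the wheel was built without CUDA; if it prints True but get_num_devices() raises the same cudaGetDevice error, the problem is the CUDA runtime/driver pairing rather than the dlib build.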

scepterus commented 8 months ago

As you can see in my previous comment, if I add that part to the compose file, CUDA is not detected inside the container. The error I attached occurred when the host machine had CUDA 12 while the container had CUDA 11. It still calls it forward compatibility.

derneuere commented 8 months ago

Hmm, I will try to bump everything to CUDA 12. According to the docs, it should be backwards compatible. Let's see if that actually works.

scepterus commented 8 months ago

Cool, let me know if I can help.

derneuere commented 8 months ago

Alright I pushed a fix. Should be available in half an hour. Let me know if that fixes the issue for you :)

scepterus commented 8 months ago

Is this on dev or stable?

derneuere commented 8 months ago

Only on dev for now :)

scepterus commented 8 months ago

Ah, can I pull just gpu-dev by adding -dev to it in the docker compose file?

derneuere commented 8 months ago

Yes, works the same way as the other image :)
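
For reference, roughly what that looks like (a sketch; it assumes the image name used later in this thread):

docker pull reallibrephotos/librephotos-gpu:dev
docker compose up -d backend   # recreate the backend after switching the tag to :dev in docker-compose.yml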

scepterus commented 8 months ago

Sadly, I've been trying to download that image for two days now; it just hangs and times out. I have to restart and hope it fully downloads.

scepterus commented 8 months ago

INFO:ownphotos:Can't extract face information on photo: photo
INFO:ownphotos:HTTPConnectionPool(host='localhost', port=8005): Max retries exceeded with url: /face-locations (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f51db061ed0>: Failed to establish a new connection: [Errno 111] Connection refused'))

with the latest dev GPU image.

[2023-11-04 16:38:56 +0000] [12097] [INFO] Autorestarting worker after current request.
/usr/local/lib/python3.10/dist-packages/rest_framework/pagination.py:200: UnorderedObjectListWarning: Pagination may yield inconsistent results with an unordered object_list: <class 'api.models.person.Person'> QuerySet.
  paginator = self.django_paginator_class(queryset, page_size)
[2023-11-04 16:38:57 +0000] [12097] [INFO] Worker exiting (pid: 12097)
[2023-11-04 18:38:57 +0200] [16597] [INFO] Booting worker with pid: 16597
use SECRET_KEY from file
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0

scepterus commented 7 months ago

@derneuere any idea how we move past this?

derneuere commented 7 months ago

I can't reproduce this, and I am pretty sure this issue is not on my side. Do other GPU-accelerated images work for you?

Currently the only bug I can reproduce is #1056
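
The quickest smoke test for the GPU pass-through itself is the plain NVIDIA CUDA image (a sketch; it uses the same image the host-check script later in this thread pulls):

docker run --rm --gpus all nvidia/cuda:12.2.2-runtime-ubuntu20.04 nvidia-smi

If that prints the GPU table, the NVIDIA Container Toolkit side is working and the problem is inside the LibrePhotos image or its Python stack.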

scepterus commented 7 months ago

Last time I tested the CUDA test container it worked; let me verify that now.

scepterus commented 7 months ago

==========
== CUDA ==
==========
CUDA Version 12.2.2
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
Thu Nov  9 04:57:47 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050        Off | 00000000:08:00.0 Off |                  N/A |
|  0%   47C    P0              N/A /  70W |      0MiB /  2048MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Here's the output; it looks like it is working inside the Docker container.

scepterus commented 7 months ago

I added the GPU parts back to the docker compose file as described in the guide, and this is what I get now:

thumbnail: service starting
Traceback (most recent call last):
  File "/code/service/face_recognition/main.py", line 4, in <module>
    import face_recognition
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/__init__.py", line 7, in <module>
    from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
    cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 999, reason: unknown error
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)

When I connect to the container and do nvidia-smi it outputs correctly:

Thu Nov  9 07:11:58 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050        Off | 00000000:08:00.0 Off |                  N/A |
|  0%   47C    P0              N/A /  70W |      0MiB /  2048MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

scepterus commented 7 months ago

@derneuere After the update to the backend last night, things changed. When I scanned for new photos, it managed to extract information from them, but I get this error:

INFO:ownphotos:Can't extract face information on photo: /location/photo.png
INFO:ownphotos:HTTPConnectionPool(host='localhost', port=8005): Max retries exceeded with url: /face-locations (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc88e1979d0>: Failed to establish a new connection: [Errno 111] Connection refused'))

scepterus commented 7 months ago

Also, a few things like these:

[2023-12-01 06:24:35 +0000] [3773] [INFO] Autorestarting worker after current request.
[2023-12-01 06:24:36 +0000] [3773] [INFO] Worker exiting (pid: 3773)
[2023-12-01 08:24:36 +0200] [3785] [INFO] Booting worker with pid: 3785
use SECRET_KEY from file

You'll notice the top two timestamps are in GMT and the last one is in GMT+2. That might cause issues if you compare them and set a timeout based on the difference.

These messages repeat a few times. I hope this helps you narrow down the issues.

Side note:

INFO:ownphotos:Could not handle /location/IMG_20071010_150554_2629.jxl, because unable to call thumbnail
  VipsForeignLoad: "/location//IMG_20071010_150554_2629.jxl" is not a known file format

Wasn't JXL support fixed?

derneuere commented 7 months ago

JXL is handled by thumbnail-service / imagemagick and not by vips. Can you look into the log files for face-service and thumbnail-service and post possible errors here?
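
For example, something like this (a sketch; it assumes curl exists in the container and that the logs land under the /logs mount from the compose file):

# from inside the backend container
ls /logs                                       # which service logs exist at all
cat /logs/face_recognition.log                 # the face service's own startup errors
curl -v http://localhost:8005/face-locations   # is anything answering on the face service port?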

scepterus commented 7 months ago

Regarding the JXL: why is it erroring out if it's not supposed to handle these files? As for the logs, can you be more specific? I found only face_recognition.log in the logs folder, and here's its output:

cat face_recognition.log 
Traceback (most recent call last):
  File "/code/service/face_recognition/main.py", line 1, in <module>
    import face_recognition
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/__init__.py", line 7, in <module>
    from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
    cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 999, reason: unknown error

scepterus commented 6 months ago

Here's what I get when loading the latest backend:

/usr/local/lib/python3.10/dist-packages/picklefield/fields.py:78: RuntimeWarning: Pickled model instance's Django version 4.2.7 does not match the current version 4.2.8.
  return loads(value)
/usr/local/lib/python3.10/dist-packages/rest_framework/pagination.py:200: UnorderedObjectListWarning: Pagination may yield inconsistent results with an unordered object_list: <class 'api.models.person.Person'> QuerySet.
  paginator = self.django_paginator_class(queryset, page_size)

Unauthorized: /api/albums/date/list/

/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0

derneuere commented 6 months ago

Still the same error: the backend can't find the GPU. I think this has something to do with Docker or PyTorch and not with LibrePhotos. Can you look for similar issues and check whether other containers that support GPU acceleration work?
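
To separate "Docker does not pass the GPU through" from "PyTorch cannot initialise CUDA", something like this helps (a sketch; it assumes the container is named backend as in your compose file):

docker exec backend nvidia-smi     # is the driver/device visible in the container at all?
docker exec backend python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"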

scepterus commented 6 months ago

If CUDA shows up correctly in the test container from NVIDIA, is that enough to rule out the infrastructure? Or is there another test that would definitively prove this?

scepterus commented 6 months ago

Forget what I said. I just ran nvidia-smi inside the backend container, and it works, so the container can reach the GPU. It must be an issue in the software.
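
One more check that might narrow it down: whether the /dev/nvidia-uvm device node actually made it into the container (a rough sketch):

docker exec backend ls -l /dev/nvidia*
# expected: nvidia0, nvidiactl, nvidia-uvm (and nvidia-uvm-tools);
# nvidia-smi does not need nvidia-uvm, but the CUDA runtime does,
# so a missing uvm node matches this "code: 999, unknown error" pattern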

scepterus commented 6 months ago

Traceback (most recent call last):
  File "/code/service/face_recognition/main.py", line 1, in <module>
    import face_recognition
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/__init__.py", line 7, in <module>
    from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
    cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 999, reason: unknown error
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0

Here's my env:

  backend:
    image: reallibrephotos/librephotos-gpu:dev
    container_name: backend
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu

Yet, as mentioned, CUDA is seen.

scepterus commented 6 months ago

==========
== CUDA ==
==========
CUDA Version 12.1.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
    https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
Wed Dec 20 07:19:30 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |

I added nvidia-smi to the start of the entrypoint, because Bing Copilot suggested running nvidia-smi before any other command; here's the output. I have created this pull request so we can update that going forward: https://github.com/LibrePhotos/librephotos-docker/pull/113. Also, note the deprecation notice; I think we need to stay on the latest CUDA image for it to function properly.

derneuere commented 6 months ago

I think this is the corresponding issue in PyTorch: https://github.com/pytorch/pytorch/issues/49081. I added nvidia modprobe to the container. Let's see if that works.
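
For reference, the usual shape of that workaround (a rough sketch, not necessarily what the image does): make sure the nvidia_uvm module is loaded and its device node exists before anything initialises CUDA.

# on the host (or early in the entrypoint, if nvidia-modprobe is installed in the image)
sudo nvidia-modprobe -u -c=0   # load nvidia_uvm and create /dev/nvidia-uvm and /dev/nvidia0
ls -l /dev/nvidia-uvm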

scepterus commented 6 months ago

Traceback (most recent call last):
  File "/code/service/face_recognition/main.py", line 1, in <module>
    import face_recognition
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/__init__.py", line 7, in <module>
    from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
    cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 999, reason: unknown error
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0

Still this. Are you initializing the CUDA drivers and visible devices before the code runs? It doesn't look like my pull request was merged yet, so I can't see the result of nvidia-smi in these logs.

scepterus commented 6 months ago

After the modprobe change and the pull request merge, still the same issue. I get the CUDA info at the start, but the error still shows up. Don't we need the CUDA drivers as well? And I really think we need the latest CUDA image, because my host reports CUDA 12.2 while the one in the container is 12.1 and is deprecated.

derneuere commented 6 months ago

It should be backwards compatible, and we need this version because PyTorch is built against the same one. I also have a 1050 Ti with CUDA 12.2 and driver version 535.129.03, and it works.

The CUDA drivers should be installed on the host system. The Docker image needs the base image from NVIDIA, which we already use. Can you check whether different drivers are available for your system?

My system:

| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050 ...    Off | 00000000:01:00.0 Off |                  N/A |
| N/A   41C    P8              N/A / ERR! |     96MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

My env looks like this:

  backend:
    image: reallibrephotos/librephotos-gpu:${tag}
    container_name: backend
    restart: unless-stopped
    volumes:
      - ${scanDirectory}:/data
      - ${data}/protected_media:/protected_media
      - ${data}/logs:/logs
      - ${data}/cache:/root/.cache
    environment:
      - SECRET_KEY=${shhhhKey:-}
      - BACKEND_HOST=backend
      - ADMIN_EMAIL=${adminEmail:-}
      - ADMIN_USERNAME=${userName:-}
      - ADMIN_PASSWORD=${userPass:-}
      - DB_BACKEND=postgresql
      - DB_NAME=${dbName}
      - DB_USER=${dbUser}
      - DB_PASS=${dbPass}
      - DB_HOST=${dbHost}
      - DB_PORT=5432
      - MAPBOX_API_KEY=${mapApiKey:-}
      - WEB_CONCURRENCY=${gunniWorkers:-1}
      - SKIP_PATTERNS=${skipPatterns:-}
      - ALLOW_UPLOAD=${allowUpload:-false}
      - CSRF_TRUSTED_ORIGINS=${csrfTrustedOrigins:-}
      - DEBUG=0
      - HEAVYWEIGHT_PROCESS=${HEAVYWEIGHT_PROCESS:-}
    depends_on:
      db:
        condition: service_healthy
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

I added export CUDA_VISIBLE_DEVICES=0 to entrypoint.sh; maybe that will make a difference.

scepterus commented 6 months ago

Here's my output inside the container:

*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
    https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
Fri Dec 22 00:31:46 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050        Off | 00000000:08:00.0 Off |                  N/A |
|  0%   41C    P0              N/A /  70W |      0MiB /  2048MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I added export CUDA_VISIBLE_DEVICES=0

That would just mean no devices would be registered.

scepterus commented 6 months ago

Yours even has an error in detecting the watt limit. Is that from inside the container or from the host system?

derneuere commented 6 months ago

export CUDA_VISIBLE_DEVICES=0 means that the 0th device will be visible, which in your list is your only GPU. Yeah, the watt error probably has something to do with it being a laptop, but it still works :)
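
A tiny demonstration of the naming, in case it helps (a sketch; run it anywhere PyTorch is installed):

CUDA_VISIBLE_DEVICES=0 python3 -c "import torch; print(torch.cuda.device_count())"   # 1: only device index 0 is exposed
CUDA_VISIBLE_DEVICES="" python3 -c "import torch; print(torch.cuda.device_count())"  # 0: an empty value hides every GPU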

scepterus commented 6 months ago

that the 0th devices will be visible

The naming is a bit confusing then.

but it still works :)

The question is: maybe it's such a unique case that it works when you test, but a desktop card requires something different?

scepterus commented 6 months ago

I added it manually to my entrypoint; it did not help. I made a quick script to check the host: https://github.com/LibrePhotos/librephotos-docker/pull/115. My host passes all the checks.

derneuere commented 6 months ago

I used your configuration for the GPU. This also works on my machine.

runtime: nvidia
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities:
            - gpu

I also executed your HostCuda script and it passed:

CUDA-capable GPU detected.
x86_64
Linux version is supported.
GCC is installed.
Kernel headers and development packages are installed.
NVIDIA binary GPU driver is installed.
Docker is installed.
Unable to find image 'nvidia/cuda:12.2.2-runtime-ubuntu20.04' locally
12.2.2-runtime-ubuntu20.04: Pulling from nvidia/cuda
96d54c3075c9: Pull complete 
db26cf78ae4f: Pull complete 
5adc7ab504d3: Pull complete 
e4f230263527: Pull complete 
95e3f492d47e: Pull complete 
35dd1979297e: Pull complete 
39a2c88664b3: Pull complete 
d8f6b6cd09da: Pull complete 
fe19bbed4a4a: Pull complete 
Digest: sha256:7df325b76ef5087ac512a6128e366b7043ad8db6388c19f81944a28cd4157368
Status: Downloaded newer image for nvidia/cuda:12.2.2-runtime-ubuntu20.04
NVIDIA Container Toolkit is installed.

Can you try this suggested fix on your host machine? https://github.com/pytorch/pytorch/issues/49081#issuecomment-1385958634

scepterus commented 6 months ago

Do you mean this:

sudo modprobe -r nvidia_uvm && sudo modprobe nvidia_uvm

Because that returns: modprobe: FATAL: Module nvidia_uvm not found.

So my host is set up like yours (at least according to the prerequisite tests) and the compose file is the same. What else could be different? nvidia_uvm is not listed as one of the prerequisites in the NVIDIA documentation.

derneuere commented 6 months ago

Alright, just execute the second part, sudo modprobe nvidia_uvm; the first part only removes an already loaded nvidia_uvm module.

I am not basing the debug commands on the documentation, as it is usually incomplete, but on the PyTorch issue on GitHub, which usually provides better pointers on how to fix the error.

I just use Kubuntu 22.04; do you use something unusual like Arch?

scepterus commented 6 months ago

sudo modprobe nvidia_uvm
modprobe: FATAL: Module nvidia_uvm not found in directory /lib/modules/6.1.55-production+truenas

I just use kubuntu 22.04, do you use something unique like arch?

Nope, Debian bookworm.

derneuere commented 6 months ago

This sounds like the GPU drivers are not actually installed correctly, according to this Ask Ubuntu thread: https://askubuntu.com/questions/1413512/syslog-error-modprobe-fatal-module-nvidia-not-found-in-directory-lib-module
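
A rough way to confirm that on the host (a sketch; the kernel string above looks like a TrueNAS SCALE build, where the NVIDIA modules normally ship with the appliance rather than via apt):

uname -r                      # the kernel the modules must be built for
lsmod | grep -i nvidia        # which NVIDIA modules are currently loaded
modinfo nvidia nvidia_uvm     # "not found" here means the modules are missing for this kernel
ls /lib/modules/$(uname -r)/  # the directory modprobe searched in the error above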