scepterus opened 9 months ago
@derneuere I found the issue. CUDA was not seen correctly due to this part:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [all]
Once I removed that and ran nvidia-smi, CUDA showed the version that's installed. However, now I face another issue: the CUDA installed in the Docker image is older than what I have on the host. So I get this:
05:13:14 [Q] ERROR Failed 'api.directory_watcher.face_scan_job' (illinois-asparagus-stream-tennis) - Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 804, reason: forward compatibility was attempted on non supported HW : Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/django_q/worker.py", line 88, in worker
res = f(*task["args"], **task["kwargs"])
File "/code/api/directory_watcher.py", line 411, in face_scan_job
photo._extract_faces()
File "/code/api/models/photo.py", line 729, in _extract_faces
import face_recognition
File "/usr/local/lib/python3.10/dist-packages/face_recognition/__init__.py", line 7, in <module>
from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 804, reason: forward compatibility was attemp
You might want to update the guide and remove that deploy section if this is the case for everyone. I will try to update the CUDA in the container and see what happens.
UPDATE: Just found out this needs to be done in the Dockerfile. So we'll need to figure this out for everyone. Maybe a check to see which CUDA is installed, then populate the version that's being pulled?
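Something along these lines could work as that check (untested sketch; HOST_CUDA is just an illustrative variable name):
# parse the CUDA version the host driver advertises from the nvidia-smi banner
HOST_CUDA=$(nvidia-smi | sed -n 's/.*CUDA Version: \([0-9.]*\).*/\1/p')
echo "host driver supports CUDA ${HOST_CUDA}"   # e.g. 12.2; could be used to pick the image tag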
@derneuere Any chance of getting this fixed? I do not want to go back to CPU if this will be fixed soon, but right now this thing is totally broken.
dlib is compiled against a specific version of CUDA, which in this case is CUDA 11.7.1 with cuDNN 8.
It complains that "forward compatibility" was attempted and failed, which means the host system likely has old drivers. The cause could be either an old graphics card or old drivers.
The graphics card cannot be the reason, as I develop on a system with a 1050 Ti Max-Q, which works fine.
Please update the driver or change the deploy part. On my system I use
resources:
  reservations:
    devices:
      - driver: nvidia
        count: 1
        capabilities: [gpu]
I can't make dlib compatible with multiple CUDA versions: compiling it at runtime would lead to a half-hour startup time, and replacing it with something more flexible is not doable for me at the moment due to time constraints.
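To see which driver version the host is actually running (the number that forward compatibility is judged against), this should work on any host with the NVIDIA tools installed:
nvidia-smi --query-gpu=driver_version --format=csv,noheader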
As you can see in my previous comment, if I add that part to the compose file, CUDA is not detected inside the container. The error I attached occurred when the host machine had CUDA 12 while the container had CUDA 11; it still calls that forward compatibility.
Hmm, I will try to bump everything to CUDA 12. According to the docs, it should be backwards compatible. Let's see if that actually works.
Cool, let me know if I can help.
Alright I pushed a fix. Should be available in half an hour. Let me know if that fixes the issue for you :)
Is this on dev or stable?
Only on dev for now :)
Ah, can I pull just gpu-dev by adding -dev to it in the docker compose file?
Yes, works the same way as the other image :)
Sadly, I've been trying to download that image for 2 days now, it just hangs and times out. I need to restart and hope it fully downloads.
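In case it's useful to anyone with the same problem, a crude retry loop keeps re-pulling until all layers make it through (Docker caches completed layers between attempts):
until docker pull reallibrephotos/librephotos-gpu:dev; do sleep 30; done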
INFO:ownphotos:Can't extract face information on photo: photo
INFO:ownphotos:HTTPConnectionPool(host='localhost', port=8005): Max retries exceeded with url: /face-locations (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f51db061ed0>: Failed to establish a new connection: [Errno 111] Connection refused'))
This is with the latest dev GPU image.
[2023-11-04 16:38:56 +0000] [12097] [INFO] Autorestarting worker after current request.
/usr/local/lib/python3.10/dist-packages/rest_framework/pagination.py:200: UnorderedObjectListWarning: Pagination may yield inconsistent results with an unordered object_list: <class 'api.models.person.Person'> QuerySet.
paginator = self.django_paginator_class(queryset, page_size)
[2023-11-04 16:38:57 +0000] [12097] [INFO] Worker exiting (pid: 12097)
[2023-11-04 18:38:57 +0200] [16597] [INFO] Booting worker with pid: 16597
use SECRET_KEY from file
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
@derneuere any idea how we move past this?
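In case it helps narrow it down, here is a quick check of whether anything is listening on the face service port from inside the container (container name taken from my compose file; the /dev/tcp trick requires bash):
docker exec backend bash -c '(echo > /dev/tcp/localhost/8005) 2>/dev/null && echo listening || echo refused'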
I can't reproduce this, and I am pretty sure this issue is not on my side. Do other GPU-accelerated images work for you?
Currently the only bug I can reproduce is #1056
Last time I tested the CUDA test container it worked; let me verify that now.
==========
== CUDA ==
==========
CUDA Version 12.2.2
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
Thu Nov 9 04:57:47 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1050 Off | 00000000:08:00.0 Off | N/A |
| 0% 47C P0 N/A / 70W | 0MiB / 2048MiB | 2% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Here's the output; it looks like it is working inside Docker.
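For reference, the test was just the stock NVIDIA container (tag assumed; any recent runtime tag should behave the same):
docker run --rm --gpus all nvidia/cuda:12.2.2-runtime-ubuntu20.04 nvidia-smi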
I added those parts back to the docker compose file as the guide says, and this is what I get now:
thumbnail: service starting
Traceback (most recent call last):
File "/code/service/face_recognition/main.py", line 4, in <module>
import face_recognition
File "/usr/local/lib/python3.10/dist-packages/face_recognition/__init__.py", line 7, in <module>
from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 999, reason: unknown error
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
When I connect to the container and do nvidia-smi it outputs correctly:
Thu Nov 9 07:11:58 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1050 Off | 00000000:08:00.0 Off | N/A |
| 0% 47C P0 N/A / 70W | 0MiB / 2048MiB | 2% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
@derneuere after last night's backend update, things changed. When I scanned for new photos, it managed to extract data from them, but I get this error:
INFO:ownphotos:Can't extract face information on photo: /location/photo.png
INFO:ownphotos:HTTPConnectionPool(host='localhost', port=8005): Max retries exceeded with url: /face-locations (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc88e1979d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Also, a few things like these:
[2023-12-01 06:24:35 +0000] [3773] [INFO] Autorestarting worker after current request.
[2023-12-01 06:24:36 +0000] [3773] [INFO] Worker exiting (pid: 3773)
[2023-12-01 08:24:36 +0200] [3785] [INFO] Booting worker with pid: 3785
use SECRET_KEY from file
You'll notice the top two timestamps are in GMT and the last one is in GMT+2. That might cause issues if you compare them and set a timeout based on the difference.
These messages repeat a few times. I hope this helps you narrow down the issues.
Side note:
INFO:ownphotos:Could not handle /location/IMG_20071010_150554_2629.jxl, because unable to call thumbnail
VipsForeignLoad: "/location//IMG_20071010_150554_2629.jxl" is not a known file format
Wasn't JXL fixed?
JXL is handled by thumbnail-service / imagemagick and not by vips. Can you look into the log files for face-service and thumbnail-service and post possible errors here?
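If you're unsure where those land, something like this should surface them (the /logs path is an assumption based on the default compose volume mapping):
docker exec backend ls /logs
docker exec backend tail -n 50 /logs/face_recognition.log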
Regarding JXL: why is vips erroring out if it's not supposed to handle these files? As for the logs, can you be more specific? The only log I found in the logs folder is face_recognition.log, and here's its output:
cat face_recognition.log
Traceback (most recent call last):
File "/code/service/face_recognition/main.py", line 1, in <module>
import face_recognition
File "/usr/local/lib/python3.10/dist-packages/face_recognition/__init__.py", line 7, in <module>
from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 999, reason: unknown error
Here's what I get when loading the latest backend:
/usr/local/lib/python3.10/dist-packages/picklefield/fields.py:78: RuntimeWarning: Pickled model instance's Django version 4.2.7 does not match the current version 4.2.8.
return loads(value)
/usr/local/lib/python3.10/dist-packages/rest_framework/pagination.py:200: UnorderedObjectListWarning: Pagination may yield inconsistent results with an unordered object_list: <class 'api.models.person.Person'> QuerySet.
paginator = self.django_paginator_class(queryset, page_size)
Unauthorized: /api/albums/date/list/
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
Still the same error: the backend can't find the GPU. I think this has something to do with Docker or PyTorch and not with LibrePhotos. Can you look for similar issues and check whether other containers that support GPU acceleration work?
If CUDA shows up correctly in the NVIDIA test container, is that enough to rule out the infrastructure, or is there another test that would prove it definitively?
Forget what I said. I just ran nvidia-smi inside the backend and it works, so the container can reach the GPU. It must be an issue in the software.
Traceback (most recent call last):
File "/code/service/face_recognition/main.py", line 1, in <module>
import face_recognition
File "/usr/local/lib/python3.10/dist-packages/face_recognition/__init__.py", line 7, in <module>
from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 999, reason: unknown error
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
Here's my env:
backend:
  image: reallibrephotos/librephotos-gpu:dev
  container_name: backend
  runtime: nvidia
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities:
              - gpu
Yet, as mentioned, CUDA is seen:
==========
== CUDA ==
==========
CUDA Version 12.1.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
Wed Dec 20 07:19:30 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
I added nvidia-smi to the start of the entrypoint because Bing Copilot suggested running it before any other command; here's the output. I have created this pull request so we can keep that updated going forward: https://github.com/LibrePhotos/librephotos-docker/pull/113. Also, note the deprecation notice; I think we need to stay on the latest CUDA image for this to function properly.
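For reference, the change is roughly this at the top of entrypoint.sh (a sketch of the idea, not the exact merged file):
#!/bin/bash
# print GPU visibility before anything else starts, so failures show up in the container logs
nvidia-smi || echo "nvidia-smi failed: GPU not visible inside the container"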
I think this is the matching PyTorch issue: https://github.com/pytorch/pytorch/issues/49081. I added nvidia-modprobe to the container. Let's see if that works.
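For anyone following along, the workaround discussed in that issue boils down to loading the unified-memory kernel module; the invocation is roughly this (assuming nvidia-modprobe is present in the image):
# load nvidia_uvm and create the /dev/nvidia-uvm device node so CUDA init can succeed
nvidia-modprobe -u -c=0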
Traceback (most recent call last):
File "/code/service/face_recognition/main.py", line 1, in <module>
import face_recognition
File "/usr/local/lib/python3.10/dist-packages/face_recognition/__init__.py", line 7, in <module>
from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 999, reason: unknown error
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
Still this. Are you initializing the CUDA drivers and visible devices before the code runs? It does not look like my pull request was merged, so I can't see the result of nvidia-smi in these logs.
After the modprobe and the pull request merge, still the same issue. I get the CUDA info at the start, but the error still shows up. Don't we need the CUDA drivers as well? And I really think we need the latest CUDA image: my host reports CUDA 12.2, while the one in the container is 12.1 and is deprecated.
It should be backwards compatible, and we need this version because PyTorch ships with the same CUDA version. I also have a 1050 Ti with CUDA 12.2 and driver version 535.129.03, and it works.
The CUDA drivers should be installed on the host system; the Docker image only needs the base image from NVIDIA, which we already use. Can you check whether different drivers are available for your system?
My system:
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1050 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 41C P8 N/A / ERR! | 96MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
My env looks like this:
backend:
  image: reallibrephotos/librephotos-gpu:${tag}
  container_name: backend
  restart: unless-stopped
  volumes:
    - ${scanDirectory}:/data
    - ${data}/protected_media:/protected_media
    - ${data}/logs:/logs
    - ${data}/cache:/root/.cache
  environment:
    - SECRET_KEY=${shhhhKey:-}
    - BACKEND_HOST=backend
    - ADMIN_EMAIL=${adminEmail:-}
    - ADMIN_USERNAME=${userName:-}
    - ADMIN_PASSWORD=${userPass:-}
    - DB_BACKEND=postgresql
    - DB_NAME=${dbName}
    - DB_USER=${dbUser}
    - DB_PASS=${dbPass}
    - DB_HOST=${dbHost}
    - DB_PORT=5432
    - MAPBOX_API_KEY=${mapApiKey:-}
    - WEB_CONCURRENCY=${gunniWorkers:-1}
    - SKIP_PATTERNS=${skipPatterns:-}
    - ALLOW_UPLOAD=${allowUpload:-false}
    - CSRF_TRUSTED_ORIGINS=${csrfTrustedOrigins:-}
    - DEBUG=0
    - HEAVYWEIGHT_PROCESS=${HEAVYWEIGHT_PROCESS:-}
  depends_on:
    db:
      condition: service_healthy
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
I added export CUDA_VISIBLE_DEVICES=0 to entrypoint.sh; maybe that will make a difference.
Here's my output inside the container:
*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
Fri Dec 22 00:31:46 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1050 Off | 00000000:08:00.0 Off | N/A |
| 0% 41C P0 N/A / 70W | 0MiB / 2048MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
I added export CUDA_VISIBLE_DEVICES=0
That would just mean no devices would be registered.
Yours even shows an error in detecting the wattage limit. Is that from inside the container or from the host system?
export CUDA_VISIBLE_DEVICES=0 means that the 0th device will be visible, which in your list is your only GPU. Yeah, it probably has something to do with it being a laptop, but it still works :)
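For illustration, the difference is easy to see on any machine with torch installed: an empty value hides all GPUs, while 0 selects the first one:
CUDA_VISIBLE_DEVICES=0 python3 -c 'import torch; print(torch.cuda.device_count())'   # prints 1 on a single-GPU machine
CUDA_VISIBLE_DEVICES= python3 -c 'import torch; print(torch.cuda.device_count())'    # prints 0, no devices exposed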
that the 0th device will be visible
The naming is a bit confusing then.
but it still works :)
The question is: maybe it's such a unique case that it works when you test it, but a desktop GPU requires something different?
I added it manually to my entrypoint; it did not help. I made a quick script to check the host: https://github.com/LibrePhotos/librephotos-docker/pull/115. My host passes all the checks.
I used your configuration for the GPU. This also works on my machine.
runtime: nvidia
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities:
            - gpu
I also executed your HostCuda script and it passed:
CUDA-capable GPU detected.
x86_64
Linux version is supported.
GCC is installed.
Kernel headers and development packages are installed.
NVIDIA binary GPU driver is installed.
Docker is installed.
Unable to find image 'nvidia/cuda:12.2.2-runtime-ubuntu20.04' locally
12.2.2-runtime-ubuntu20.04: Pulling from nvidia/cuda
96d54c3075c9: Pull complete
db26cf78ae4f: Pull complete
5adc7ab504d3: Pull complete
e4f230263527: Pull complete
95e3f492d47e: Pull complete
35dd1979297e: Pull complete
39a2c88664b3: Pull complete
d8f6b6cd09da: Pull complete
fe19bbed4a4a: Pull complete
Digest: sha256:7df325b76ef5087ac512a6128e366b7043ad8db6388c19f81944a28cd4157368
Status: Downloaded newer image for nvidia/cuda:12.2.2-runtime-ubuntu20.04
NVIDIA Container Toolkit is installed.
Can you try this suggested fix on your host machine? https://github.com/pytorch/pytorch/issues/49081#issuecomment-1385958634
Do you mean this:
sudo modprobe -r nvidia_uvm && sudo modprobe nvidia_uvm
Because that returns: modprobe: FATAL: Module nvidia_uvm not found.
So my host is set up like yours (at least according to the prerequisite tests) and the compose file is the same. What else could be different? nvidia_uvm is not one of the prerequisites in the NVIDIA documentation.
Alright, just execute the second part, sudo modprobe nvidia_uvm; the first part only removes an already loaded nvidia_uvm module.
I am not basing the debug commands on the documentation, as it is usually incomplete, but on the PyTorch issue on GitHub, which usually provides better pointers on how to fix the error.
I just use Kubuntu 22.04; do you use something unusual like Arch?
sudo modprobe nvidia_uvm
modprobe: FATAL: Module nvidia_uvm not found in directory /lib/modules/6.1.55-production+truenas
I just use Kubuntu 22.04; do you use something unusual like Arch?
Nope, Debian bookworm.
This sounds like the GPU drivers are not actually installed correctly, according to this Ask Ubuntu thread: https://askubuntu.com/questions/1413512/syslog-error-modprobe-fatal-module-nvidia-not-found-in-directory-lib-module
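A quick way to confirm on the host whether the kernel module exists at all (standard module paths; TrueNAS may lay things out differently):
lsmod | grep nvidia                                  # modules currently loaded
find /lib/modules/$(uname -r) -name 'nvidia*.ko*'    # module files installed for the running kernel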
🐛 Bug Report
📝 Description of issue:
When scanning in the new GPU docker variant, I get the following errors (log also attached as message.txt):
🔁 How can we reproduce it:
Have a Docker setup with an NVIDIA GPU (in my case a 1050); it does not get recognized in the server stats.
Please provide additional information: