CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE

PTRFRLL / nv-docker-trex

Mine crypto using your Unraid server


CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE #41

Closed: ksnell88 closed this issue 2 years ago

ksnell88 commented 2 years ago

Is there a configuration change required for v3.5 and later? I received the error below with the latest version on Dockerhub. I built 3.3 from source here on GitHub and had no errors, so it seems to be related to the new image.

OS: Ubuntu Server 20.04 LTS
NVIDIA Driver: 460.91.03
CUDA Version: 11.2

Happy to provide any other information as needed.

ERROR: Can't start T-Rex, can't initialize CUDA engine, cuda exception: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE. Is NVIDIA driver installed?


PTRFRLL commented 2 years ago

T-Rex changed something starting in 0.24.5 that introduced some new issues with NVIDIA/CUDA. Can you try running nvidia-smi from within the container?

ksnell88 commented 2 years ago

Docker Compose can't even keep the container up long enough to try. After running docker-compose up -d, I got the output below when trying to get inside the container. It seems to be stuck in a restart loop.

ks@docker:~/docker/trex$ docker exec -it trex bash
Error response from daemon: Container 347af0c238f77c46ff9c64e9f2b14e33e72463d7116484666004074e57b86e83 is restarting, wait until the container is running
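When a container is stuck restarting, docker exec cannot attach to it, but the logs are still readable and the GPU check can be run in a separate throwaway container. The following is a minimal sketch, not part of the project: the function names and the DOCKER override are made up for illustration, and it assumes the NVIDIA container runtime is installed so that nvidia-smi is available inside the container.

```shell
# Sketch only. DOCKER can be overridden (e.g. DOCKER=echo) for a dry run
# that prints the arguments instead of calling docker.

# Print the last 50 log lines of a container, even while it is restarting.
# $1 = container name (the thread uses "trex").
show_recent_logs() {
  "${DOCKER:-docker}" logs --tail 50 "$1"
}

# Start a throwaway container from the same image, replacing the entrypoint
# with nvidia-smi so the miner never launches and cannot crash-loop.
# $1 = image reference (placeholder, substitute the image you are running).
gpu_check() {
  "${DOCKER:-docker}" run --rm --runtime=nvidia --entrypoint nvidia-smi "$1"
}
```

For example, `show_recent_logs trex` should show the same CUDA error T-Rex printed before exiting, and `gpu_check <your-image>` reports whether the driver is visible from inside the image at all.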

PTRFRLL commented 2 years ago

Hmm. Do you have the --runtime=nvidia flag set?

If you're building from source, you could try changing the base image to the CUDA version that matches yours: FROM nvidia/cuda:11.2.1-base-ubuntu18.04
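For anyone trying this, a rough sketch of swapping the base image before a local build is shown here. It assumes a single-stage Dockerfile whose first FROM line is the CUDA base image. The helper name set_base_image is hypothetical, and the build tag in the usage note is only an example.

```shell
# Sketch only: rewrite the first FROM line of a Dockerfile in place.
# $1 = path to the Dockerfile, $2 = new base image reference.
# Assumption: single-stage build; an "AS <name>" suffix would be dropped.
set_base_image() {
  sbi_tmp="$(mktemp)" || return 1
  awk -v img="$2" '
    !done && /^FROM[ \t]/ { print "FROM " img; done = 1; next }
    { print }
  ' "$1" > "$sbi_tmp" && mv "$sbi_tmp" "$1"
}
```

Usage would look something like `set_base_image Dockerfile nvidia/cuda:11.2.1-base-ubuntu18.04` followed by `docker build -t trex-local .`, picking the tag that corresponds to the CUDA version nvidia-smi reports on the host.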

ksnell88 commented 2 years ago

It seems like this might be a driver/CUDA version issue combined with the addition of libnvidia-ml-dev to the Dockerfile for issue #31 (Can't load NVML library).

I tried building with many CUDA versions with no success.
If I removed libnvidia-ml-dev from the Dockerfile (changing wget libnvidia-ml-dev \ to wget \), it would build and run successfully, but then of course I got the error from issue #31.
I then updated the driver to v470.82.01 with CUDA 11.4 (v495 with CUDA 11.5 was not recognizing my 3070 on Ubuntu Server for some reason). I still had the same issue with the latest image, but building from source with FROM nvidia/cuda:11.4.1-base-ubuntu20.04 seems to have fixed it.
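As a quick sanity check before building, the CUDA version in the base image tag can be compared against the version the host driver reports. This is a heuristic sketch based only on what worked here (11.4 driver with an 11.4.1 base image). It is not a guarantee, and the function names are hypothetical.

```shell
# Sketch only: compare major.minor CUDA versions.

# Extract "major.minor" from a tag such as nvidia/cuda:11.4.1-base-ubuntu20.04.
image_cuda_version() {
  printf '%s\n' "$1" | sed -n 's|^.*:\([0-9][0-9]*\.[0-9][0-9]*\).*$|\1|p'
}

# Succeed when the host CUDA version matches the image's major.minor.
# $1 = host CUDA version (e.g. 11.4), $2 = image reference.
cuda_matches_image() {
  cmi_img="$(image_cuda_version "$2")"
  cmi_host="$(printf '%s\n' "$1" | sed -n 's|^\([0-9][0-9]*\.[0-9][0-9]*\).*$|\1|p')"
  [ -n "$cmi_img" ] && [ "$cmi_img" = "$cmi_host" ]
}

# The host value can usually be read from the nvidia-smi header, assuming it
# contains a "CUDA Version: X.Y" field:
# host_cuda="$(nvidia-smi | sed -n 's/.*CUDA Version: *\([0-9.]*\).*/\1/p')"
```

With the versions from this thread, `cuda_matches_image 11.4 nvidia/cuda:11.4.1-base-ubuntu20.04` succeeds, while passing 11.2 as the host version for the same image fails.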

PTRFRLL commented 2 years ago

Can you try this tag and see if it fixes it?

ghcr.io/ptrfrll/nv-docker-trex:test

ksnell88 commented 2 years ago

That one worked. What was the change?

PTRFRLL commented 2 years ago

I updated the base image to CUDA 11.4 like you did in yours. Glad it's working now. I pushed the changes to the latest tag.