dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
MIT License
2.13k stars 440 forks

Error response from daemon: could not select device driver "" with capabilities: [[gpu]]. #318

Open tanu-04 opened 10 months ago

tanu-04 commented 10 months ago

ARCH: x86_64
[sudo] password for editha:
CONTAINER_IMAGE: dustynv/jetson-inference:22.06
DATA_VOLUME:
--volume /home/editha/jetson_inference/jetson-inference/data:/jetson-inference/data
--volume /home/editha/jetson_inference/jetson-inference/python/training/classification/data:/jetson-inference/python/training/classification/data
--volume /home/editha/jetson_inference/jetson-inference/python/training/classification/models:/jetson-inference/python/training/classification/models
--volume /home/editha/jetson_inference/jetson-inference/python/training/detection/ssd/data:/jetson-inference/python/training/detection/ssd/data
--volume /home/editha/jetson_inference/jetson-inference/python/training/detection/ssd/models:/jetson-inference/python/training/detection/ssd/models
--volume /home/editha/jetson_inference/jetson-inference/python/www/recognizer/data:/jetson-inference/python/www/recognizer/data
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

I am trying to train a jetson-inference classification model on a DGX Station, but I got this error. How can I resolve it? My Docker version is 19.03.13. I cloned the repo, ran cd jetson-inference and then docker/run.sh, and that is when the error appeared.

dusty-nv commented 10 months ago

Hi @tanu-04 have you been able to run any other GPU containers on your docker install on your DGX Station and have the GPU(s) working?
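A quick way to answer that question is a generic GPU sanity check (this is a sketch, not from the thread; the CUDA image tag is just an example):

```shell
# Check whether Docker has the nvidia runtime registered
docker info | grep -i runtime

# Try a minimal GPU container; the CUDA image tag is an example, any
# GPU-enabled image that ships nvidia-smi will do
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```

If the second command fails with the same "could not select device driver" message, the NVIDIA Container Toolkit is likely missing or not registered with the Docker daemon.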

What I would do is just use a more recent NGC PyTorch container image and mount these repos into it:

https://github.com/dusty-nv/pytorch-classification https://github.com/dusty-nv/pytorch-ssd

You don't need the entire jetson-inference repo to do the training, those PyTorch training scripts are in submodules.
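A rough sketch of that suggestion (the NGC image tag, mount paths, and the invocation of the repo's train.py are assumptions to adapt to your setup):

```shell
# Clone just the training submodule and mount it into an NGC PyTorch container
git clone --recursive https://github.com/dusty-nv/pytorch-classification

docker run --rm -it --gpus all \
    -v $(pwd)/pytorch-classification:/workspace/pytorch-classification \
    nvcr.io/nvidia/pytorch:23.10-py3 \
    python3 /workspace/pytorch-classification/train.py --help
```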

tanu-04 commented 10 months ago

Thank you @dusty-nv, I am able to train the model now without that issue. In fact it worked without even using a Docker image. But I still have the issue with every Docker container: I can use the GPU for training other models directly on the host, but Docker containers specifically are not able to access the GPU.

dusty-nv commented 10 months ago

But specifically Docker container is not able to access the GPU .

If you are meaning on x86, use a more recent NGC pytorch container that supports your GPU: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch

shaiksuhel1999 commented 9 months ago

Hi, in my case I'm getting the same error when I try to run a Docker container using the command below:

docker run --gpus all image-id

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Basically, I created the VM using the default Amazon AMI, which is verified by Amazon.

These are AMI details

AMI name: amzn2-ami-ecs-gpu-hvm-2.0.20231103-x86_64-ebs (GPU, Kernel 4.14)
ECS Agent version: 1.79.0
Docker version: 20.10.25
Containerd version: 1.6.19
NVIDIA driver version: 535.54.03
CUDA version: 12.2.0
Source AMI name: amzn2-ami-minimal-hvm-2.0.20230926.0-x86_64-ebs

I'm using the commands below, taken from the AWS documentation, to remove the old NVIDIA driver (535.54.03) and install the newer version (535.129.03):

sudo yum remove nvidia
sudo yum remove cuda
sudo yum erase nvidia cuda
sudo yum update -y
sudo amazon-linux-extras install kernel-5.15
sudo yum install gcc make && sudo yum update -y
sudo reboot
sudo yum install -y gcc kernel-devel-$(uname -r)
chmod +x NVIDIA-Linux-x86_64.run
sudo CC=/usr/bin/gcc10-cc ./NVIDIA-Linux-x86_64.run
sudo touch /etc/modprobe.d/nvidia.conf
echo "options nvidia NVreg_EnableGpuFirmware=0" | sudo tee --append /etc/modprobe.d/nvidia.conf
sudo reboot

After running the commands above, I was able to upgrade the NVIDIA driver to 535.129.03 and the kernel to 5.15, but when I run a Docker container I still hit the error mentioned above.

Any Suggestions?
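In both reports the error means the Docker daemon has no "nvidia" device driver registered, which is provided by the NVIDIA Container Toolkit rather than the GPU driver itself (so a driver reinstall does not fix it, and can leave a previously working toolkit unconfigured). A hedged sketch of the usual fix, following NVIDIA's published install steps for yum-based distros (not confirmed by anyone in this thread):

```shell
# Add NVIDIA's package repo and install the container toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
    sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum install -y nvidia-container-toolkit

# Register the nvidia runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify (the CUDA image tag is just an example)
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```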