NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

could not select device driver "" with capabilities: [[gpu]]. #1034

Closed xwjBupt closed 5 years ago

xwjBupt commented 5 years ago

1. Issue or feature description

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled

2. Steps to reproduce the issue

docker run --gpus all nvidia/cuda:9.0-base nvidia-smi

3. Information to attach (optional if deemed irrelevant)

Timestamp : Thu Aug 1 16:42:50 2019
Driver Version : 430.34
CUDA Version : 10.1

Attached GPUs : 1
GPU 00000000:01:00.0
    Product Name : GeForce GTX 1080 Ti
    Product Brand : GeForce
    Display Mode : Enabled
    Display Active : Enabled
    Persistence Mode : Disabled
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 0321317150114
    GPU UUID : GPU-de7e8698-487b-d7ec-a77f-1aa89b7f31ef
    Minor Number : 0
    VBIOS Version : 86.02.39.00.01
    MultiGPU Board : No
    Board ID : 0x100
    GPU Part Number : 900-1G611-0050-000
    Inforom Version
        Image Version : G001.0000.01.04
        OEM Object : 1.1
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU Virtualization Mode
        Virtualization mode : None
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x01
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1B0610DE
        Bus Id : 00000000:01:00.0
        Sub System Id : 0x120F10DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 3
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 96000 KB/s
        Rx Throughput : 449000 KB/s
    Fan Speed : 32 %
    Performance State : P0
    Clocks Throttle Reasons
        Idle : Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 11177 MiB
        Used : 151 MiB
        Free : 11026 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 5 MiB
        Free : 251 MiB
    Compute Mode : Default
    Utilization
        Gpu : 3 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Temperature
        GPU Current Temp : 59 C
        GPU Shutdown Temp : 96 C
        GPU Slowdown Temp : 93 C
        GPU Max Operating Temp : N/A
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
    Power Readings
        Power Management : Supported
        Power Draw : 63.66 W
        Power Limit : 250.00 W
        Default Power Limit : 250.00 W
        Enforced Power Limit : 250.00 W
        Min Power Limit : 125.00 W
        Max Power Limit : 300.00 W
    Clocks
        Graphics : 1480 MHz
        SM : 1480 MHz
        Memory : 5508 MHz
        Video : 1252 MHz
    Applications Clocks
        Graphics : N/A
        Memory : N/A
    Default Applications Clocks
        Graphics : N/A
        Memory : N/A
    Max Clocks
        Graphics : 1911 MHz
        SM : 1911 MHz
        Memory : 5505 MHz
        Video : 1620 MHz
    Max Customer Boost Clocks
        Graphics : N/A
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Processes
        Process ID : 14718
            Type : G
            Name : /usr/lib/xorg/Xorg
            Used GPU Memory : 149 MiB

Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                   Version  Architecture  Description
+++-======================-========-=============-============================
un  nvidia-common          <none>   <none>        (no description available)
un  nvidia-libopencl1-dev  <none>   <none>        (no description available)
un  nvidia-prime           <none>   <none>        (no description available)

CONTAINER ID   IMAGE                   COMMAND        CREATED          STATUS    PORTS   NAMES
50d5b4f6d9ff   nvidia/cuda:9.0-base    "nvidia-smi"   7 minutes ago    Created           pedantic_cannon
1afa77ec638d   nvidia/cuda:9.0-base    "nvidia-smi"   8 minutes ago    Created           eager_mahavira
1197d432da4a   nvidia/cuda:9.0-base    "nvidia-smi"   9 minutes ago    Created           hardcore_nobel
49cbf6bed9f5   nvidia/cuda:9.0-base    "nvidia-smi"   11 minutes ago   Created           recursing_moser
67dd0a5eefe1   nvidia/cuda:10.1-base   "nvidia-smi"   15 minutes ago   Created           hungry_wing
fed1e2ab13f8   nvidia/cuda:10.1-base   "nvidia-smi"   18 minutes ago   Created           nice_neumann

I have tried all of those images, but I get the same error every time:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled

What should I do? Thank you so much!

MaierOli2010 commented 5 years ago

I had the same problem. The instructions on this page solved the problem for me.

xwjBupt commented 5 years ago

@MaierOli2010 OK, I will try, thanks!!

RenaudWasTaken commented 5 years ago

Hello!

If you haven't already, make sure you've installed the nvidia-container-toolkit. If this doesn't fix it for you, make sure you've restarted Docker: sudo systemctl restart docker
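
A quick way to check the two points above is to see which of the relevant binaries are on PATH. This is only a sketch: the hook name nvidia-container-runtime-hook is an assumption about what the toolkit package installs and what the Docker daemon looks for, and have_cmd and check_gpu_prereqs are hypothetical helper names.

```shell
#!/bin/sh
# Return success if the named command can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print one "found:"/"missing:" line per binary of interest.
# Assumption: the toolkit provides a binary called
# nvidia-container-runtime-hook; adjust the name if your package differs.
check_gpu_prereqs() {
    for cmd in docker nvidia-smi nvidia-container-runtime-hook; do
        if have_cmd "$cmd"; then
            echo "found:   $cmd"
        else
            echo "missing: $cmd"
        fi
    done
}

check_gpu_prereqs
```

If the hook is reported as missing, installing the toolkit is the likely next step. If it is found and --gpus still fails, restarting the daemon with sudo systemctl restart docker is worth trying, since the daemon appears to detect the hook only when it starts.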

fengyuentau commented 4 years ago

The sentence in the README, 'Note that with the release of Docker 19.03, usage of nvidia-docker2 packages are deprecated since NVIDIA GPUs are now natively supported as devices in the Docker runtime.', is misleading: it made me think everything was ready to go after installing Docker 19.03, but the commands from the Usage section actually fail.

@RenaudWasTaken I think you should consider putting the sentence 'For first-time users of Docker 19.03 and GPUs, continue with the instructions for getting started below.' together in a paragraph with the deprecated note.

ghost commented 4 years ago

ya, just had to restart the docker daemon: sudo systemctl restart docker

martrim commented 4 years ago

Hello, I am having the same problem, despite following the instructions on NVIDIA's Docker page and the advice of @RenaudWasTaken from above. After installing the NVIDIA Container Toolkit with

# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

I tried to run docker run --gpus all nvidia/cuda:10.0-base nvidia-smi, but I get the same error message as xwjBupt above:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0001] error waiting for container: context canceled 

I am using Ubuntu 16.04. I checked that I have got the drivers installed: When running nvidia-smi, I get

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   29C    P8     9W / 250W |      1MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:03:00.0  On |                  N/A |
| 23%   40C    P5    21W / 250W |    552MiB / 12181MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      1841      G   /usr/lib/xorg/Xorg                           339MiB |
|    1      3929      G   compiz                                       158MiB |
|    1      4333      G   ...uest-channel-token=11773295769896974116    53MiB |
+-----------------------------------------------------------------------------+

Also, docker --version gives me Docker version 19.03.8, build afacb8b7f0. Please just let me know if you need more information.
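
One thing that may be worth checking with the install commands above is the distribution string they build from /etc/os-release, since it becomes part of the repository URL. The sketch below prints that string and the resulting list URL without changing anything on the system; distribution_from and nvidia_docker_list_url are hypothetical helper names, not part of any NVIDIA tooling.

```shell
#!/bin/sh
# Print "$ID$VERSION_ID" for the given os-release file.
# The file is sourced in a subshell so its variables do not leak
# into the calling shell.
distribution_from() {
    (
        . "$1" && printf '%s%s\n' "$ID" "$VERSION_ID"
    )
}

# Build the apt list URL the same way the install commands do.
nvidia_docker_list_url() {
    printf 'https://nvidia.github.io/nvidia-docker/%s/nvidia-docker.list\n' "$1"
}

# Only inspect the real file when it exists and is readable.
if [ -r /etc/os-release ]; then
    dist=$(distribution_from /etc/os-release)
    echo "distribution: $dist"
    echo "list URL:     $(nvidia_docker_list_url "$dist")"
fi
```

On Ubuntu 16.04 this should print ubuntu16.04. If the string is something the repository does not publish packages for, the file written to /etc/apt/sources.list.d/nvidia-docker.list may not contain a usable repository entry, and the later apt-get install could silently install nothing new.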

billwhiteley commented 4 years ago

I am having the same issue, with the same version of Docker and the 440.82 driver. If I use --gpus the container launch fails, but without --gpus it works fine.

johndpope commented 4 years ago

I followed the blog post on collabnix.com above, which spelled out this installation, but it was missing the last line to restart Docker. Once I added that, it worked.

sh nvidia-container-runtime-script.sh


curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update
sudo systemctl restart docker

lfernandes commented 4 years ago

Hello!

If you haven't already, make sure you've installed the nvidia-container-toolkit. If this doesn't fix it for you, make sure you've restarted Docker: sudo systemctl restart docker

Updated documentation on how to do this

tashrifbillah commented 4 years ago

Does anyone have a solution for RHEL? All the above are for Ubuntu only!

strarsis commented 3 years ago

From the NVIDIA CUDA/WSL 2 documentation:

Use the Docker installation script to install Docker for your choice of WSL 2 Linux distribution. Note that NVIDIA Container Toolkit does not yet support Docker Desktop WSL 2 backend.
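
To tell whether a shell is running inside WSL at all, a common heuristic is to look for "microsoft" in the kernel version string. The sketch below assumes that WSL kernels report this in /proc/version; is_wsl_version is a hypothetical helper name, and the check says nothing about which Docker backend is in use.

```shell
#!/bin/sh
# Return success if a kernel version string looks like it comes from WSL.
# Assumption: WSL kernels include "microsoft" (in any case) in /proc/version.
is_wsl_version() {
    printf '%s\n' "$1" | grep -qi 'microsoft'
}

if [ -r /proc/version ] && is_wsl_version "$(cat /proc/version)"; then
    echo "running under WSL"
else
    echo "not running under WSL (or /proc/version is unavailable)"
fi
```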

mihajenko commented 3 years ago

I had the same issue and followed all the steps to reinstall Docker. The problem was that I had also installed Docker previously via snap, through the Ubuntu installer. So, issues:

So, removing the snap version goes like this:

sudo snap remove docker --purge

After that, follow the official instructions to reinstall docker, and be sure to remove the docker-ce package, if on Ubuntu.
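
To see whether the docker command on PATH comes from snap before removing anything, a check along these lines may help. It is a sketch under the assumption that snap exposes its binaries under /snap/ (or /var/lib/snapd/snap/ on some distributions); is_snap_path is a hypothetical helper name.

```shell
#!/bin/sh
# Return success if the given path looks like it belongs to a snap install.
# Assumption: snap binaries live under /snap/ or /var/lib/snapd/snap/.
is_snap_path() {
    case "$1" in
        /snap/*|/var/lib/snapd/snap/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Locate the docker binary the shell would actually run, if any.
docker_path=$(command -v docker 2>/dev/null || true)

if [ -z "$docker_path" ]; then
    echo "docker not found on PATH"
elif is_snap_path "$docker_path"; then
    echo "docker comes from snap: $docker_path"
else
    echo "docker is not from snap: $docker_path"
fi
```

If both a snap and a package-manager install are present, the one that appears first on PATH is the one this reports, which may not be the daemon that is actually running.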

sanjaymarison commented 3 years ago

Does anyone know how to rectify it in EC2 instance?

lakshayc-ss commented 3 years ago

This worked for me in EC2

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo reboot

20v100 commented 3 years ago

Ensure you have the proper driver and de-activate the ubuntu default VGA driver: https://gist.github.com/nathzi1505/d2aab27ff93a3a9d82dada1336c45041 and https://www.server-world.info/en/note?os=Ubuntu_20.04&p=nvidia&f=1

homairs commented 3 years ago

I followed the blog post on collabnix.com above, which spelled out this installation, but it was missing the last line to restart Docker. Once I added that, it worked.

sh nvidia-container-runtime-script.sh

curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update
sudo systemctl restart docker

I have the same error <docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]> when I run a Docker container using the --gpus all flag. I have already tried the above solutions and made sure to restart Docker as well as reboot the system. However, I still get the same error as before.

bkocis commented 2 years ago

Install the NVIDIA Container Toolkit. Reference: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update

sudo apt-get install -y nvidia-docker2

sudo systemctl restart docker

SbstnErhrdt commented 2 years ago

did not work in my case ... until rebooted...

Sometimes you have to turn things off and on again to satisfy the IT gods.

Kill3rWhale commented 2 years ago

Hello!

If you haven't already, make sure you've installed the nvidia-container-toolkit. If this doesn't fix it for you, make sure you've restarted Docker: sudo systemctl restart docker

This worked for me - thanks!

fanwz commented 1 year ago

I recently upgraded docker and encountered this issue as well, under the CentOS 7 environment. I tried various methods but they didn't work, and then I reinstalled the nvidia-container-toolkit and it was fixed.

quancs commented 1 year ago

After reinstalling docker (see https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository) and nvidia-docker2 (see https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker), the problem is solved.