NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: signal: segmentation fault (core dumped), stdout: , stderr: \\\"\"": unknown #171

Closed wxitzxg closed 2 months ago

wxitzxg commented 4 years ago

[root@k8s-node1 docker]# nvidia-docker version
NVIDIA Docker: 2.3.0
Client: Docker Engine - Community
 Version:           19.03.9
 API version:       1.39
 Go version:        go1.13.10
 Git commit:        9d988398e7
 Built:             Fri May 15 00:25:27 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.6
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       481bc77
  Built:            Sat May 4 02:02:43 2019
  OS/Arch:          linux/amd64
  Experimental:     false

klueska commented 4 years ago

Can you tell me what output you get from running the following commands?

$ nvidia-container-runtime-hook
$ nvidia-container-toolkit

klueska commented 4 years ago

I'm also curious what OS you are on (e.g. centos8, ubuntu18.04, etc.). I'm trying to determine whether this is related to https://github.com/NVIDIA/nvidia-docker/issues/1280#issuecomment-630754999 or is something different.
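
The details requested above (hook binaries and OS) can be collected in one pass. The following is a minimal sketch, assuming a Linux host that provides /etc/os-release; the function name collect_hook_info is made up for illustration and is not part of any NVIDIA tool.

```shell
#!/bin/sh
# Hypothetical helper: report where the two hook binaries live and which OS this is.
collect_hook_info() {
    for bin in nvidia-container-runtime-hook nvidia-container-toolkit; do
        if path=$(command -v "$bin" 2>/dev/null); then
            echo "$bin: $path"
        else
            echo "$bin: not found in PATH"
        fi
    done
    # Assumption: /etc/os-release exists, as it does on most current distributions.
    if [ -r /etc/os-release ]; then
        # Source it in a subshell so its variables do not leak into the caller.
        ( . /etc/os-release; echo "os: ${ID:-unknown} ${VERSION_ID:-unknown}" )
    else
        echo "os: unknown (no /etc/os-release)"
    fi
}

collect_hook_info
```

The output is three lines that can be pasted straight into a reply: one per binary, then the OS identifier and version.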

klueska commented 4 years ago

New packages have been published that should resolve this issue. Please run one of the following depending on your platform:

sudo apt-get install nvidia-container-toolkit
sudo yum install nvidia-container-toolkit

If you originally installed nvidia-docker2 and not nvidia-container-toolkit, you should still run the commands above in order to update nvidia-docker2 properly (it depends on nvidia-container-toolkit, which will now be upgraded).
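
Choosing between the two commands above can be scripted. This is only a sketch: the function name upgrade_cmd is hypothetical, and it prints the command instead of running it so nothing is installed by accident. The apt-get update step is an added assumption (apt-get install resolves against the locally cached package index), not something stated in the comment above.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the upgrade command for this platform.
upgrade_cmd() {
    if command -v apt-get >/dev/null 2>&1; then
        # Assumption: refresh the package index first so the new packages are visible.
        echo "sudo apt-get update && sudo apt-get install nvidia-container-toolkit"
    elif command -v yum >/dev/null 2>&1; then
        echo "sudo yum install nvidia-container-toolkit"
    else
        echo "no supported package manager found" >&2
        return 1
    fi
}

upgrade_cmd || echo "install nvidia-container-toolkit manually"
```

Review the printed command before running it; the Docker daemon may also need a restart afterwards, depending on the setup.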

klueska commented 4 years ago

Please confirm whether the new packages resolve your problem, and close this issue if they do.

Hsuey commented 3 years ago

Hello, after I installed the nvidia-container-toolkit package, I still get this error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout:, stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded\\\\n \\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled

$ nvidia-container-runtime-hook
Usage of nvidia-container-runtime-hook:
  -config string
        configuration file
  -debug
        enable debug output

Commands:
  prestart
        run the prestart hook
  poststart
        no-op
  poststop
        no-op

$ nvidia-container-toolkit
Usage of nvidia-container-toolkit:
  -config string
        configuration file
  -debug
        enable debug output

Commands:
  prestart
        run the prestart hook
  poststart
        no-op
  poststop
        no-op

My system is: Ubuntu 20.04 (Windows10 WSL2)

What can I do to solve this problem? @klueska

klueska commented 3 years ago

@Hsuey What nvidia-driver version do you have installed?
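
The driver version is separate from the CUDA Toolkit version, and each is reported by a different tool. A minimal sketch for reading both, assuming a Linux shell; the function name driver_version is hypothetical. On WSL2 the driver is installed on the Windows side, so these checks may return nothing there.

```shell
#!/bin/sh
# Hypothetical helper: print the NVIDIA driver version, or "unknown".
driver_version() {
    v=""
    if command -v nvidia-smi >/dev/null 2>&1; then
        # Prints one line per GPU; keep the first.
        v=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
    fi
    if [ -z "$v" ] && [ -r /proc/driver/nvidia/version ]; then
        # Present on native Linux only while the NVIDIA kernel module is loaded.
        v=$(head -n 1 /proc/driver/nvidia/version)
    fi
    echo "${v:-unknown}"
}

echo "driver:  $(driver_version)"
# The toolkit version comes from nvcc, if the CUDA Toolkit is installed.
if command -v nvcc >/dev/null 2>&1; then
    echo "toolkit: $(nvcc --version | grep release)"
fi
```

A result of "unknown" for the driver is consistent with the "nvml error: driver not loaded" message, although it does not by itself explain the cause.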

Hsuey commented 3 years ago

@klueska The version I installed is CUDA Toolkit 11.1.

Hsuey commented 3 years ago

@klueska How can I resolve this problem?

klueska commented 3 years ago

I just saw your system is Ubuntu 20.04 (Windows10 WSL2). I'm not that familiar with debugging issues on WSL2. Hopefully @dualvtable can help.

Hsuey commented 3 years ago

@dualvtable Can you help me resolve this problem?

klueska commented 3 years ago

As far as I know, running the device plugin on WSL2 is not yet supported. @dualvtable can comment more, but I'm pretty sure it won't work because the device plugin requires NVML, which is not available in a WSL2 environment.
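
Whether a shell is running under WSL can be checked before deploying the device plugin. The sketch below relies on a common heuristic (WSL kernels carry "microsoft" in their version string), not on a documented interface; the function name is_wsl and the /dev/dxg check are assumptions added for illustration.

```shell
#!/bin/sh
# Heuristic WSL check: look for "microsoft" in the kernel version string.
is_wsl() {
    # $1: file holding the kernel version string (defaults to /proc/version)
    grep -qi 'microsoft' "${1:-/proc/version}" 2>/dev/null
}

if is_wsl; then
    echo "WSL detected: the device plugin may not work here"
    # Assumption: WSL2 exposes the paravirtualized GPU as /dev/dxg.
    if [ -e /dev/dxg ]; then
        echo "/dev/dxg present"
    else
        echo "/dev/dxg missing"
    fi
else
    echo "not WSL: check that the NVIDIA kernel module is loaded"
fi
```

The optional file argument exists so the check can be exercised against a saved version string from another machine.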

Hsuey commented 3 years ago

@klueska But according to this document, https://docs.nvidia.com/cuda/wsl-user-guide/index.html, CUDA can run in WSL2. I failed to install it, though, and I don't know what went wrong. My problem: when I run the following command, I get the error below.

docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded\\\\n\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled

Hsuey commented 3 years ago

Is this document fake? @klueska

dualvtable commented 3 years ago

hi @Hsuey - what Windows NVIDIA driver version do you have on the system? Did you download at least 465.12 from https://developer.nvidia.com/cuda/wsl and then follow the steps as described in the user guide?
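
The "at least 465.12" requirement can be checked mechanically once the installed version string is known. A minimal sketch, assuming a sort that supports -V (GNU coreutils and BusyBox do); the function name version_ge is hypothetical, and the installed version has to be supplied by hand.

```shell
#!/bin/sh
# Hypothetical helper: succeed when version $1 is greater than or equal to $2.
version_ge() {
    # sort -V orders dotted version strings; if the smaller of the two is $2,
    # then $1 is at least $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

installed="465.12"   # placeholder: replace with the version actually installed
if version_ge "$installed" "465.12"; then
    echo "driver $installed meets the 465.12 minimum"
else
    echo "driver $installed is older than 465.12"
fi
```

Comparing with sort -V avoids treating the version as a decimal number, which would mis-order strings such as 465.9 and 465.12.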

Hsuey commented 3 years ago

@dualvtable Yes, I installed the GeForce driver (465.12_gameready_win10-dch_64bit_international.exe).

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] commented 2 months ago

This issue was automatically closed due to inactivity.