NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Cannot pass through RTX 3090 into pod; Failed to initialize NVML: could not load NVML library. #263

Status: Open · davidho27941 opened this issue 3 years ago

davidho27941 commented 3 years ago

1. Issue or feature description

Cannot pass an RTX 3090 GPU through to a pod using the k8s-device-plugin (both the plain Kubernetes manifest and the Helm deployment failed).

2. Steps to reproduce the issue

My kubeadm version: 1.21.1. My kubectl version: 1.21.1. My kubelet version: 1.21.1. My CRI-O version: 1.21.1 (1.21 branch).

I was trying to create a cluster using the CRI-O container runtime and the Flannel CNI.

My command for initializing the cluster: sudo kubeadm init --cri-socket /var/run/crio/crio.sock --pod-network-cidr 10.244.0.0/16

Adding flannel: kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

Adding NVIDIA's k8s-device-plugin: kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
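
(For reference, the Helm-based deployment mentioned above is typically done along these lines, following the project README; the chart repository URL and version pin shown here are the documented ones for that era and are not taken from this issue:)

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install --version=0.9.0 --generate-name nvdp/nvidia-device-plugin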

Then, the logs reported by the nvidia-device-plugin-daemonset-llthp pod are shown below:

2021/08/30 06:04:38 Loading NVML
2021/08/30 06:04:38 Failed to initialize NVML: could not load NVML library.
2021/08/30 06:04:38 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/08/30 06:04:38 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/08/30 06:04:38 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2021/08/30 06:04:38 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes

When I try to create a pod using the following YAML:

apiVersion: v1
kind: Pod
metadata:
  name: torch
  labels:
    app: torch
spec:
  containers:
  - name: torch
    image: nvcr.io/nvidia/pytorch:21.03-py3
    #command: [ "/bin/bash", "-c", "--" ]
    #args: [ "while true; do sleep 30; done;" ]
    ports:
      - containerPort: 8888
        protocol: TCP
    resources:
      requests:
        nvidia.com/gpu: 1
        memory: "64Mi"
        cpu: "250m"
      limits:
        nvidia.com/gpu: 1
        memory: "128Mi"
        cpu: "500m"

Kubernetes failed to schedule the pod because the GPU resource was not available:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  15s (x3 over 92s)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.

But Docker works without error when I try to run:

 docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0

Output:

2021/08/30 10:38:09 Loading NVML
2021/08/30 10:38:09 Starting FS watcher.
2021/08/30 10:38:09 Starting OS watcher.
2021/08/30 10:38:09 Retreiving plugins.
2021/08/30 10:38:09 Starting GRPC server for 'nvidia.com/gpu'
2021/08/30 10:38:09 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/08/30 10:38:09 Registered device plugin for 'nvidia.com/gpu' with Kubelet

It seems Docker can pass the GPU through successfully, but Kubernetes does not.
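
(For reference, whether the plugin actually advertised the GPU resource to the node can be checked with something along these lines; srv1 is the node name that appears later in this issue:)

kubectl describe node srv1 | grep -i 'nvidia.com/gpu'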

Can anybody help me figure out the problem?

3. Information to attach (optional if deemed irrelevant)

Common error checking:

Timestamp      : Mon Aug 30 18:22:17 2021
Driver Version : 460.73.01
CUDA Version   : 11.2

Attached GPUs : 1 GPU 00000000:01:00.0 Product Name : GeForce RTX 3090 Product Brand : GeForce Display Mode : Enabled Display Active : Enabled Persistence Mode : Enabled MIG Mode Current : N/A Pending : N/A Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : N/A GPU UUID : GPU-948211b6-df7a-5768-ca7b-a84e23d9404d Minor Number : 0 VBIOS Version : 94.02.26.08.1C MultiGPU Board : No Board ID : 0x100 GPU Part Number : N/A Inforom Version Image Version : G001.0000.03.03 OEM Object : 2.0 ECC Object : N/A Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GPU Virtualization Mode Virtualization Mode : None Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x01 Device : 0x00 Domain : 0x0000 Device Id : 0x220410DE Bus Id : 00000000:01:00.0 Sub System Id : 0x403B1458 GPU Link Info PCIe Generation Max : 4 Current : 1 Link Width Max : 16x Current : 16x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 1000 KB/s Rx Throughput : 1000 KB/s Fan Speed : 41 % Performance State : P8 Clocks Throttle Reasons Idle : Active Applications Clocks Setting : Not Active SW Power Cap : Not Active HW Slowdown : Not Active HW Thermal Slowdown : Not Active HW Power Brake Slowdown : Not Active Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 24265 MiB Used : 1256 MiB Free : 23009 MiB BAR1 Memory Usage Total : 256 MiB Used : 14 MiB Free : 242 MiB Compute Mode : Default Utilization Gpu : 1 % Memory : 10 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : N/A Pending : N/A ECC Errors Volatile SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable : N/A Aggregate SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Remapped Rows : N/A Temperature GPU Current Temp : 48 C GPU Shutdown Temp : 98 C GPU Slowdown Temp : 95 C GPU Max Operating Temp : 93 C GPU Target Temperature : 83 C Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : Supported Power Draw : 34.64 W Power Limit : 350.00 W Default Power Limit : 350.00 W Enforced Power Limit : 350.00 W Min Power Limit : 100.00 W Max Power Limit : 350.00 W Clocks Graphics : 270 MHz SM : 270 MHz Memory : 405 MHz Video : 555 MHz Applications Clocks Graphics : N/A Memory : N/A Default Applications Clocks Graphics : N/A Memory : N/A Max Clocks Graphics : 2100 MHz SM : 2100 MHz Memory : 9751 MHz Video : 1950 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Processes GPU instance ID : N/A Compute instance ID : N/A Process ID : 2692 Type : G Name : /usr/lib/xorg/Xorg Used GPU Memory : 73 MiB GPU instance ID : N/A Compute instance ID : N/A Process ID : 3028 Type : G Name : /usr/bin/gnome-shell Used GPU Memory : 160 MiB GPU instance ID : N/A Compute instance ID : N/A Process ID : 5521 Type : G Name : /usr/lib/xorg/Xorg Used GPU Memory : 624 MiB GPU instance ID : N/A Compute instance ID : N/A Process ID : 5654 Type : G Name : /usr/bin/gnome-shell Used GPU Memory : 84 MiB GPU instance ID : N/A Compute instance ID : N/A Process ID : 8351 Type : G Name : /usr/share/skypeforlinux/skypeforlinux 
--type=gpu-process --field-trial-handle=2437345894369599647,6238031376657225521,131072 --enable-features=WebComponentsV0Enabled --disable-features=CookiesWithoutSameSiteMustBeSecure,SameSiteByDefaultCookies,SpareRendererForSitePerProcess --enable-crash-reporter=97d5b09d-f9b0-4336-bc9a-fe11870fe1b3,no_channel --global-crash-keys=97d5b09d-f9b0-4336-bc9a-fe11870fe1b3,no_channel,_companyName=Skype,_productName=skypeforlinux,_version=8.73.0.92 --gpu-preferences=OAAAAAAAAAAgAAAQAAAAAAAAAAAAAAAAAABgAAAAAAAYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAA== --shared-files Used GPU Memory : 14 MiB GPU instance ID : N/A Compute instance ID : N/A Process ID : 8560 Type : G Name : /opt/google/chrome/chrome --type=gpu-process --field-trial-handle=10043073040938675921,16429150098372267894,131072 --enable-crashpad --crashpad-handler-pid=8526 --enable-crash-reporter=a844a16f-8f0f-4770-87e1-a8389ca3c415, --gpu-preferences=UAAAAAAAAAAgAAAQAAAAAAAAAAAAAAAAAABgAAAAAAAwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAQAAABgAAAAAAAAAGAAAAAAAAAAIAAAAAAAAAAgAAAAAAAAACAAAAAAAAAA= --shared-files Used GPU Memory : 91 MiB GPU instance ID : N/A Compute instance ID : N/A Process ID : 8582 Type : G Name : /usr/lib/firefox/firefox Used GPU Memory : 178 MiB GPU instance ID : N/A Compute instance ID : N/A Process ID : 9139 Type : G Name : /usr/lib/firefox/firefox Used GPU Memory : 4 MiB GPU instance ID : N/A Compute instance ID : N/A Process ID : 9931 Type : G Name : gnome-control-center Used GPU Memory : 4 MiB GPU instance ID : N/A Compute instance ID : N/A Process ID : 11503 Type : G Name : /usr/lib/firefox/firefox Used GPU Memory : 4 MiB GPU instance ID : N/A Compute instance ID : N/A Process ID : 64276 Type : G Name : /usr/lib/firefox/firefox Used GPU Memory : 4 MiB GPU instance ID : N/A Compute instance ID : N/A Process ID : 78463 Type : G Name : /usr/lib/firefox/firefox Used GPU Memory : 4 MiB

 - [x] Your docker configuration file (e.g: `/etc/docker/daemon.json`)

{ "exec-opts": ["native.cgroupdriver=systemd"], "log-driver": "json-file", "log-opts": { "max-size": "100m" }, "storage-driver": "overlay2", "default-runtime": "nvidia", "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } } }

 - [x] The k8s-device-plugin container logs

2021/08/30 06:04:38 Loading NVML
2021/08/30 06:04:38 Failed to initialize NVML: could not load NVML library.
2021/08/30 06:04:38 If this is a GPU node, did you set the docker default runtime to nvidia?
2021/08/30 06:04:38 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/08/30 06:04:38 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2021/08/30 06:04:38 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes

 - [x] The kubelet logs on the node (e.g: `sudo journalctl -r -u kubelet`)

八 30 14:12:23 srv1 kubelet[108111]: I0830 14:12:23.643580 108111 eviction_manager.go:346] "Eviction manager: able to reduce resource pressure without evicting pods." resourceName="ephemeral-storage" 八 30 14:12:23 srv1 kubelet[108111]: I0830 14:12:23.457677 108111 eviction_manager.go:425] "Eviction manager: unexpected error when attempting to reduce resource pressure" resourceName="ephemeral-storage" err="wanted to free 9223372036854775807 bytes, but freed 14575560277 bytes space with errors in image deletion: [rpc error: code = U 八 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.404808 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="ad8c213c76c5990969673d7a22ed6bce9d13e6cdd613fefd2db967a03e1cd816" size=14575560277 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404791 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 864db3a48c0a2753840a7f994873c2c5af696d6765aeb229b49e455ea5e98c4c: image is in use by a container" image="296a6d5035e2d6919249e02709a488d680ddca91357602bd65e605eac967b89 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404762 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 864db3a48c0a2753840a7f994873c2c5af696d6765aeb229b49e455ea5e98c4c: image is in use by a container" image="296a6d5035e2d6919249e02709a488d680ddca91357602bd65e60 八 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.404494 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="296a6d5035e2d6919249e02709a488d680ddca91357602bd65e605eac967b899" size=42585056 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404479 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by aa72c61c4181efcc0f55c70f42078481cc0af69654343aa98edd6bfac63290ba: image is in use by a container" image="8522d622299ca431311ac69992419c956fbaca6fa8289c76810c9399d17c69d 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404467 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by aa72c61c4181efcc0f55c70f42078481cc0af69654343aa98edd6bfac63290ba: image is in use by a container" image="8522d622299ca431311ac69992419c956fbaca6fa8289c76810c9 八 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.404230 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="8522d622299ca431311ac69992419c956fbaca6fa8289c76810c9399d17c69de" size=68899837 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404212 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by b53c6818067e6b95f5e4689d991f86524bb4e47baec455a0211168b321e1af1b: image is in use by a container" image="37b8c3899b153afc2c7e65e1939330654276560b8b5f6dffdfd466bd8b4f7ef 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404187 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by b53c6818067e6b95f5e4689d991f86524bb4e47baec455a0211168b321e1af1b: image is in use by a container" image="37b8c3899b153afc2c7e65e1939330654276560b8b5f6dffdfd46 八 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.403939 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="37b8c3899b153afc2c7e65e1939330654276560b8b5f6dffdfd466bd8b4f7ef8" size=195847465 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403932 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc 
= Image used by 83c0ea8f464dc205726d29d407f564b5115e9b80bd65bac2f087463d80ff95ed: image is in use by a container" image="2c25d0f89db7a9dba5ed71b692b65e86b0ad9fcab1a9f94e946c05db18776ab 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403920 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 83c0ea8f464dc205726d29d407f564b5115e9b80bd65bac2f087463d80ff95ed: image is in use by a container" image="2c25d0f89db7a9dba5ed71b692b65e86b0ad9fcab1a9f94e946c0 八 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.403680 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="2c25d0f89db7a9dba5ed71b692b65e86b0ad9fcab1a9f94e946c05db18776ab3" size=121095258 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403673 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by ca4d555dce70b78abd85986745371d98c2028590ae058e2320ce457f5fec0b30: image is in use by a container" image="0369cf4303ffdb467dc219990960a9baa8512a54b0ad9283eaf55bd6c0adb93 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403663 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by ca4d555dce70b78abd85986745371d98c2028590ae058e2320ce457f5fec0b30: image is in use by a container" image="0369cf4303ffdb467dc219990960a9baa8512a54b0ad9283eaf55 八 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.403428 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="0369cf4303ffdb467dc219990960a9baa8512a54b0ad9283eaf55bd6c0adb934" size=254662613 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403422 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by d4a7103f1e4829474bab231668d0377b97fc222e2a4b4332a669e912b863175a: image is in use by a container" image="993d3ec13feb2e7b7e9bd6ac4831fb0cdae7329a8e8f1e285d9f2790004b2fe 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403412 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by d4a7103f1e4829474bab231668d0377b97fc222e2a4b4332a669e912b863175a: image is in use by a container" image="993d3ec13feb2e7b7e9bd6ac4831fb0cdae7329a8e8f1e285d9f2 八 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.403187 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="993d3ec13feb2e7b7e9bd6ac4831fb0cdae7329a8e8f1e285d9f2790004b2fe3" size=51893338 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403180 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 40de580961ae274afef6eb2737f313bc8637ac21fc42fa53863a97523c07c831: image is in use by a container" image="cef7457710b1ace64357066aea33117083dfec9a023cade594cc16c7a81d936 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403171 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 40de580961ae274afef6eb2737f313bc8637ac21fc42fa53863a97523c07c831: image is in use by a container" image="cef7457710b1ace64357066aea33117083dfec9a023cade594cc1 八 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.402907 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="cef7457710b1ace64357066aea33117083dfec9a023cade594cc16c7a81d936b" size=126883060 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.402897 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 
63d4a6aaa8f530cb3e33f02af9262d2ffd20f076b5803bc1ea1f03fc29f9ebf3: image is in use by a container" image="ef4bce0a7569b4fa83a559717c608c076a2c9d30361eb059ea4e1b7a55424d6 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.402886 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 63d4a6aaa8f530cb3e33f02af9262d2ffd20f076b5803bc1ea1f03fc29f9ebf3: image is in use by a container" image="ef4bce0a7569b4fa83a559717c608c076a2c9d30361eb059ea4e1 八 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.402498 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="ef4bce0a7569b4fa83a559717c608c076a2c9d30361eb059ea4e1b7a55424d68" size=105130216 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.402486 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 9270341c09e80de42955681f04bb0baaac9f931e7e4eb6aa400a7419337e107b: image is in use by a container" image="ed210e3e4a5bae1237f1bb44d72a05a2f1e5c6bfe7a7e73da179e2534269c45 八 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.402467 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 9270341c09e80de42955681f04bb0baaac9f931e7e4eb6aa400a7419337e107b: image is in use by a container" image="ed210e3e4a5bae1237f1bb44d72a05a2f1e5c6bfe7a7e73da179e 八 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.402130 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="ed210e3e4a5bae1237f1bb44d72a05a2f1e5c6bfe7a7e73da179e2534269c459" size=689969 八 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.400313 108111 image_gc_manager.go:321] "Attempting to delete unused images" 八 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.398657 108111 container_gc.go:85] "Attempting to delete unused containers" 八 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.398622 108111 eviction_manager.go:339] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage" 八 30 14:12:10 srv1 kubelet[108111]: I0830 14:12:10.205926 108111 eviction_manager.go:391] "Eviction manager: unable to evict any pods from the node"

Additional information that might help better understand your environment and reproduce the bug:
 - [x] Docker version from `docker version`

Client: Docker Engine - Community
 Version:           20.10.0
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        7287ab3
 Built:             Tue Dec 8 18:59:53 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.0
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       eeddea2
  Built:            Tue Dec 8 18:57:44 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.9
  GitCommit:        e25210fe30a0a703442421b0f60afac609f950a3
 nvidia:
  Version:          1.0.1
  GitCommit:        v1.0.1-0-g4144b63
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

 - [ ] Docker command, image and tag used
 - [x] Kernel version from `uname -a`

Linux srv1 5.4.0-56-generic #62~18.04.1-Ubuntu SMP Tue Nov 24 10:07:50 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

 - [ ] Any relevant kernel output lines from `dmesg`
 - [x] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`

||/ Name Version Architecture Description +++-===============================================================================-============================================-============================================-=================================================================================================================================================================== un libgldispatch0-nvidia (no description available) ii libnvidia-cfg1-460:amd64 460.73.01-0ubuntu1 amd64 NVIDIA binary OpenGL/GLX configuration library un libnvidia-cfg1-any (no description available) un libnvidia-common (no description available) ii libnvidia-common-460 460.73.01-0ubuntu1 all Shared files used by the NVIDIA libraries ii libnvidia-compute-460:amd64 460.73.01-0ubuntu1 amd64 NVIDIA libcompute package ii libnvidia-container-tools 1.4.0-1 amd64 NVIDIA container runtime library (command-line tools) ii libnvidia-container1:amd64 1.4.0-1 amd64 NVIDIA container runtime library un libnvidia-decode (no description available) ii libnvidia-decode-460:amd64 460.73.01-0ubuntu1 amd64 NVIDIA Video Decoding runtime libraries un libnvidia-encode (no description available) ii libnvidia-encode-460:amd64 460.73.01-0ubuntu1 amd64 NVENC Video Encoding runtime library un libnvidia-extra (no description available) ii libnvidia-extra-460:amd64 460.73.01-0ubuntu1 amd64 Extra libraries for the NVIDIA driver un libnvidia-fbc1 (no description available) ii libnvidia-fbc1-460:amd64 460.73.01-0ubuntu1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library un libnvidia-gl (no description available) ii libnvidia-gl-460:amd64 460.73.01-0ubuntu1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD un libnvidia-ifr1 (no description available) ii libnvidia-ifr1-460:amd64 460.73.01-0ubuntu1 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library un libnvidia-ml1 (no description available) un nvidia-304 (no description available) un nvidia-340 (no description available) un nvidia-384 (no description available) un nvidia-390 (no description available) un nvidia-common (no description available) ii nvidia-compute-utils-460 460.73.01-0ubuntu1 amd64 NVIDIA compute utilities ii nvidia-container-runtime 3.5.0-1 amd64 NVIDIA container runtime un nvidia-container-runtime-hook (no description available) ii nvidia-container-toolkit 1.5.1-1 amd64 NVIDIA container runtime hook ii nvidia-cuda-dev 9.1.85-3ubuntu1 amd64 NVIDIA CUDA development files ii nvidia-cuda-doc 9.1.85-3ubuntu1 all NVIDIA CUDA and OpenCL documentation ii nvidia-cuda-gdb 9.1.85-3ubuntu1 amd64 NVIDIA CUDA Debugger (GDB) ii nvidia-cuda-toolkit 9.1.85-3ubuntu1 amd64 NVIDIA CUDA development toolkit ii nvidia-dkms-460 460.73.01-0ubuntu1 amd64 NVIDIA DKMS package un nvidia-dkms-kernel (no description available) un nvidia-driver (no description available) ii nvidia-driver-460 460.73.01-0ubuntu1 amd64 NVIDIA driver metapackage un nvidia-driver-binary (no description available) un nvidia-kernel-common (no description available) ii nvidia-kernel-common-460 460.73.01-0ubuntu1 amd64 Shared files used with the kernel module un nvidia-kernel-source (no description available) ii nvidia-kernel-source-460 460.73.01-0ubuntu1 amd64 NVIDIA kernel source package un nvidia-legacy-304xx-vdpau-driver (no description available) un nvidia-legacy-340xx-vdpau-driver (no description available) un nvidia-libopencl1 (no description available) un nvidia-libopencl1-dev (no description available) ii nvidia-modprobe 465.19.01-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device 
files ii nvidia-opencl-dev:amd64 9.1.85-3ubuntu1 amd64 NVIDIA OpenCL development files un nvidia-opencl-icd (no description available) un nvidia-persistenced (no description available) ii nvidia-prime 0.8.16~0.18.04.1 all Tools to enable NVIDIA's Prime ii nvidia-profiler 9.1.85-3ubuntu1 amd64 NVIDIA Profiler for CUDA and OpenCL ii nvidia-settings 465.19.01-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver un nvidia-settings-binary (no description available) un nvidia-smi (no description available) un nvidia-utils (no description available) ii nvidia-utils-460 460.73.01-0ubuntu1 amd64 NVIDIA driver support binaries un nvidia-vdpau-driver (no description available) ii nvidia-visual-profiler 9.1.85-3ubuntu1 amd64 NVIDIA Visual Profiler for CUDA and OpenCL ii xserver-xorg-video-nvidia-460 460.73.01-0ubuntu1 amd64 NVIDIA binary Xorg driver

 - [x] NVIDIA container library version from `nvidia-container-cli -V`

version: 1.4.0
build date: 2021-04-24T14:25+00:00
build revision: 704a698b7a0ceec07a48e56c37365c741718c2df
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections


 - [ ] NVIDIA container library logs (see [troubleshooting](https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting))

elezar commented 3 years ago

@davidho27941 I see from your description that you are installing version 1.0.0-beta4 of the device plugin:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml

The versioning of the NVIDIA device plugin is a little confusing: the 1.0.0-beta* tags are older than v0.9.0, which is the latest release: https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.9.0

Could you see whether using this (or one of the more recent releases) addresses your issue?
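
(For example, the static deployment manifest for that tag should be deployable with something like the following; the path follows the same pattern as the beta manifest above:)

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml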

davidho27941 commented 3 years ago

@davidho27941 I see from your description that you are installing version 1.0.0-beta4 of the device plugin:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml

The versioning of the NVIDIA device plugin is a little confusing: the 1.0.0-beta* tags are older than v0.9.0, which is the latest release: https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.9.0

Could you see whether using this (or one of the more recent releases) addresses your issue?

Hi @elezar,

Actually, I also failed with that version.

The v0.9.0 one also failed to load the NVML library.

I just did a check; the output is shown below:

2021/08/30 10:44:12 Loading NVML
2021/08/30 10:44:12 Failed to initialize NVML: could not load NVML library.
2021/08/30 10:44:12 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/08/30 10:44:12 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/08/30 10:44:12 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2021/08/30 10:44:12 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes

The output of kubectl describe pod -n kube-system nvidia-device-plugin-daemonset-rwng2:

Name:                 nvidia-device-plugin-daemonset-rwng2
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 srv1/192.168.50.248
Start Time:           Mon, 30 Aug 2021 18:44:02 +0800
Labels:               controller-revision-hash=9d47c6878
                      name=nvidia-device-plugin-ds
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod: 
Status:               Running
IP:                   10.244.0.8
IPs:
  IP:           10.244.0.8
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  cri-o://7a4820d5ba7d657245b1a8300519bcfda0a1ccd73d33d848f7762ba5e19a4b47
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.9.0
    Image ID:      nvcr.io/nvidia/k8s-device-plugin@sha256:964847cc3fd85ead286be1d74d961f53d638cd4875af51166178b17bba90192f
    Port:          <none>
    Host Port:     <none>
    Args:
      --fail-on-init-error=false
    State:          Running
      Started:      Mon, 30 Aug 2021 18:44:12 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pv8tx (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  kube-api-access-pv8tx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  44s   default-scheduler  Successfully assigned kube-system/nvidia-device-plugin-daemonset-rwng2 to srv1
  Normal  Pulling    43s   kubelet            Pulling image "nvcr.io/nvidia/k8s-device-plugin:v0.9.0"
  Normal  Pulled     35s   kubelet            Successfully pulled image "nvcr.io/nvidia/k8s-device-plugin:v0.9.0" in 8.591371132s
  Normal  Created    34s   kubelet            Created container nvidia-device-plugin-ctr
  Normal  Started    34s   kubelet            Started container nvidia-device-plugin-ctr

Best regards, David

elezar commented 3 years ago

You also mentioned:

I was trying to create a cluster using the CRI-O container runtime and the Flannel CNI.

Does this mean that K8s is using CRI-O to launch containers? Has CRI-O been configured to use the NVIDIA Container Runtime, or does it have the NVIDIA Container Toolkit / hook configured?
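
(For reference, a minimal sketch of the two usual ways to do this, using the common package-default paths; these are assumptions for illustration, not taken from this issue. Either register an nvidia runtime with CRI-O, e.g. in a drop-in such as /etc/crio/crio.conf.d/99-nvidia.conf:)

[crio.runtime]
default_runtime = "nvidia"

[crio.runtime.runtimes.nvidia]
runtime_path = "/usr/bin/nvidia-container-runtime"
runtime_type = "oci"

(or rely on the OCI prestart hook that the nvidia-container-toolkit package typically installs at /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json:)

{
  "version": "1.0.0",
  "hook": {
    "path": "/usr/bin/nvidia-container-toolkit",
    "args": ["nvidia-container-toolkit", "prestart"]
  },
  "when": {
    "always": true,
    "commands": [".*"]
  },
  "stages": ["prestart"]
}

(Either way, CRI-O needs a restart afterwards, e.g. sudo systemctl restart crio.)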

davidho27941 commented 3 years ago

You also mentioned:

I was trying to create a cluster using the CRI-O container runtime and the Flannel CNI.

Does this mean that K8s is using CRI-O to launch containers? Has CRI-O been configured to use the NVIDIA Container Runtime, or does it have the NVIDIA Container Toolkit / hook configured?

Actually, I am not sure about this, but I also failed when running with containerd.

The following is my configuration for containerd:

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = 0

[grpc]
  address = "/run/containerd/containerd.sock"
  tcp_address = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[ttrpc]
  address = ""
  uid = 0
  gid = 0

[debug]
  address = ""
  uid = 0
  gid = 0
  level = ""

[metrics]
  address = ""
  grpc_histogram = false

[cgroup]
  path = ""

[timeouts]
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[plugins]
  [plugins."io.containerd.gc.v1.scheduler"]
    pause_threshold = 0.02
    deletion_threshold = 0
    mutation_threshold = 100
    schedule_delay = "0s"
    startup_delay = "100ms"
  [plugins."io.containerd.grpc.v1.cri"]
    disable_tcp_service = true
    stream_server_address = "127.0.0.1"
    stream_server_port = "0"
    stream_idle_timeout = "4h0m0s"
    enable_selinux = false
    selinux_category_range = 1024
    sandbox_image = "k8s.gcr.io/pause:3.2"
    stats_collect_period = 10
    systemd_cgroup = false
    enable_tls_streaming = false
    max_container_log_line_size = 16384
    disable_cgroup = false
    disable_apparmor = false
    restrict_oom_score_adj = false
    max_concurrent_downloads = 3
    disable_proc_mount = false
    unset_seccomp_profile = ""
    tolerate_missing_hugetlb_controller = true
    disable_hugetlb_controller = true
    ignore_image_defined_volumes = false
    [plugins."io.containerd.grpc.v1.cri".containerd]
      snapshotter = "overlayfs"
      default_runtime_name = "nvidia"
      no_pivot = false
      disable_snapshot_annotations = false
      discard_unpacked_layers = false
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v1.linux"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false
          base_runtime_spec = ""
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            SystemdCgroup = true
            BinaryName="/usr/bin/nvidia-container-runtime"
    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
      max_conf_num = 1
      conf_template = ""
    [plugins."io.containerd.grpc.v1.cri".registry]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = ["https://registry-1.docker.io"]
    [plugins."io.containerd.grpc.v1.cri".image_decryption]
      key_model = ""
    [plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
      tls_cert_file = ""
      tls_key_file = ""
  [plugins."io.containerd.internal.v1.opt"]
    path = "/opt/containerd"
  [plugins."io.containerd.internal.v1.restart"]
    interval = "10s"
  [plugins."io.containerd.metadata.v1.bolt"]
    content_sharing_policy = "shared"
  [plugins."io.containerd.monitor.v1.cgroups"]
    no_prometheus = false
  [plugins."io.containerd.runtime.v1.linux"]
    shim = "containerd-shim"
    runtime = "runc"
    runtime_root = ""
    no_shim = false
    shim_debug = false
  [plugins."io.containerd.runtime.v2.task"]
    platforms = ["linux/amd64"]
  [plugins."io.containerd.service.v1.diff-service"]
    default = ["walking"]
  [plugins."io.containerd.snapshotter.v1.devmapper"]
    root_path = ""
    pool_name = ""
    base_image_size = ""
    async_remove = false
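
(For reference, with a runtime handler named nvidia defined as in the config above, a Kubernetes RuntimeClass can be used to select that runtime per pod instead of relying on the default runtime; a minimal sketch, not taken from this issue:)

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia

(A pod would then set runtimeClassName: nvidia in its spec.)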

Best regards, David

elezar commented 3 years ago

Since the image works with docker, it would appear as if your NVIDIA Container Toolkit installation is at least sane. In order to debug this further, could you uncomment the #debug = lines in /etc/nvidia-container-runtime/config.toml? Then run the nvidia-smi command in a container (ubuntu should do) using ctr and attach the contents of /var/log/nvidia-container-*.log to the issue. If you're able to clear those logs and then also include them when running the container using docker, that could provide a point for comparison.
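
(For reference, the relevant lines in /etc/nvidia-container-runtime/config.toml look roughly like this once uncommented; the log paths shown are the package defaults:)

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"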

davidho27941 commented 3 years ago

Since the image works with docker, it would appear as if your NVIDIA Container Toolkit installation is at least sane. In order to debug this further, could you uncomment the #debug = lines in /etc/nvidia-container-runtime/config.toml? Then run the nvidia-smi command in a container (ubuntu should do) using ctr and attach the contents of /var/log/nvidia-container-*.log to the issue. If you're able to clear those logs and then also include them when running the container using docker, that could provide a point for comparison.

Hi

The following output is the log produced when I run docker run nvidia/cuda:11.1-base nvidia-smi:

Partial output of /var/log/nvidia-container-toolkit.log:

-- WARNING, the following logs are for debugging purposes only --

I0830 11:16:34.890421 49163 nvc.c:372] initializing library context (version=1.4.0, build=704a698b7a0ceec07a48e56c37365c741718c2df)
I0830 11:16:34.890480 49163 nvc.c:346] using root /
I0830 11:16:34.890488 49163 nvc.c:347] using ldcache /etc/ld.so.cache
I0830 11:16:34.890496 49163 nvc.c:348] using unprivileged user 65534:65534
I0830 11:16:34.890515 49163 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0830 11:16:34.890627 49163 nvc.c:391] dxcore initialization failed, continuing assuming a non-WSL environment
I0830 11:16:34.892652 49169 nvc.c:274] loading kernel module nvidia
I0830 11:16:34.892812 49169 nvc.c:278] running mknod for /dev/nvidiactl
I0830 11:16:34.892841 49169 nvc.c:282] running mknod for /dev/nvidia0
I0830 11:16:34.892858 49169 nvc.c:286] running mknod for all nvcaps in /dev/nvidia-caps
I0830 11:16:34.898656 49169 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0830 11:16:34.898750 49169 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0830 11:16:34.900715 49169 nvc.c:292] loading kernel module nvidia_uvm
I0830 11:16:34.900784 49169 nvc.c:296] running mknod for /dev/nvidia-uvm
I0830 11:16:34.900846 49169 nvc.c:301] loading kernel module nvidia_modeset
I0830 11:16:34.900908 49169 nvc.c:305] running mknod for /dev/nvidia-modeset
I0830 11:16:34.901104 49171 driver.c:101] starting driver service
I0830 11:16:34.903318 49163 nvc_container.c:388] configuring container with 'compute utility supervised'
I0830 11:16:34.903488 49163 nvc_container.c:236] selecting /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/local/cuda-11.1/compat/libcuda.so.455.45.01
I0830 11:16:34.903523 49163 nvc_container.c:236] selecting /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/local/cuda-11.1/compat/libnvidia-ptxjitcompiler.so.455.45.01
I0830 11:16:34.903657 49163 nvc_container.c:408] setting pid to 49105
I0830 11:16:34.903665 49163 nvc_container.c:409] setting rootfs to /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged
I0830 11:16:34.903670 49163 nvc_container.c:410] setting owner to 0:0
I0830 11:16:34.903675 49163 nvc_container.c:411] setting bins directory to /usr/bin
I0830 11:16:34.903680 49163 nvc_container.c:412] setting libs directory to /usr/lib/x86_64-linux-gnu
I0830 11:16:34.903685 49163 nvc_container.c:413] setting libs32 directory to /usr/lib/i386-linux-gnu
I0830 11:16:34.903690 49163 nvc_container.c:414] setting cudart directory to /usr/local/cuda
I0830 11:16:34.903695 49163 nvc_container.c:415] setting ldconfig to @/sbin/ldconfig.real (host relative)
I0830 11:16:34.903700 49163 nvc_container.c:416] setting mount namespace to /proc/49105/ns/mnt
I0830 11:16:34.903705 49163 nvc_container.c:418] setting devices cgroup to /sys/fs/cgroup/devices/system.slice/docker-19816aa85232e1fc7d31970489ccced5c68acbfc9f97d625ffc17387bb2e77fd.scope
I0830 11:16:34.903712 49163 nvc_info.c:676] requesting driver information with ''
I0830 11:16:34.904962 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.460.73.01
I0830 11:16:34.905038 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.460.73.01
I0830 11:16:34.905070 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.460.73.01
I0830 11:16:34.905103 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.73.01
I0830 11:16:34.905145 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.460.73.01
I0830 11:16:34.905186 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.73.01
I0830 11:16:34.905215 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.460.73.01
I0830 11:16:34.905249 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.73.01
I0830 11:16:34.905290 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.460.73.01
I0830 11:16:34.905354 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.460.73.01
I0830 11:16:34.905383 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.73.01
I0830 11:16:34.905413 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.460.73.01
I0830 11:16:34.905442 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.460.73.01
I0830 11:16:34.905483 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.460.73.01
I0830 11:16:34.905525 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.460.73.01
I0830 11:16:34.905555 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.73.01
I0830 11:16:34.905586 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.73.01
I0830 11:16:34.905625 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.460.73.01
I0830 11:16:34.905653 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.73.01
I0830 11:16:34.905694 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.460.73.01
I0830 11:16:34.906212 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.460.73.01
I0830 11:16:34.906391 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.460.73.01
I0830 11:16:34.906423 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.460.73.01
I0830 11:16:34.906454 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.460.73.01
I0830 11:16:34.906486 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.460.73.01
W0830 11:16:34.906560 49163 nvc_info.c:350] missing library libnvidia-nscq.so
W0830 11:16:34.906567 49163 nvc_info.c:350] missing library libnvidia-fatbinaryloader.so
W0830 11:16:34.906573 49163 nvc_info.c:350] missing library libvdpau_nvidia.so
W0830 11:16:34.906579 49163 nvc_info.c:354] missing compat32 library libnvidia-ml.so
W0830 11:16:34.906585 49163 nvc_info.c:354] missing compat32 library libnvidia-cfg.so
W0830 11:16:34.906592 49163 nvc_info.c:354] missing compat32 library libnvidia-nscq.so
W0830 11:16:34.906598 49163 nvc_info.c:354] missing compat32 library libcuda.so
W0830 11:16:34.906604 49163 nvc_info.c:354] missing compat32 library libnvidia-opencl.so
W0830 11:16:34.906610 49163 nvc_info.c:354] missing compat32 library libnvidia-ptxjitcompiler.so
W0830 11:16:34.906616 49163 nvc_info.c:354] missing compat32 library libnvidia-fatbinaryloader.so
W0830 11:16:34.906622 49163 nvc_info.c:354] missing compat32 library libnvidia-allocator.so
W0830 11:16:34.906628 49163 nvc_info.c:354] missing compat32 library libnvidia-compiler.so
W0830 11:16:34.906634 49163 nvc_info.c:354] missing compat32 library libnvidia-ngx.so
W0830 11:16:34.906640 49163 nvc_info.c:354] missing compat32 library libvdpau_nvidia.so
W0830 11:16:34.906646 49163 nvc_info.c:354] missing compat32 library libnvidia-encode.so
W0830 11:16:34.906652 49163 nvc_info.c:354] missing compat32 library libnvidia-opticalflow.so
W0830 11:16:34.906658 49163 nvc_info.c:354] missing compat32 library libnvcuvid.so
W0830 11:16:34.906664 49163 nvc_info.c:354] missing compat32 library libnvidia-eglcore.so
W0830 11:16:34.906670 49163 nvc_info.c:354] missing compat32 library libnvidia-glcore.so
W0830 11:16:34.906676 49163 nvc_info.c:354] missing compat32 library libnvidia-tls.so
W0830 11:16:34.906682 49163 nvc_info.c:354] missing compat32 library libnvidia-glsi.so
W0830 11:16:34.906688 49163 nvc_info.c:354] missing compat32 library libnvidia-fbc.so
W0830 11:16:34.906694 49163 nvc_info.c:354] missing compat32 library libnvidia-ifr.so
W0830 11:16:34.906700 49163 nvc_info.c:354] missing compat32 library libnvidia-rtcore.so
W0830 11:16:34.906706 49163 nvc_info.c:354] missing compat32 library libnvoptix.so
W0830 11:16:34.906712 49163 nvc_info.c:354] missing compat32 library libGLX_nvidia.so
W0830 11:16:34.906718 49163 nvc_info.c:354] missing compat32 library libEGL_nvidia.so
W0830 11:16:34.906729 49163 nvc_info.c:354] missing compat32 library libGLESv2_nvidia.so
W0830 11:16:34.906735 49163 nvc_info.c:354] missing compat32 library libGLESv1_CM_nvidia.so
W0830 11:16:34.906741 49163 nvc_info.c:354] missing compat32 library libnvidia-glvkspirv.so
W0830 11:16:34.906747 49163 nvc_info.c:354] missing compat32 library libnvidia-cbl.so
I0830 11:16:34.907014 49163 nvc_info.c:276] selecting /usr/bin/nvidia-smi
I0830 11:16:34.907032 49163 nvc_info.c:276] selecting /usr/bin/nvidia-debugdump
I0830 11:16:34.907049 49163 nvc_info.c:276] selecting /usr/bin/nvidia-persistenced
I0830 11:16:34.907074 49163 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-control
I0830 11:16:34.907090 49163 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-server
W0830 11:16:34.907173 49163 nvc_info.c:376] missing binary nv-fabricmanager
I0830 11:16:34.907195 49163 nvc_info.c:438] listing device /dev/nvidiactl
I0830 11:16:34.907201 49163 nvc_info.c:438] listing device /dev/nvidia-uvm
I0830 11:16:34.907207 49163 nvc_info.c:438] listing device /dev/nvidia-uvm-tools
I0830 11:16:34.907213 49163 nvc_info.c:438] listing device /dev/nvidia-modeset
I0830 11:16:34.907236 49163 nvc_info.c:317] listing ipc /run/nvidia-persistenced/socket
W0830 11:16:34.907256 49163 nvc_info.c:321] missing ipc /var/run/nvidia-fabricmanager/socket
W0830 11:16:34.907269 49163 nvc_info.c:321] missing ipc /tmp/nvidia-mps
I0830 11:16:34.907276 49163 nvc_info.c:733] requesting device information with ''
I0830 11:16:34.913002 49163 nvc_info.c:623] listing device /dev/nvidia0 (GPU-948211b6-df7a-5768-ca7b-a84e23d9404d at 00000000:01:00.0)
I0830 11:16:34.913062 49163 nvc_mount.c:344] mounting tmpfs at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/proc/driver/nvidia
I0830 11:16:34.913573 49163 nvc_mount.c:112] mounting /usr/bin/nvidia-smi at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/bin/nvidia-smi
I0830 11:16:34.913629 49163 nvc_mount.c:112] mounting /usr/bin/nvidia-debugdump at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/bin/nvidia-debugdump
I0830 11:16:34.913678 49163 nvc_mount.c:112] mounting /usr/bin/nvidia-persistenced at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/bin/nvidia-persistenced
I0830 11:16:34.913723 49163 nvc_mount.c:112] mounting /usr/bin/nvidia-cuda-mps-control at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/bin/nvidia-cuda-mps-control
I0830 11:16:34.913769 49163 nvc_mount.c:112] mounting /usr/bin/nvidia-cuda-mps-server at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/bin/nvidia-cuda-mps-server
I0830 11:16:34.913912 49163 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.73.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.73.01
I0830 11:16:34.913967 49163 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.73.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.73.01
I0830 11:16:34.914014 49163 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libcuda.so.460.73.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libcuda.so.460.73.01
I0830 11:16:34.914060 49163 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.73.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.73.01
I0830 11:16:34.914109 49163 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.73.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.73.01
I0830 11:16:34.914165 49163 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.73.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.73.01
I0830 11:16:34.914212 49163 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.73.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.73.01
I0830 11:16:34.914241 49163 nvc_mount.c:524] creating symlink /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
I0830 11:16:34.914325 49163 nvc_mount.c:112] mounting /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/local/cuda-11.1/compat/libcuda.so.455.45.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libcuda.so.455.45.01
I0830 11:16:34.914378 49163 nvc_mount.c:112] mounting /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/local/cuda-11.1/compat/libnvidia-ptxjitcompiler.so.455.45.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.455.45.01
I0830 11:16:34.914499 49163 nvc_mount.c:239] mounting /run/nvidia-persistenced/socket at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/run/nvidia-persistenced/socket
I0830 11:16:34.914547 49163 nvc_mount.c:208] mounting /dev/nvidiactl at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/dev/nvidiactl
I0830 11:16:34.914582 49163 nvc_mount.c:499] whitelisting device node 195:255
I0830 11:16:34.914624 49163 nvc_mount.c:208] mounting /dev/nvidia-uvm at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/dev/nvidia-uvm
I0830 11:16:34.914649 49163 nvc_mount.c:499] whitelisting device node 508:0
I0830 11:16:34.914680 49163 nvc_mount.c:208] mounting /dev/nvidia-uvm-tools at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/dev/nvidia-uvm-tools
I0830 11:16:34.914704 49163 nvc_mount.c:499] whitelisting device node 508:1
I0830 11:16:34.914751 49163 nvc_mount.c:208] mounting /dev/nvidia0 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/dev/nvidia0
I0830 11:16:34.914823 49163 nvc_mount.c:412] mounting /proc/driver/nvidia/gpus/0000:01:00.0 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/proc/driver/nvidia/gpus/0000:01:00.0
I0830 11:16:34.914850 49163 nvc_mount.c:499] whitelisting device node 195:0
I0830 11:16:34.914869 49163 nvc_ldcache.c:360] executing /sbin/ldconfig.real from host at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged
I0830 11:16:34.963847 49163 nvc.c:423] shutting down library context
I0830 11:16:34.964461 49171 driver.c:163] terminating driver service
I0830 11:16:34.964805 49163 driver.c:203] driver service terminated successfully

Partial output of /var/log/nvidia-container-runtime.log:

2021/08/30 19:14:07 No modification required
2021/08/30 19:14:07 Forwarding command to runtime
2021/08/30 19:14:07 Bundle directory path is empty, using working directory.
2021/08/30 19:14:07 Using bundle directory: /
2021/08/30 19:14:07 Using OCI specification file path: /config.json
2021/08/30 19:14:07 Looking for runtime binary 'docker-runc'
2021/08/30 19:14:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/30 19:14:07 Looking for runtime binary 'runc'
2021/08/30 19:14:07 Found runtime binary '/usr/bin/runc'
2021/08/30 19:14:07 Running /usr/bin/nvidia-container-runtime

2021/08/30 19:14:07 No modification required
2021/08/30 19:14:07 Forwarding command to runtime
2021/08/30 19:16:07 Using bundle directory: /run/containerd/io.containerd.runtime.v2.task/moby/19816aa85232e1fc7d31970489ccced5c68acbfc9f97d625ffc17387bb2e77fd
2021/08/30 19:16:07 Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/19816aa85232e1fc7d31970489ccced5c68acbfc9f97d625ffc17387bb2e77fd/config.json
2021/08/30 19:16:07 Looking for runtime binary 'docker-runc'
2021/08/30 19:16:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/30 19:16:07 Looking for runtime binary 'runc'
2021/08/30 19:16:07 Found runtime binary '/usr/bin/runc'
2021/08/30 19:16:07 Running /usr/bin/nvidia-container-runtime

2021/08/30 19:16:07 'create' command detected; modification required
2021/08/30 19:16:07 prestart hook path: /usr/bin/nvidia-container-runtime-hook

2021/08/30 19:16:07 Forwarding command to runtime
2021/08/30 19:16:07 Bundle directory path is empty, using working directory.
2021/08/30 19:16:07 Using bundle directory: /run/containerd/io.containerd.runtime.v2.task/moby/19816aa85232e1fc7d31970489ccced5c68acbfc9f97d625ffc17387bb2e77fd
2021/08/30 19:16:07 Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/19816aa85232e1fc7d31970489ccced5c68acbfc9f97d625ffc17387bb2e77fd/config.json
2021/08/30 19:16:07 Looking for runtime binary 'docker-runc'
2021/08/30 19:16:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/30 19:16:07 Looking for runtime binary 'runc'
2021/08/30 19:16:07 Found runtime binary '/usr/bin/runc'
2021/08/30 19:16:07 Running /usr/bin/nvidia-container-runtime

2021/08/30 19:16:07 No modification required
2021/08/30 19:16:07 Forwarding command to runtime
2021/08/30 19:16:07 Bundle directory path is empty, using working directory.
2021/08/30 19:16:07 Using bundle directory: /run/containerd/io.containerd.runtime.v2.task/moby/19816aa85232e1fc7d31970489ccced5c68acbfc9f97d625ffc17387bb2e77fd
2021/08/30 19:16:07 Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/19816aa85232e1fc7d31970489ccced5c68acbfc9f97d625ffc17387bb2e77fd/config.json
2021/08/30 19:16:07 Looking for runtime binary 'docker-runc'
2021/08/30 19:16:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/30 19:16:07 Looking for runtime binary 'runc'
2021/08/30 19:16:07 Found runtime binary '/usr/bin/runc'
2021/08/30 19:16:07 Running /usr/bin/nvidia-container-runtime

2021/08/30 19:16:07 No modification required
2021/08/30 19:16:07 Forwarding command to runtime

Best regards, David

davidho27941 commented 3 years ago

@elezar

Hi,

Maybe my description was a little unclear.

The current status is:

I start a Docker container using docker run --security-opt=no-new-privileges --cap-drop=ALL --restart always --network=none -dit -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0 to create the plugin socket for Kubernetes (following the steps described in https://github.com/NVIDIA/k8s-device-plugin#with-docker).

This container loads the NVML library successfully, and an nvidia.com/gpu resource is registered with Kubernetes.

2021/08/31 05:06:32 Loading NVML
2021/08/31 05:06:32 Starting FS watcher.
2021/08/31 05:06:32 Starting OS watcher.
2021/08/31 05:06:32 Retreiving plugins.
2021/08/31 05:06:32 Starting GRPC server for 'nvidia.com/gpu'
2021/08/31 05:06:32 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/08/31 05:06:32 Registered device plugin for 'nvidia.com/gpu' with Kubelet

But the pod created by the device-plugin DaemonSet still cannot load the NVML library.

The output of kubectl logs nvidia-device-plugin-daemonset-4ddpg -n kube-system:

2021/08/31 06:46:42 Loading NVML
2021/08/31 06:46:42 Failed to initialize NVML: could not load NVML library.
2021/08/31 06:46:42 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/08/31 06:46:42 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/08/31 06:46:42 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2021/08/31 06:46:42 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
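
As a side note, a generic way to double-check which container runtime the kubelet is actually talking to (commands shown for illustration; crictl's output format differs between containerd and CRI-O):

kubectl get nodes -o wide        # the CONTAINER-RUNTIME column shows cri-o:// or containerd://
sudo crictl info | grep -i -A2 defaultruntime   # on containerd, shows defaultRuntimeName from the CRI config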

Now I can create a pod with the following config, without the 0/1 nodes are available: 1 Insufficient nvidia.com/gpu error message.

apiVersion: v1
kind: Pod
metadata:
  name: torch
  labels:
    app: torch
spec:
  containers:
  - name: torch
    image: nvcr.io/nvidia/pytorch:21.03-py3
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
    ports:
      - containerPort: 8888
        protocol: TCP
    resources:
      requests:
        nvidia.com/gpu: 1
        memory: "64Mi"
        cpu: "250m"
        ephemeral-storage: "5G"
      limits:
        nvidia.com/gpu: 1
        memory: "128Mi"
        cpu: "500m"
        ephemeral-storage: "10G"
    volumeMounts:
      - mountPath: "/data"
        name: test-volume
  volumes: 
    - name: test-volume
      hostPath: 
        path: "/home/david/jupyter_hub"
        type: Directory

But I cannot run nvidia-smi inside the pod to fetch the GPU status, and torch.cuda.is_available() also returns False, telling me the container cannot access a GPU.
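
For completeness, the checks were along these lines (a sketch, assuming the pod name torch from the spec above):

kubectl exec -it torch -- nvidia-smi
kubectl exec -it torch -- python -c "import torch; print(torch.cuda.is_available())"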

Do you have any idea about this?

Many thanks, David

elezar commented 3 years ago

@davidho27941 thanks for the additional information. You mentioned in your description that k8s is configured to launch containers using crio:

I was trying to create a cluster using crio container runtime interface and flannel CNI.

This means that crio needs to be configured to use the nvidia-container-runtime or have the nvidia-container-toolkit installed as a prestart hook. You also mentioned that the container failed to launch with containerd.
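
For illustration, a minimal CRI-O drop-in along these lines is typically used (the file path and key names are assumptions; verify against your CRI-O version, and note that the toolkit can alternatively install an OCI prestart hook under /usr/share/containers/oci/hooks.d/):

# /etc/crio/crio.conf.d/99-nvidia.conf (hypothetical drop-in file)
[crio.runtime]
default_runtime = "nvidia"

[crio.runtime.runtimes.nvidia]
runtime_path = "/usr/bin/nvidia-container-runtime"
runtime_type = "oci"

followed by a restart of crio (e.g. sudo systemctl restart crio).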

Could you repeat the command:

docker run --security-opt=no-new-privileges --cap-drop=ALL --restart always --network=none -dit -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0

using ctr instead of docker:

ctr run --rm -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0

And include the latest lines of /var/log/nvidia-container-*.log for the failed container.
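
Note that ctr does not take docker-style -v bind mounts; an equivalent invocation would look roughly like the following (the fully-qualified image reference and the trailing container ID are placeholders):

ctr image pull docker.io/nvidia/k8s-device-plugin:v0.9.0
ctr run --rm --mount type=bind,src=/var/lib/kubelet/device-plugins,dst=/var/lib/kubelet/device-plugins,options=rbind:rw docker.io/nvidia/k8s-device-plugin:v0.9.0 device-plugin-test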

davidho27941 commented 3 years ago

@davidho27941 thanks for the additional information. You mentioned in your description that k8s is configured to launch containers using crio:

I was trying to create a cluster using crio container runtime interface and flannel CNI.

This means that crio needs to be configured to use the nvidia-container-runtime or have the nvidia-container-toolkit installed as a prestart hook. You also mentioned that the container failed to launch with containerd.

Could you repeat the command:

docker run --security-opt=no-new-privileges --cap-drop=ALL --restart always --network=none -dit -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0

using ctr instead of docker:

ctr run --rm -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0

And include the latest lines of /var/log/nvidia-container-*.log for the failed container.

Hi @elezar ,

The current configuration is running with containerd. Based on your previous comment, I initialized my cluster with containerd using the config shown in https://github.com/NVIDIA/k8s-device-plugin/issues/263#issuecomment-908247909

Many thanks, David

davidho27941 commented 3 years ago

@davidho27941 thanks for the additional information. You mentioned in your description that k8s is configured to launch containers using crio:

I was trying to create a cluster using crio container runtime interface and flannel CNI.

This means that crio needs to be configured to use the nvidia-container-runtime or have the nvidia-container-toolkit installed as a prestart hook. You also mentioned that the container failed to launch with containerd.

Could you repeat the command:

docker run --security-opt=no-new-privileges --cap-drop=ALL --restart always --network=none -dit -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0

using ctr instead of docker:

ctr run --rm -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0

And include the latest lines of /var/log/nvidia-container-*.log for the failed container.

Hi

I ran the command and got the following outputs.

The output of /var/log/nvidia-container-runtime.log:

2021/08/31 15:56:07 No modification required
2021/08/31 15:56:07 Forwarding command to runtime
2021/08/31 15:56:07 Bundle directory path is empty, using working directory.
2021/08/31 15:56:07 Using bundle directory: /
2021/08/31 15:56:07 Using OCI specification file path: /config.json
2021/08/31 15:56:07 Looking for runtime binary 'docker-runc'
2021/08/31 15:56:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/31 15:56:07 Looking for runtime binary 'runc'
2021/08/31 15:56:07 Found runtime binary '/usr/bin/runc'
2021/08/31 15:56:07 Running /usr/bin/nvidia-container-runtime

2021/08/31 15:56:07 No modification required
2021/08/31 15:56:07 Forwarding command to runtime
2021/08/31 15:56:07 Bundle directory path is empty, using working directory.
2021/08/31 15:56:07 Using bundle directory: /
2021/08/31 15:56:07 Using OCI specification file path: /config.json
2021/08/31 15:56:07 Looking for runtime binary 'docker-runc'
2021/08/31 15:56:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/31 15:56:07 Looking for runtime binary 'runc'
2021/08/31 15:56:07 Found runtime binary '/usr/bin/runc'
2021/08/31 15:56:07 Running /usr/bin/nvidia-container-runtime

2021/08/31 15:56:07 No modification required
2021/08/31 15:56:07 Forwarding command to runtime
2021/08/31 15:56:07 Bundle directory path is empty, using working directory.
2021/08/31 15:56:07 Using bundle directory: /
2021/08/31 15:56:07 Using OCI specification file path: /config.json
2021/08/31 15:56:07 Looking for runtime binary 'docker-runc'
2021/08/31 15:56:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/31 15:56:07 Looking for runtime binary 'runc'
2021/08/31 15:56:07 Found runtime binary '/usr/bin/runc'
2021/08/31 15:56:07 Running /usr/bin/nvidia-container-runtime

2021/08/31 15:56:07 No modification required
2021/08/31 15:56:07 Forwarding command to runtime
2021/08/31 15:56:07 Using bundle directory: /run/containerd/io.containerd.runtime.v1.linux/moby/ec776b001c4d50405a2611fbae9524865b2b134adbb75b22a19d57f1859c2ec6
2021/08/31 15:56:07 Using OCI specification file path: /run/containerd/io.containerd.runtime.v1.linux/moby/ec776b001c4d50405a2611fbae9524865b2b134adbb75b22a19d57f1859c2ec6/config.json
2021/08/31 15:56:07 Looking for runtime binary 'docker-runc'
2021/08/31 15:56:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/31 15:56:07 Looking for runtime binary 'runc'
2021/08/31 15:56:07 Found runtime binary '/usr/bin/runc'
2021/08/31 15:56:07 Running /usr/bin/nvidia-container-runtime

2021/08/31 15:56:07 'create' command detected; modification required
2021/08/31 15:56:07 prestart hook path: /usr/bin/nvidia-container-runtime-hook

2021/08/31 15:56:07 Forwarding command to runtime
2021/08/31 15:56:07 Bundle directory path is empty, using working directory.
2021/08/31 15:56:07 Using bundle directory: /run/containerd/io.containerd.runtime.v1.linux/moby/ec776b001c4d50405a2611fbae9524865b2b134adbb75b22a19d57f1859c2ec6
2021/08/31 15:56:07 Using OCI specification file path: /run/containerd/io.containerd.runtime.v1.linux/moby/ec776b001c4d50405a2611fbae9524865b2b134adbb75b22a19d57f1859c2ec6/config.json
2021/08/31 15:56:07 Looking for runtime binary 'docker-runc'
2021/08/31 15:56:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/31 15:56:07 Looking for runtime binary 'runc'
2021/08/31 15:56:07 Found runtime binary '/usr/bin/runc'
2021/08/31 15:56:07 Running /usr/bin/nvidia-container-runtime

2021/08/31 15:56:07 No modification required
2021/08/31 15:56:07 Forwarding command to runtime

The output of /var/log/nvidia-container-toolkit.log:


-- WARNING, the following logs are for debugging purposes only --

I0831 07:56:43.428069 74570 nvc.c:372] initializing library context (version=1.4.0, build=704a698b7a0ceec07a48e56c37365c741718c2df)
I0831 07:56:43.428116 74570 nvc.c:346] using root /
I0831 07:56:43.428125 74570 nvc.c:347] using ldcache /etc/ld.so.cache
I0831 07:56:43.428132 74570 nvc.c:348] using unprivileged user 65534:65534
I0831 07:56:43.428150 74570 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0831 07:56:43.428248 74570 nvc.c:391] dxcore initialization failed, continuing assuming a non-WSL environment
I0831 07:56:43.430218 74574 nvc.c:274] loading kernel module nvidia
I0831 07:56:43.430416 74574 nvc.c:278] running mknod for /dev/nvidiactl
I0831 07:56:43.430451 74574 nvc.c:282] running mknod for /dev/nvidia0
I0831 07:56:43.430474 74574 nvc.c:286] running mknod for all nvcaps in /dev/nvidia-caps
I0831 07:56:43.437363 74574 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0831 07:56:43.437461 74574 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0831 07:56:43.439428 74574 nvc.c:292] loading kernel module nvidia_uvm
I0831 07:56:43.439484 74574 nvc.c:296] running mknod for /dev/nvidia-uvm
I0831 07:56:43.439546 74574 nvc.c:301] loading kernel module nvidia_modeset
I0831 07:56:43.439598 74574 nvc.c:305] running mknod for /dev/nvidia-modeset
I0831 07:56:43.439788 74575 driver.c:101] starting driver service
I0831 07:56:43.442007 74570 nvc_container.c:388] configuring container with 'utility supervised'
I0831 07:56:43.442226 74570 nvc_container.c:408] setting pid to 74521
I0831 07:56:43.442236 74570 nvc_container.c:409] setting rootfs to /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged
I0831 07:56:43.442242 74570 nvc_container.c:410] setting owner to 0:0
I0831 07:56:43.442248 74570 nvc_container.c:411] setting bins directory to /usr/bin
I0831 07:56:43.442254 74570 nvc_container.c:412] setting libs directory to /usr/lib/x86_64-linux-gnu
I0831 07:56:43.442260 74570 nvc_container.c:413] setting libs32 directory to /usr/lib/i386-linux-gnu
I0831 07:56:43.442265 74570 nvc_container.c:414] setting cudart directory to /usr/local/cuda
I0831 07:56:43.442271 74570 nvc_container.c:415] setting ldconfig to @/sbin/ldconfig.real (host relative)
I0831 07:56:43.442277 74570 nvc_container.c:416] setting mount namespace to /proc/74521/ns/mnt
I0831 07:56:43.442283 74570 nvc_container.c:418] setting devices cgroup to /sys/fs/cgroup/devices/system.slice/docker-ec776b001c4d50405a2611fbae9524865b2b134adbb75b22a19d57f1859c2ec6.scope
I0831 07:56:43.442290 74570 nvc_info.c:676] requesting driver information with ''
I0831 07:56:43.443292 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.460.73.01
I0831 07:56:43.443339 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.460.73.01
I0831 07:56:43.443364 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.460.73.01
I0831 07:56:43.443390 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.73.01
I0831 07:56:43.443423 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.460.73.01
I0831 07:56:43.443457 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.73.01
I0831 07:56:43.443481 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.460.73.01
I0831 07:56:43.443504 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.73.01
I0831 07:56:43.443541 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.460.73.01
I0831 07:56:43.443576 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.460.73.01
I0831 07:56:43.443615 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.73.01
I0831 07:56:43.443638 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.460.73.01
I0831 07:56:43.443662 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.460.73.01
I0831 07:56:43.443696 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.460.73.01
I0831 07:56:43.443727 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.460.73.01
I0831 07:56:43.443749 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.73.01
I0831 07:56:43.443772 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.73.01
I0831 07:56:43.443802 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.460.73.01
I0831 07:56:43.443823 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.73.01
I0831 07:56:43.443855 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.460.73.01
I0831 07:56:43.444118 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.460.73.01
I0831 07:56:43.444252 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.460.73.01
I0831 07:56:43.444276 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.460.73.01
I0831 07:56:43.444299 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.460.73.01
I0831 07:56:43.444324 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.460.73.01
W0831 07:56:43.444383 74570 nvc_info.c:350] missing library libnvidia-nscq.so
W0831 07:56:43.444388 74570 nvc_info.c:350] missing library libnvidia-fatbinaryloader.so
W0831 07:56:43.444393 74570 nvc_info.c:350] missing library libvdpau_nvidia.so
W0831 07:56:43.444398 74570 nvc_info.c:354] missing compat32 library libnvidia-ml.so
W0831 07:56:43.444403 74570 nvc_info.c:354] missing compat32 library libnvidia-cfg.so
W0831 07:56:43.444407 74570 nvc_info.c:354] missing compat32 library libnvidia-nscq.so
W0831 07:56:43.444412 74570 nvc_info.c:354] missing compat32 library libcuda.so
W0831 07:56:43.444417 74570 nvc_info.c:354] missing compat32 library libnvidia-opencl.so
W0831 07:56:43.444421 74570 nvc_info.c:354] missing compat32 library libnvidia-ptxjitcompiler.so
W0831 07:56:43.444426 74570 nvc_info.c:354] missing compat32 library libnvidia-fatbinaryloader.so
W0831 07:56:43.444431 74570 nvc_info.c:354] missing compat32 library libnvidia-allocator.so
W0831 07:56:43.444435 74570 nvc_info.c:354] missing compat32 library libnvidia-compiler.so
W0831 07:56:43.444440 74570 nvc_info.c:354] missing compat32 library libnvidia-ngx.so
W0831 07:56:43.444445 74570 nvc_info.c:354] missing compat32 library libvdpau_nvidia.so
W0831 07:56:43.444449 74570 nvc_info.c:354] missing compat32 library libnvidia-encode.so
W0831 07:56:43.444454 74570 nvc_info.c:354] missing compat32 library libnvidia-opticalflow.so
W0831 07:56:43.444459 74570 nvc_info.c:354] missing compat32 library libnvcuvid.so
W0831 07:56:43.444463 74570 nvc_info.c:354] missing compat32 library libnvidia-eglcore.so
W0831 07:56:43.444468 74570 nvc_info.c:354] missing compat32 library libnvidia-glcore.so
W0831 07:56:43.444473 74570 nvc_info.c:354] missing compat32 library libnvidia-tls.so
W0831 07:56:43.444477 74570 nvc_info.c:354] missing compat32 library libnvidia-glsi.so
W0831 07:56:43.444482 74570 nvc_info.c:354] missing compat32 library libnvidia-fbc.so
W0831 07:56:43.444487 74570 nvc_info.c:354] missing compat32 library libnvidia-ifr.so
W0831 07:56:43.444491 74570 nvc_info.c:354] missing compat32 library libnvidia-rtcore.so
W0831 07:56:43.444496 74570 nvc_info.c:354] missing compat32 library libnvoptix.so
W0831 07:56:43.444501 74570 nvc_info.c:354] missing compat32 library libGLX_nvidia.so
W0831 07:56:43.444505 74570 nvc_info.c:354] missing compat32 library libEGL_nvidia.so
W0831 07:56:43.444510 74570 nvc_info.c:354] missing compat32 library libGLESv2_nvidia.so
W0831 07:56:43.444518 74570 nvc_info.c:354] missing compat32 library libGLESv1_CM_nvidia.so
W0831 07:56:43.444523 74570 nvc_info.c:354] missing compat32 library libnvidia-glvkspirv.so
W0831 07:56:43.444528 74570 nvc_info.c:354] missing compat32 library libnvidia-cbl.so
I0831 07:56:43.444737 74570 nvc_info.c:276] selecting /usr/bin/nvidia-smi
I0831 07:56:43.444750 74570 nvc_info.c:276] selecting /usr/bin/nvidia-debugdump
I0831 07:56:43.444764 74570 nvc_info.c:276] selecting /usr/bin/nvidia-persistenced
I0831 07:56:43.444783 74570 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-control
I0831 07:56:43.444796 74570 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-server
W0831 07:56:43.444862 74570 nvc_info.c:376] missing binary nv-fabricmanager
I0831 07:56:43.444880 74570 nvc_info.c:438] listing device /dev/nvidiactl
I0831 07:56:43.444885 74570 nvc_info.c:438] listing device /dev/nvidia-uvm
I0831 07:56:43.444889 74570 nvc_info.c:438] listing device /dev/nvidia-uvm-tools
I0831 07:56:43.444894 74570 nvc_info.c:438] listing device /dev/nvidia-modeset
I0831 07:56:43.444913 74570 nvc_info.c:317] listing ipc /run/nvidia-persistenced/socket
W0831 07:56:43.444929 74570 nvc_info.c:321] missing ipc /var/run/nvidia-fabricmanager/socket
W0831 07:56:43.444940 74570 nvc_info.c:321] missing ipc /tmp/nvidia-mps
I0831 07:56:43.444946 74570 nvc_info.c:733] requesting device information with ''
I0831 07:56:43.450627 74570 nvc_info.c:623] listing device /dev/nvidia0 (GPU-948211b6-df7a-5768-ca7b-a84e23d9404d at 00000000:01:00.0)
I0831 07:56:43.450668 74570 nvc_mount.c:344] mounting tmpfs at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/proc/driver/nvidia
I0831 07:56:43.451065 74570 nvc_mount.c:112] mounting /usr/bin/nvidia-smi at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/usr/bin/nvidia-smi
I0831 07:56:43.451106 74570 nvc_mount.c:112] mounting /usr/bin/nvidia-debugdump at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/usr/bin/nvidia-debugdump
I0831 07:56:43.451141 74570 nvc_mount.c:112] mounting /usr/bin/nvidia-persistenced at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/usr/bin/nvidia-persistenced
I0831 07:56:43.451248 74570 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.73.01 at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.73.01
I0831 07:56:43.451289 74570 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.73.01 at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.73.01
I0831 07:56:43.451380 74570 nvc_mount.c:239] mounting /run/nvidia-persistenced/socket at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/run/nvidia-persistenced/socket
I0831 07:56:43.451422 74570 nvc_mount.c:208] mounting /dev/nvidiactl at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/dev/nvidiactl
I0831 07:56:43.451446 74570 nvc_mount.c:499] whitelisting device node 195:255
I0831 07:56:43.451485 74570 nvc_mount.c:208] mounting /dev/nvidia0 at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/dev/nvidia0
I0831 07:56:43.451536 74570 nvc_mount.c:412] mounting /proc/driver/nvidia/gpus/0000:01:00.0 at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/proc/driver/nvidia/gpus/0000:01:00.0
I0831 07:56:43.451556 74570 nvc_mount.c:499] whitelisting device node 195:0
I0831 07:56:43.451569 74570 nvc_ldcache.c:360] executing /sbin/ldconfig.real from host at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged
I0831 07:56:43.478246 74570 nvc.c:423] shutting down library context
I0831 07:56:43.478873 74575 driver.c:163] terminating driver service
I0831 07:56:43.479198 74570 driver.c:203] driver service terminated successfully
Mr-Linus commented 2 years ago

@davidho27941 I see from your description that you are installing version 1.0.0-beta4 of the device plugin:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml

The versioning of the NVIDIA Device plugin is inconsistent in that v0.9.0 is the latest release: https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.9.0

Could you see whether using this (or one of the more recent releases) addresses your issue?

Fixed, I updated the image version to 1.0.0-beta4 and that solved the problem. Thx.

elezar commented 2 years ago

@Mr-Linus note that 1.0.0-beta4 is not supported and v0.10.0 is the latest release. If you are experiencing problems with this release we should try to determine why this is.
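
For reference, recent releases can be deployed either with the static manifest or via helm (commands as documented in the project README; double-check the tag/version you want):

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.10.0/nvidia-device-plugin.yml

or

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace kube-system --version 0.10.0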

Mr-Linus commented 2 years ago

@Mr-Linus note that 1.0.0-beta4 is not supported and v0.10.0 is the latest release. If you are experiencing problems with this release we should try to determine why this is.

👌🏻 Switched to v0.10.0 and it works fine.

luckyycode commented 2 years ago

Is there a way to run nvidia-container-runtime with io.containerd.runc.v2 instead of v1? I am getting the same error as the OP and have tried different versions of the k8s device plugin. The GPU on the host node works fine and nvidia-smi shows its info.

elezar commented 2 years ago

Is there a way to run nvidia-container-runtime with io.containerd.runc.v2 instead of v1? I am getting the same error as the OP and have tried different versions of the k8s device plugin. The GPU on the host node works fine and nvidia-smi shows its info.

@luckyycode this seems like an unrelated issue to this thread.

Note that from the config in https://github.com/NVIDIA/k8s-device-plugin/issues/263#issuecomment-908247909 we see:

     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            SystemdCgroup = true
            BinaryName="/usr/bin/nvidia-container-runtime"

indicating the use of the v2 shim.
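
For completeness, defining the nvidia runtime entry alone is usually not enough; the CRI plugin also needs to be told to use it by default (or the pods need a matching RuntimeClass). A sketch of the relevant containerd key, to be verified against your containerd version:

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

followed by a restart of containerd (sudo systemctl restart containerd).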

It may be more useful to create a new ticket describing the behaviour that you see and including any relevant k8s or containerd information and logs.

elezar commented 2 years ago

@davidho27941 were you able to resolve your original issue?

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.