NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Pod stuck 'pending', nvidia-device-plugin consuming 100% CPU #301

Open · yankcrime opened this issue 2 years ago

yankcrime commented 2 years ago

1. Issue or feature description

nvidia-device-plugin sits at 100% CPU when a new Pod with a GPU requirement is scheduled. The Pod is stuck as 'Pending', with no further failure or error from either the container or Kubernetes itself.

Commands on the host such as nvidia-smi work prior to scheduling a Pod with a GPU requirement. Once this behaviour is triggered, I'm no longer able to run such commands until the host is rebooted.
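
A couple of host-side checks (a suggested sketch, not output captured from the affected node) can help confirm whether things are wedged at the driver/kernel level once this happens:

# Processes stuck in uninterruptible sleep ('D' state) would explain why
# nvidia-smi never returns until the host is rebooted.
$ ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /D/'

# Kernel log around the time the Pod was scheduled; NVRM / Xid messages
# usually point at a driver-level problem.
$ dmesg -T | grep -iE 'nvrm|xid'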

2. Steps to reproduce the issue

The Kubernetes cluster is K3s, version v1.22.9+k3s1. The cluster has seven nodes: three servers, three workers, and a fourth worker with a pair of A100 GPUs. All nodes run Ubuntu 20.04 with kernel 5.4.0-109-generic. They're virtual machines, with the GPU node given its GPUs via PCI pass-through (see the nvidia-smi output below).

The GPU node has nvidia-container-toolkit version 1.9.0-1 installed along with nvidia-driver-470-server version 470.103.01-0ubuntu0.20.04.1.

Once the cluster is up, NFD is deployed with kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.11.0"
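
A quick sanity check (not in the original report) that NFD applied the expected PCI label to the GPU node:

$ kubectl get nodes -l feature.node.kubernetes.io/pci-0302_10de.present=true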

With NFD in place, the device plugin is installed by templating the Helm chart and adding a nodeSelector for the PCI device label corresponding to the node that has the GPUs (a helm template sketch follows the manifest):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: release-name-nvidia-device-plugin
  namespace: kube-system
  labels:
    helm.sh/chart: nvidia-device-plugin-0.11.0
    app.kubernetes.io/name: nvidia-device-plugin
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "0.11.0"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-device-plugin
      app.kubernetes.io/instance: release-name
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        app.kubernetes.io/name: nvidia-device-plugin
        app.kubernetes.io/instance: release-name
    spec:
      priorityClassName: "system-node-critical"
      nodeSelector:
        feature.node.kubernetes.io/pci-0302_10de.present: "true"
      runtimeClassName: nvidia
      securityContext:
        {}
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.11.0
        imagePullPolicy: IfNotPresent
        name: nvidia-device-plugin-ctr
        args:
        - "--mig-strategy=none"
        - "--pass-device-specs=false"
        - "--fail-on-init-error=true"
        - "--device-list-strategy=envvar"
        - "--device-id-strategy=uuid"
        - "--nvidia-driver-root=/"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
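
For reference, the manifest above can be reproduced with something along these lines (assuming the chart repo has been added as nvdp; the values file contents are an approximation):

$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ cat values.yaml
runtimeClassName: nvidia
nodeSelector:
  feature.node.kubernetes.io/pci-0302_10de.present: "true"
$ helm template release-name nvdp/nvidia-device-plugin \
    --namespace kube-system --version 0.11.0 -f values.yaml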

Once the plugin has deployed, the node is successfully updated to reflect the available GPUs:

$ kubectl get node sandbox-worker-gpu-instance1 -o jsonpath="{.status.allocatable}"
{"cpu":"32","ephemeral-storage":"39369928059","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"263940608Ki","nvidia.com/gpu":"2","pods":"110"}

Attempting to deploy a test Pod that targets this node then triggers the problem:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 2

$ kubectl get pods
NAME              READY   STATUS    RESTARTS   AGE
cuda-vector-add   0/1     Pending   0          2m45s
$ kubectl describe pod cuda-vector-add
Name:         cuda-vector-add
Namespace:    default
Priority:     0
Node:         sandbox-worker-gpu-instance1/
Labels:       <none>
Annotations:  <none>
Status:       Pending
IP:
IPs:          <none>
Containers:
  cuda-vector-add:
    Image:      k8s.gcr.io/cuda-vector-add:v0.1
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  2
    Requests:
      nvidia.com/gpu:  2
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nnnzt (ro)
Conditions:
  Type           Status
  PodScheduled   True
Volumes:
  kube-api-access-nnnzt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  2m56s  default-scheduler  Successfully assigned default/cuda-vector-add to sandbox-worker-gpu-instance1

$ kubectl get events | head -3
LAST SEEN   TYPE      REASON                    OBJECT                              MESSAGE
51m         Normal    Scheduled                 pod/cuda-vector-add                 Successfully assigned default/cuda-vector-add to sandbox-worker-gpu-instance1
4m18s       Normal    Scheduled                 pod/cuda-vector-add                 Successfully assigned default/cuda-vector-add to sandbox-worker-gpu-instance1

There are no additional logs from the nvidia-device-plugin container.
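
Since the plugin emits nothing further once it starts spinning, these are roughly the checks performed at that point (commands reconstructed, not verbatim):

# Confirm the device plugin process is the one pinning a core.
$ top -b -n 1 | grep -i nvidia

# The kubelet registration socket should still be present while it spins.
$ ls -l /var/lib/kubelet/device-plugins/

# Plugin logs via the DaemonSet created above.
$ kubectl -n kube-system logs ds/release-name-nvidia-device-plugin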

3. Information to attach (optional if deemed irrelevant)

Common error checking. First, the full nvidia-smi query output from the GPU node:

==============NVSMI LOG==============

Timestamp                                 : Thu May  5 12:29:57 2022
Driver Version                            : 470.103.01
CUDA Version                              : 11.4

Attached GPUs                             : 2
GPU 00000000:00:05.0
    Product Name                          : NVIDIA A100-SXM4-80GB
    Product Brand                         : NVIDIA
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1560221017003
    GPU UUID                              : GPU-6d777e6b-bc4d-212a-301c-82001966b4f0
    Minor Number                          : 0
    VBIOS Version                         : 92.00.36.00.10
    MultiGPU Board                        : No
    Board ID                              : 0x5
    GPU Part Number                       : 692-2G506-0212-002
    Module ID                             : 2
    Inforom Version
        Image Version                     : G506.0212.00.01
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Pass-Through
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x00
        Device                            : 0x05
        Domain                            : 0x0000
        Device Id                         : 0x20B210DE
        Bus Id                            : 00000000:00:05.0
        Sub System Id                     : 0x147F10DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 81251 MiB
        Used                              : 0 MiB
        Free                              : 81251 MiB
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 640 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 37 C
        GPU Shutdown Temp                 : 92 C
        GPU Slowdown Temp                 : 89 C
        GPU Max Operating Temp            : 85 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 54 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 73.30 W
        Power Limit                       : 500.00 W
        Default Power Limit               : 500.00 W
        Enforced Power Limit              : 500.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 500.00 W
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 1593 MHz
        Video                             : 585 MHz
    Applications Clocks
        Graphics                          : 1275 MHz
        Memory                            : 1593 MHz
    Default Applications Clocks
        Graphics                          : 1275 MHz
        Memory                            : 1593 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1593 MHz
        Video                             : 1290 MHz
    Max Customer Boost Clocks
        Graphics                          : 1410 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 743.750 mV
    Processes                             : None

GPU 00000000:00:06.0
    Product Name                          : NVIDIA A100-SXM4-80GB
    Product Brand                         : NVIDIA
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1560221017105
    GPU UUID                              : GPU-1df39da7-ba1a-c950-de6d-394162582846
    Minor Number                          : 1
    VBIOS Version                         : 92.00.36.00.10
    MultiGPU Board                        : No
    Board ID                              : 0x6
    GPU Part Number                       : 692-2G506-0212-002
    Module ID                             : 3
    Inforom Version
        Image Version                     : G506.0212.00.01
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Pass-Through
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x00
        Device                            : 0x06
        Domain                            : 0x0000
        Device Id                         : 0x20B210DE
        Bus Id                            : 00000000:00:06.0
        Sub System Id                     : 0x147F10DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 81251 MiB
        Used                              : 0 MiB
        Free                              : 81251 MiB
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 640 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 31 C
        GPU Shutdown Temp                 : 92 C
        GPU Slowdown Temp                 : 89 C
        GPU Max Operating Temp            : 85 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 48 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 72.09 W
        Power Limit                       : 500.00 W
        Default Power Limit               : 500.00 W
        Enforced Power Limit              : 500.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 500.00 W
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 1593 MHz
        Video                             : 585 MHz
    Applications Clocks
        Graphics                          : 1275 MHz
        Memory                            : 1593 MHz
    Default Applications Clocks
        Graphics                          : 1275 MHz
        Memory                            : 1593 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1593 MHz
        Video                             : 1290 MHz
    Max Customer Boost Clocks
        Graphics                          : 1410 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 750.000 mV
    Processes                             : None

The k3s containerd configuration on the GPU node:

[plugins.opt]
  path = "/var/lib/rancher/k3s/agent/containerd"

[plugins.cri]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins.cri.containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins.cri.cni]
  bin_dir = "/var/lib/rancher/k3s/data/995f5a281daabc1838b33f2346f7c4976b95f449c703b6f1f55b981966eba456/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"

[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins.cri.containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
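
The DaemonSet above sets runtimeClassName: nvidia, which requires a RuntimeClass whose handler matches the "nvidia" runtime defined in the containerd config above. A minimal sketch of that object (assumed, not included verbatim in the original report):

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia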

Logs from the nvidia-device-plugin container:

2022/05/05 12:29:36 Loading NVML
2022/05/05 12:29:45 Starting FS watcher.
2022/05/05 12:29:45 Starting OS watcher.
2022/05/05 12:29:45 Retreiving plugins.
2022/05/05 12:29:45 Starting GRPC server for 'nvidia.com/gpu'
2022/05/05 12:29:45 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/05/05 12:29:45 Registered device plugin for 'nvidia.com/gpu' with Kubelet

Kubelet (k3s agent) logs from the GPU node:

May 05 12:29:18 sandbox-worker-gpu-instance1 k3s[1106]: time="2022-05-05T12:29:18Z" level=info msg="Running kubelet --address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=cgroupfs --client-ca-file=/var/lib/rancher/k3s/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --cni-bin-dir=/var/lib/rancher/k3s/data/995f5a281daabc1838b33f2346f7c4976b95f449c703b6f1f55b981966eba456/bin --cni-conf-dir=/var/lib/rancher/k3s/agent/etc/cni/net.d --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --container-runtime=remote --containerd=/run/k3s/containerd/containerd.sock --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --healthz-bind-address=127.0.0.1 --hostname-override=sandbox-worker-gpu-instance1 --kubeconfig=/var/lib/rancher/k3s/agent/kubelet.kubeconfig --node-labels= --pod-manifest-path=/var/lib/rancher/k3s/agent/pod-manifests --read-only-port=0 --resolv-conf=/run/systemd/resolve/resolv.conf --serialize-image-pulls=false --tls-cert-file=/var/lib/rancher/k3s/agent/serving-kubelet.crt --tls-private-key-file=/var/lib/rancher/k3s/agent/serving-kubelet.key"
May 05 12:29:18 sandbox-worker-gpu-instance1 k3s[1106]: Flag --cloud-provider has been deprecated, will be removed in 1.23, in favor of removing cloud provider code from Kubelet.
May 05 12:29:18 sandbox-worker-gpu-instance1 k3s[1106]: Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.
May 05 12:29:18 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:18.439787    1106 server.go:436] "Kubelet version" kubeletVersion="v1.22.9+k3s1"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.478082    1106 container_manager_linux.go:285] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: ContainerRuntime:remote CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:cgroupfs KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>} {Signal:imagefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>}]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalCPUManagerPolicyOptions:map[] ExperimentalTopologyManagerScope:container ExperimentalCPUManagerReconcilePeriod:10s ExperimentalMemoryManagerPolicy:None ExperimentalMemoryManagerReservedMemory:[] ExperimentalPodPidsLimit:-1 EnforceCPULimits:true CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none}
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.479668    1106 kubelet.go:418] "Attempting to sync node with API server"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.480113    1106 kubelet.go:279] "Adding static pod path" path="/var/lib/rancher/k3s/agent/pod-manifests"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.480135    1106 kubelet.go:290] "Adding apiserver pod source"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.485114    1106 server.go:1213] "Started kubelet"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.486909    1106 volume_manager.go:291] "Starting Kubelet Volume Manager"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: E0505 12:29:23.487046    1106 kubelet.go:1343] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.499744    1106 server.go:409] "Adding debug handlers to kubelet server"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.533108    1106 kubelet_network_linux.go:56] "Initialized protocol iptables rules." protocol=IPv4
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.547350    1106 kubelet_network_linux.go:56] "Initialized protocol iptables rules." protocol=IPv6
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.547407    1106 kubelet.go:2006] "Starting kubelet main sync loop"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: E0505 12:29:23.547483    1106 kubelet.go:2030] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.568188    1106 plugin_manager.go:114] "Starting Kubelet Plugin Manager"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.588528    1106 kubelet_network.go:76] "Updating Pod CIDR" originalPodCIDR="" newPodCIDR="10.42.6.0/24"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.590636    1106 kubelet_node_status.go:71] "Attempting to register node" node="sandbox-worker-gpu-instance1"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.603663    1106 kubelet_node_status.go:109] "Node was previously registered" node="sandbox-worker-gpu-instance1"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.603744    1106 kubelet_node_status.go:74] "Successfully registered node" node="sandbox-worker-gpu-instance1"

libnvidia-container (nvidia-container-cli) debug logs from container setup:

I0505 13:17:23.667007 3793 rpc.c:71] starting nvcgo rpc service
I0505 13:17:23.668270 3723 nvc_container.c:240] configuring container with 'utility supervised'
I0505 13:17:23.669843 3723 nvc_container.c:262] setting pid to 3717
I0505 13:17:23.669861 3723 nvc_container.c:263] setting rootfs to /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs
I0505 13:17:23.669867 3723 nvc_container.c:264] setting owner to 0:0
I0505 13:17:23.669876 3723 nvc_container.c:265] setting bins directory to /usr/bin
I0505 13:17:23.669881 3723 nvc_container.c:266] setting libs directory to /usr/lib/x86_64-linux-gnu
I0505 13:17:23.669890 3723 nvc_container.c:267] setting libs32 directory to /usr/lib/i386-linux-gnu
I0505 13:17:23.669899 3723 nvc_container.c:268] setting cudart directory to /usr/local/cuda
I0505 13:17:23.669906 3723 nvc_container.c:269] setting ldconfig to @/sbin/ldconfig.real (host relative)
I0505 13:17:23.669912 3723 nvc_container.c:270] setting mount namespace to /proc/3717/ns/mnt
I0505 13:17:23.669919 3723 nvc_container.c:272] detected cgroupv1
I0505 13:17:23.669926 3723 nvc_container.c:273] setting devices cgroup to /sys/fs/cgroup/devices/kubepods/besteffort/pod9efe2b12-595a-4c81-aa50-591805794679/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282
I0505 13:17:23.669939 3723 nvc_info.c:765] requesting driver information with ''
I0505 13:17:23.671291 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.470.103.01
I0505 13:17:23.671348 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.103.01
I0505 13:17:23.671384 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.470.103.01
I0505 13:17:23.671420 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.103.01
I0505 13:17:23.671463 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.470.103.01
I0505 13:17:23.671505 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.103.01
I0505 13:17:23.671538 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.470.103.01
I0505 13:17:23.671572 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.103.01
I0505 13:17:23.671612 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.470.103.01
I0505 13:17:23.671654 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.470.103.01
I0505 13:17:23.671728 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.470.103.01
I0505 13:17:23.671761 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.470.103.01
I0505 13:17:23.671796 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.470.103.01
I0505 13:17:23.671842 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.103.01
I0505 13:17:23.671884 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.470.103.01
I0505 13:17:23.671920 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.103.01
I0505 13:17:23.671954 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.103.01
I0505 13:17:23.671994 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.470.103.01
I0505 13:17:23.672030 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.103.01
I0505 13:17:23.672073 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.470.103.01
I0505 13:17:23.672172 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.470.103.01
I0505 13:17:23.672247 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.470.103.01
I0505 13:17:23.672286 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.470.103.01
I0505 13:17:23.672324 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.470.103.01
I0505 13:17:23.672359 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.470.103.01
W0505 13:17:23.672381 3723 nvc_info.c:398] missing library libnvidia-nscq.so
W0505 13:17:23.672394 3723 nvc_info.c:398] missing library libnvidia-fatbinaryloader.so
W0505 13:17:23.672401 3723 nvc_info.c:398] missing library libnvidia-pkcs11.so
W0505 13:17:23.672408 3723 nvc_info.c:398] missing library libvdpau_nvidia.so
W0505 13:17:23.672415 3723 nvc_info.c:402] missing compat32 library libnvidia-ml.so
W0505 13:17:23.672421 3723 nvc_info.c:402] missing compat32 library libnvidia-cfg.so
W0505 13:17:23.672428 3723 nvc_info.c:402] missing compat32 library libnvidia-nscq.so
W0505 13:17:23.672434 3723 nvc_info.c:402] missing compat32 library libcuda.so
W0505 13:17:23.672443 3723 nvc_info.c:402] missing compat32 library libnvidia-opencl.so
W0505 13:17:23.672450 3723 nvc_info.c:402] missing compat32 library libnvidia-ptxjitcompiler.so
W0505 13:17:23.672456 3723 nvc_info.c:402] missing compat32 library libnvidia-fatbinaryloader.so
W0505 13:17:23.672464 3723 nvc_info.c:402] missing compat32 library libnvidia-allocator.so
W0505 13:17:23.672469 3723 nvc_info.c:402] missing compat32 library libnvidia-compiler.so
W0505 13:17:23.672474 3723 nvc_info.c:402] missing compat32 library libnvidia-pkcs11.so
W0505 13:17:23.672480 3723 nvc_info.c:402] missing compat32 library libnvidia-ngx.so
W0505 13:17:23.672486 3723 nvc_info.c:402] missing compat32 library libvdpau_nvidia.so
W0505 13:17:23.672490 3723 nvc_info.c:402] missing compat32 library libnvidia-encode.so
W0505 13:17:23.672496 3723 nvc_info.c:402] missing compat32 library libnvidia-opticalflow.so
W0505 13:17:23.672502 3723 nvc_info.c:402] missing compat32 library libnvcuvid.so
W0505 13:17:23.672508 3723 nvc_info.c:402] missing compat32 library libnvidia-eglcore.so
W0505 13:17:23.672513 3723 nvc_info.c:402] missing compat32 library libnvidia-glcore.so
W0505 13:17:23.672519 3723 nvc_info.c:402] missing compat32 library libnvidia-tls.so
W0505 13:17:23.672525 3723 nvc_info.c:402] missing compat32 library libnvidia-glsi.so
W0505 13:17:23.672529 3723 nvc_info.c:402] missing compat32 library libnvidia-fbc.so
W0505 13:17:23.672535 3723 nvc_info.c:402] missing compat32 library libnvidia-ifr.so
W0505 13:17:23.672541 3723 nvc_info.c:402] missing compat32 library libnvidia-rtcore.so
W0505 13:17:23.672548 3723 nvc_info.c:402] missing compat32 library libnvoptix.so
W0505 13:17:23.672554 3723 nvc_info.c:402] missing compat32 library libGLX_nvidia.so
W0505 13:17:23.672569 3723 nvc_info.c:402] missing compat32 library libEGL_nvidia.so
W0505 13:17:23.672575 3723 nvc_info.c:402] missing compat32 library libGLESv2_nvidia.so
W0505 13:17:23.672581 3723 nvc_info.c:402] missing compat32 library libGLESv1_CM_nvidia.so
W0505 13:17:23.672587 3723 nvc_info.c:402] missing compat32 library libnvidia-glvkspirv.so
W0505 13:17:23.672593 3723 nvc_info.c:402] missing compat32 library libnvidia-cbl.so
I0505 13:17:23.672920 3723 nvc_info.c:298] selecting /usr/bin/nvidia-smi
I0505 13:17:23.672941 3723 nvc_info.c:298] selecting /usr/bin/nvidia-debugdump
I0505 13:17:23.672957 3723 nvc_info.c:298] selecting /usr/bin/nvidia-persistenced
I0505 13:17:23.672981 3723 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-control
I0505 13:17:23.672998 3723 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-server
W0505 13:17:23.673094 3723 nvc_info.c:424] missing binary nv-fabricmanager
I0505 13:17:23.673120 3723 nvc_info.c:342] listing firmware path /usr/lib/firmware/nvidia/470.103.01/gsp.bin
I0505 13:17:23.673145 3723 nvc_info.c:528] listing device /dev/nvidiactl
I0505 13:17:23.673157 3723 nvc_info.c:528] listing device /dev/nvidia-uvm
I0505 13:17:23.673164 3723 nvc_info.c:528] listing device /dev/nvidia-uvm-tools
I0505 13:17:23.673170 3723 nvc_info.c:528] listing device /dev/nvidia-modeset
I0505 13:17:23.673191 3723 nvc_info.c:342] listing ipc path /run/nvidia-persistenced/socket
W0505 13:17:23.673211 3723 nvc_info.c:348] missing ipc path /var/run/nvidia-fabricmanager/socket
W0505 13:17:23.673230 3723 nvc_info.c:348] missing ipc path /tmp/nvidia-mps
I0505 13:17:23.673244 3723 nvc_info.c:821] requesting device information with ''
I0505 13:17:23.683518 3723 nvc_info.c:712] listing device /dev/nvidia0 (GPU-6d777e6b-bc4d-212a-301c-82001966b4f0 at 00000000:00:05.0)
I0505 13:17:23.689564 3723 nvc_info.c:712] listing device /dev/nvidia1 (GPU-1df39da7-ba1a-c950-de6d-394162582846 at 00000000:00:06.0)
I0505 13:17:23.689653 3723 nvc_mount.c:366] mounting tmpfs at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/proc/driver/nvidia
I0505 13:17:23.690086 3723 nvc_mount.c:134] mounting /usr/bin/nvidia-smi at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/usr/bin/nvidia-smi
I0505 13:17:23.690137 3723 nvc_mount.c:134] mounting /usr/bin/nvidia-debugdump at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/usr/bin/nvidia-debugdump
I0505 13:17:23.690178 3723 nvc_mount.c:134] mounting /usr/bin/nvidia-persistenced at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/usr/bin/nvidia-persistenced
I0505 13:17:23.690289 3723 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.103.01 at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.103.01
I0505 13:17:23.690333 3723 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.103.01 at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.103.01
I0505 13:17:23.690448 3723 nvc_mount.c:85] mounting /usr/lib/firmware/nvidia/470.103.01/gsp.bin at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/lib/firmware/nvidia/470.103.01/gsp.bin with flags 0x7
I0505 13:17:23.690520 3723 nvc_mount.c:261] mounting /run/nvidia-persistenced/socket at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/run/nvidia-persistenced/socket
I0505 13:17:23.690565 3723 nvc_mount.c:230] mounting /dev/nvidiactl at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/dev/nvidiactl
I0505 13:17:23.690853 3723 nvc_mount.c:230] mounting /dev/nvidia0 at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/dev/nvidia0
I0505 13:17:23.690914 3723 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:00:05.0 at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/proc/driver/nvidia/gpus/0000:00:05.0
I0505 13:17:23.691008 3723 nvc_mount.c:230] mounting /dev/nvidia1 at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/dev/nvidia1
I0505 13:17:23.691062 3723 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:00:06.0 at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/proc/driver/nvidia/gpus/0000:00:06.0
I0505 13:17:23.691150 3723 nvc_ldcache.c:372] executing /sbin/ldconfig.real from host at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs
I0505 13:17:23.720418 3723 nvc.c:430] shutting down library context
I0505 13:17:23.720546 3793 rpc.c:95] terminating nvcgo rpc service
I0505 13:17:23.721419 3723 rpc.c:135] nvcgo rpc service terminated successfully
I0505 13:17:24.227416 3730 rpc.c:95] terminating driver rpc service
I0505 13:17:24.227730 3723 rpc.c:135] driver rpc service terminated successfully

nvidia-container-runtime logs:

2022/05/05 12:52:07 Using bundle directory:
2022/05/05 12:52:07 Using OCI specification file path: config.json
2022/05/05 12:52:07 Looking for runtime binary 'docker-runc'
2022/05/05 12:52:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2022/05/05 12:52:07 Looking for runtime binary 'runc'
2022/05/05 12:52:07 Found runtime binary '/var/lib/rancher/k3s/data/995f5a281daabc1838b33f2346f7c4976b95f449c703b6f1f55b981966eba456/bin/runc'
2022/05/05 12:52:07 Running /usr/bin/nvidia-container-runtime

2022/05/05 12:52:07 No modification required
2022/05/05 12:52:07 Forwarding command to runtime
TornjV commented 9 months ago

Did you manage to find a solution for this one, @yankcrime?

yankcrime commented 9 months ago

@TornjV Not directly, no. The problem eventually went away after some combination of OS and driver updates, and I've not had it recur since.

TornjV commented 9 months ago

Same situation here, although not knowing why it happened, or whether it could happen again, is a bit annoying 😄

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.