NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

OutOfnvidia.com/gpu error appeared after "Instance terminated during maintenance" operation in GCE. "nvidia-container-cli: device error: unknown device id" #98

Closed dkozlov closed 8 months ago

dkozlov commented 5 years ago

1. Issue or feature description

The OutOfnvidia.com/gpu error appeared when a node running https://github.com/NVIDIA/k8s-device-plugin was restarted in GCE after an "Instance terminated during maintenance" operation appeared on the https://console.cloud.google.com/compute/operations page. The GPU device ID changed when the instance was terminated during maintenance, so the pod could not be started.

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=GPU-3187374f-4d6c-93e6-9c79-c5567b3d9202 --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 --pid=5230 /var/lib/docker/overlay2/1319de685e623721c29cb7d6fe75ccca064c2be4b791952a545eab84829d5d83/merged]\\\\nnvidia-container-cli: device error: unknown device id: GPU-3187374f-4d6c-93e6-9c79-c5567b3d9202\\\\n\\\"\"": unknown
      Exit Code:    128
      Started:      Sat, 02 Mar 2019 06:59:44 +0000
      Finished:     Sat, 02 Mar 2019 06:59:44 +0000
    Ready:          False
    Restart Count:  569

It seems that the device ID changed after the node termination (the GPU was replaced during maintenance).

2. Steps to reproduce the issue

Install Kubernetes on GCE instances, install NVIDIA/k8s-device-plugin, and create a pod requesting the nvidia.com/gpu resource. Wait for an "Instance terminated during maintenance" operation to appear on the https://console.cloud.google.com/compute/operations page, wait for the GPU node to restart, then check the pod status.

Common error checking:

```
==============NVSMI LOG==============

Timestamp       : Sat Mar 2 12:57:28 2019
Driver Version  : 410.79
CUDA Version    : 10.0

Attached GPUs   : 1
GPU 00000000:00:04.0
    Product Name : Tesla K80   Product Brand : Tesla
    Display Mode : Disabled   Display Active : Disabled   Persistence Mode : Disabled
    Accounting Mode : Disabled   Accounting Mode Buffer Size : 4000
    Driver Model        Current : N/A   Pending : N/A
    Serial Number : 0320617086523
    GPU UUID : GPU-e05bb4a9-1b8c-1bb6-a3bc-59acddd70f54
    Minor Number : 0   VBIOS Version : 80.21.25.00.01
    MultiGPU Board : No   Board ID : 0x4   GPU Part Number : 900-22080-6300-001
    Inforom Version     Image Version : 2080.0200.00.04   OEM Object : 1.1   ECC Object : 3.0   Power Management Object : N/A
    GPU Operation Mode  Current : N/A   Pending : N/A
    GPU Virtualization Mode     Virtualization mode : Pass-Through
    IBMNPU              Relaxed Ordering Mode : N/A
    PCI                 Bus : 0x00   Device : 0x04   Domain : 0x0000   Device Id : 0x102D10DE   Bus Id : 00000000:00:04.0   Sub System Id : 0x106C10DE
    GPU Link Info       PCIe Generation Max : 3   Current : 3   Link Width Max : 16x   Current : 16x
    Bridge Chip         Type : N/A   Firmware : N/A
    Replays since reset : 0   Tx Throughput : N/A   Rx Throughput : N/A
    Fan Speed : N/A   Performance State : P0
    Clocks Throttle Reasons     Idle : Not Active   Applications Clocks Setting : Active   SW Power Cap : Not Active   HW Slowdown : Not Active   HW Thermal Slowdown : N/A   HW Power Brake Slowdown : N/A   Sync Boost : Not Active   SW Thermal Slowdown : Not Active   Display Clock Setting : Not Active
    FB Memory Usage     Total : 11441 MiB   Used : 6325 MiB   Free : 5116 MiB
    BAR1 Memory Usage   Total : 16384 MiB   Used : 2 MiB   Free : 16382 MiB
    Compute Mode : Default
    Utilization         Gpu : 0 %   Memory : 0 %   Encoder : 0 %   Decoder : 0 %
    Encoder Stats       Active Sessions : 0   Average FPS : 0   Average Latency : 0
    FBC Stats           Active Sessions : 0   Average FPS : 0   Average Latency : 0
    Ecc Mode            Current : Enabled   Pending : Enabled
    ECC Errors
        Volatile   Single Bit   Device Memory : 0   Register File : 0   L1 Cache : 0   L2 Cache : 0   Texture Memory : 0   Texture Shared : N/A   CBU : N/A   Total : 0
        Volatile   Double Bit   Device Memory : 0   Register File : 0   L1 Cache : 0   L2 Cache : 0   Texture Memory : 0   Texture Shared : N/A   CBU : N/A   Total : 0
        Aggregate  Single Bit   Device Memory : 0   Register File : 0   L1 Cache : 0   L2 Cache : 0   Texture Memory : 0   Texture Shared : N/A   CBU : N/A   Total : 0
        Aggregate  Double Bit   Device Memory : 0   Register File : 0   L1 Cache : 0   L2 Cache : 0   Texture Memory : 0   Texture Shared : N/A   CBU : N/A   Total : 0
    Retired Pages       Single Bit ECC : 0   Double Bit ECC : 0   Pending : No
    Temperature         GPU Current Temp : 33 C   GPU Shutdown Temp : 93 C   GPU Slowdown Temp : 88 C   GPU Max Operating Temp : N/A   Memory Current Temp : N/A   Memory Max Operating Temp : N/A
    Power Readings      Power Management : Supported   Power Draw : 62.27 W   Power Limit : 149.00 W   Default Power Limit : 149.00 W   Enforced Power Limit : 149.00 W   Min Power Limit : 100.00 W   Max Power Limit : 175.00 W
    Clocks              Graphics : 771 MHz   SM : 771 MHz   Memory : 2505 MHz   Video : 540 MHz
    Applications Clocks Graphics : 562 MHz   Memory : 2505 MHz
    Default Applications Clocks   Graphics : 562 MHz   Memory : 2505 MHz
    Max Clocks          Graphics : 875 MHz   SM : 875 MHz   Memory : 2505 MHz   Video : 540 MHz
    Max Customer Boost Clocks     Graphics : N/A
    Clock Policy        Auto Boost : On   Auto Boost Default : On
    Processes           Process ID : 18340   Type : C   Name : test   Used GPU Memory : 6312 MiB
```

 - [x] Your docker configuration file (e.g: `/etc/docker/daemon.json`)
```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

 - [x] The k8s-device-plugin container logs

```
2019/03/02 05:23:55 Registered device plugin with Kubelet
2019/03/02 06:11:23 Received signal "terminated", shutting down.
2019/03/02 06:11:27 Shutdown of NVML returned: <nil>
```

```
Server: Docker Engine - Community
 Engine:
  Version:          18.09.2
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.6
  Git commit:       6247962
  Built:            Sun Feb 10 03:42:13 2019
  OS/Arch:          linux/amd64
  Experimental:     false
```

 - [x] Kernel version from `uname -a`

Linux prod-feb-slave-gpu-instance-0 4.15.0-1027-gcp #28-Ubuntu SMP Tue Jan 15 12:29:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

 - [x] Any relevant kernel output lines from `dmesg`
```
dmesg | grep nvid
[    1.604147] nvidia: loading out-of-tree module taints kernel.
[    1.605140] nvidia: module license 'NVIDIA' taints kernel.
[    1.613475] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    1.623530] nvidia-nvlink: Nvlink Core is being initialized, major device number 243
[    1.660573] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  410.79  Thu Nov 15 10:39:32 CST 2018
[    1.664385] [drm] [nvidia-drm] [GPU ID 0x00000004] Loading driver
[    1.665424] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:04.0 on minor 0
[    5.583049] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 241
```

dkozlov commented 5 years ago

https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/a1a84791883d8df6f00197efbc7d09a4eeeb9fea/pkg/gpu/nvidia/manager.go#L84

https://github.com/NVIDIA/k8s-device-plugin/blob/b06bf4828f8fee1d6a6d8e2a43f37ac1e1c31bfe/nvidia.go#L30

Probably using ID: nvidia0 instead of ID: GPU-3187374f-4d6c-93e6-9c79-c5567b3d9202 would solve this problem.
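
For illustration, a minimal sketch of that idea, assuming the standard kubelet device-plugin API (`k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1`); the function names and the /dev scan are my own, not code from this repository. The plugin would advertise devices by their node-local file name (nvidia0, nvidia1, ...), which stays the same across a maintenance-triggered GPU swap, instead of by NVML UUID:

```go
// Sketch only: advertise GPUs to the kubelet keyed by /dev/nvidiaN name
// rather than by GPU UUID, similar to container-engine-accelerators' manager.go.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"regexp"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

var nvidiaDeviceRE = regexp.MustCompile(`^nvidia[0-9]+$`)

// discoverDevices lists /dev/nvidiaN entries and returns them as device-plugin
// devices whose ID is the stable device file name, not the GPU UUID.
func discoverDevices() ([]*pluginapi.Device, error) {
	entries, err := os.ReadDir("/dev")
	if err != nil {
		return nil, err
	}
	var devs []*pluginapi.Device
	for _, e := range entries {
		if nvidiaDeviceRE.MatchString(e.Name()) {
			devs = append(devs, &pluginapi.Device{
				ID:     e.Name(), // e.g. "nvidia0", unchanged after a GPU replacement
				Health: pluginapi.Healthy,
			})
		}
	}
	return devs, nil
}

func main() {
	devs, err := discoverDevices()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, d := range devs {
		fmt.Println(filepath.Join("/dev", d.ID))
	}
}
```

As noted in the next comment, the downside of such relative device names is that topology-aware allocation becomes harder.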

RenaudWasTaken commented 5 years ago

I'll look a bit more into this, but this seems pretty much like a Kubernetes / GCE bug. One workaround seems to be passing down "relative devices", but as soon as you look at this from a topology-optimization angle, that workaround breaks.

dkozlov commented 5 years ago

I think it is also an NVIDIA/k8s-device-plugin problem, because the same issue could happen on any k8s installation: if a GPU goes down with a host restart, the pod bound to the crashed GPU will not be automatically restarted on another healthy GPU. You could probably reproduce this issue on a bare-metal server with the following steps: 1) stop a GPU node with an attached GPU pod, 2) physically remove one of its two GPU devices, 3) start the node from step 1, 4) check the status of the GPU pod (it should be successfully restarted).

siom79 commented 5 years ago

@dkozlov I have just seen this error on Microsoft Azure after having stopped and started a VM of type Standard_NV6 (Nvidia Tesla M60). Stopping a VM deallocates the VM from the underlying hardware and starting it may run it on a different physical machine with a different GPU device id.

Error: failed to start container "xxx": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=GPU-26eab6dd-af7b-a940-128d-198b027572bb --compute --utility --require=cuda>=8.0 --pid=26495 /disks/dev/sdc1/aufs/mnt/b850c39687b59a3299c404dff7c235c2e07ae0adfd379d0da2f0b8e3fe87caa6]\\\\nnvidia-container-cli: device error: unknown device id: GPU-26eab6dd-af7b-a940-128d-198b027572bb\\\\n\\\"\"": unknown

nvidia-docker version:

NVIDIA Docker: 2.0.3
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:24:51 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:23:15 2018
  OS/Arch:          linux/amd64
  Experimental:     false

RenaudWasTaken commented 5 years ago

I've looked a bit more into this, and I don't see this scenario being supported any time soon. As was mentioned above, you are physically removing the GPU.

Kubernetes statically assigns the GPU to the container; at this point there is nothing that can be done.
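
To make the "static assignment" point concrete, here is a rough sketch of the allocation path, assuming the v1beta1 device-plugin API; this is an illustrative shape, not this plugin's verbatim code. The kubelet calls Allocate with the device IDs it recorded for the pod, and the plugin forwards them unchanged as NVIDIA_VISIBLE_DEVICES, so a UUID that no longer exists on the node still reaches nvidia-container-cli and fails with "unknown device id":

```go
// Sketch of a device plugin's Allocate handler (assumed shape).
package sketch

import (
	"context"
	"strings"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

type nvidiaDevicePlugin struct{}

func (p *nvidiaDevicePlugin) Allocate(ctx context.Context, req *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for _, creq := range req.ContainerRequests {
		// creq.DevicesIDs is the static assignment made by the kubelet; it is not
		// re-resolved against the GPUs currently present on the node.
		resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
			Envs: map[string]string{
				"NVIDIA_VISIBLE_DEVICES": strings.Join(creq.DevicesIDs, ","),
			},
		})
	}
	return resp, nil
}
```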

dkozlov commented 5 years ago

@RenaudWasTaken so could you please confirm that NVIDIA/k8s-device-plugin does not support at least Google Cloud and Azure? With https://github.com/GoogleCloudPlatform/container-engine-accelerators the "OutOfnvidia.com/gpu" issue does not appear. As a workaround you could use the device file name instead of the device UUID, see https://github.com/NVIDIA/k8s-device-plugin/issues/98#issuecomment-468928387

siom79 commented 5 years ago

@RenaudWasTaken I don't think this is a basic Kubernetes issue, as the Accelerators feature gate (see here), which we used until Kubernetes 1.10, worked like a charm on Azure. You could stop and start VMs without any problem, even if they got scheduled on different hardware.

siom79 commented 5 years ago

I had a quick look at the sources. Here the device argument of nvidia-container-cli gets set. The Devices string used there is set here. Since getDevices() uses the environment variable NVIDIA_VISIBLE_DEVICES, setting this variable as suggested here works around the problem:

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
```
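
Worth noting about this workaround (my reading, not confirmed by the maintainers): with the nvidia runtime set as the default, NVIDIA_VISIBLE_DEVICES=all makes nvidia-container-cli inject every GPU currently present on the node rather than the stale UUID the kubelet allocated, which is why the container starts again. The trade-off is that the container can see all GPUs on the node, so the per-pod GPU isolation normally enforced by the device plugin is lost.
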
github-actions[bot] commented 9 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] commented 8 months ago

This issue was automatically closed due to inactivity.