Closed: dkozlov closed this issue 8 months ago.
Probably using ID: nvidia0 instead of ID: GPU-3187374f-4d6c-93e6-9c79-c5567b3d9202 would solve this problem.
I'll look a bit more into this, but this seems pretty much like a kubernetes / GCE bug. One workaround seems to be passing down "relative devices" but as soon as you look at this from a topology optimization angle this workaround breaks...
I think it is also an NVIDIA/k8s-device-plugin problem, because the same issue could happen on any k8s installation if one of the GPUs goes down after a host restart and the pod bound to the crashed GPU is not automatically restarted on another healthy GPU. You could probably reproduce this issue on a bare-metal server with the following steps: 1) stop a GPU node with a GPU pod attached; 2) physically remove one of the two GPU devices; 3) start the node from step 1; 4) check the status of the GPU pod (it should be successfully restarted).
@dkozlov I have just seen this error on Microsoft Azure after having stopped and started a VM of type Standard_NV6 (Nvidia Tesla M60). Stopping a VM deallocates the VM from the underlying hardware and starting it may run it on a different physical machine with a different GPU device id.
Error: failed to start container "xxx": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=GPU-26eab6dd-af7b-a940-128d-198b027572bb --compute --utility --require=cuda>=8.0 --pid=26495 /disks/dev/sdc1/aufs/mnt/b850c39687b59a3299c404dff7c235c2e07ae0adfd379d0da2f0b8e3fe87caa6]\\\\nnvidia-container-cli: device error: unknown device id: GPU-26eab6dd-af7b-a940-128d-198b027572bb\\\\n\\\"\"": unknown
nvidia-docker version:
NVIDIA Docker: 2.0.3

Client:
 Version:       18.06.1-ce
 API version:   1.38
 Go version:    go1.10.3
 Git commit:    e68fc7a
 Built:         Tue Aug 21 17:24:51 2018
 OS/Arch:       linux/amd64
 Experimental:  false

Server:
 Engine:
  Version:      18.06.1-ce
  API version:  1.38 (minimum version 1.12)
  Go version:   go1.10.3
  Git commit:   e68fc7a
  Built:        Tue Aug 21 17:23:15 2018
  OS/Arch:      linux/amd64
  Experimental: false
I've looked a bit more into this, and I don't see this scenario being supported any time soon. As was mentioned above, you are physically removing the GPU.
Kubernetes statically assigns the GPU to the containers, so at this point there is nothing that can be done.
@RenaudWasTaken so could you please confirm that NVIDIA/k8s-device-plugin does not support at least Google Cloud and Azure? With https://github.com/GoogleCloudPlatform/container-engine-accelerators the "OutOfnvidia.com/gpu" issue does not appear. As a workaround you can use the device file name instead of the device UUID; see https://github.com/NVIDIA/k8s-device-plugin/issues/98#issuecomment-468928387
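For what it's worth (an editorial addition, not part of the original thread): later releases of NVIDIA/k8s-device-plugin added a --device-id-strategy option that can be set to index instead of uuid, so the plugin hands device indices to the runtime; an index stays stable when the underlying GPU is replaced, whereas the UUID does not. A minimal sketch of the relevant part of the plugin DaemonSet, assuming a plugin version that supports the flag (the image tag below is illustrative):

      containers:
        - name: nvidia-device-plugin-ctr
          image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0  # illustrative tag; verify your plugin version supports the flag
          args:
            - "--device-id-strategy=index"                 # advertise/pass devices by index instead of UUID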
@RenaudWasTaken I don't think that this is a basic Kubernetes issue, as the Accelerators feature gate (see here) that we used until Kubernetes 1.10 worked like a charm on Azure. You could stop and start VMs without any problem, even if they got scheduled on different hardware.
I had a quick look at the sources. Here the device arg of nvidia-container-cli gets set. The Devices string used there is set here. As getDevices() uses the environment variable NVIDIA_VISIBLE_DEVICES, setting this variable as suggested here works around this problem:
apiVersion: extensions/v1beta1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.
1. Issue or feature description
An OutOfnvidia.com/gpu error appeared when a node was restarted in GCE with https://github.com/NVIDIA/k8s-device-plugin after an "Instance terminated during maintenance" operation appeared on the https://console.cloud.google.com/compute/operations page. The GPU device ID changed after the instance was terminated during maintenance (the GPU was replaced), so the pod could not be started.
2. Steps to reproduce the issue
1) Install Kubernetes on GCE instances and install NVIDIA/k8s-device-plugin.
2) Create a pod with an nvidia.com/gpu resource (a minimal example manifest is sketched below).
3) Wait for an "Instance terminated during maintenance" operation to appear on the https://console.cloud.google.com/compute/operations page.
4) Wait for the GPU node to restart and check the pod status.
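For reference, a minimal pod of the kind used in step 2 could look like the sketch below (the pod name and image are illustrative, not taken from the original report):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                      # illustrative name
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:10.0-base    # illustrative image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1           # request one GPU from the device plugin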
Common error checking: nvidia-smi -a on your host:
==============NVSMI LOG==============
Timestamp : Sat Mar 2 12:57:28 2019
Driver Version : 410.79
CUDA Version : 10.0

Attached GPUs : 1
GPU 00000000:00:04.0
    Product Name : Tesla K80
    Product Brand : Tesla
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Disabled
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 0320617086523
    GPU UUID : GPU-e05bb4a9-1b8c-1bb6-a3bc-59acddd70f54
    Minor Number : 0
    VBIOS Version : 80.21.25.00.01
    MultiGPU Board : No
    Board ID : 0x4
    GPU Part Number : 900-22080-6300-001
    Inforom Version
        Image Version : 2080.0200.00.04
        OEM Object : 1.1
        ECC Object : 3.0
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU Virtualization Mode
        Virtualization mode : Pass-Through
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x00
        Device : 0x04
        Domain : 0x0000
        Device Id : 0x102D10DE
        Bus Id : 00000000:00:04.0
        Sub System Id : 0x106C10DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 3
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays since reset : 0
        Tx Throughput : N/A
        Rx Throughput : N/A
    Fan Speed : N/A
    Performance State : P0
    Clocks Throttle Reasons
        Idle : Not Active
        Applications Clocks Setting : Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
        HW Thermal Slowdown : N/A
        HW Power Brake Slowdown : N/A
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 11441 MiB
        Used : 6325 MiB
        Free : 5116 MiB
    BAR1 Memory Usage
        Total : 16384 MiB
        Used : 2 MiB
        Free : 16382 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Texture Memory : 0
                Texture Shared : N/A
                CBU : N/A
                Total : 0
            Double Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Texture Memory : 0
                Texture Shared : N/A
                CBU : N/A
                Total : 0
        Aggregate
            Single Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Texture Memory : 0
                Texture Shared : N/A
                CBU : N/A
                Total : 0
            Double Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Texture Memory : 0
                Texture Shared : N/A
                CBU : N/A
                Total : 0
    Retired Pages
        Single Bit ECC : 0
        Double Bit ECC : 0
        Pending : No
    Temperature
        GPU Current Temp : 33 C
        GPU Shutdown Temp : 93 C
        GPU Slowdown Temp : 88 C
        GPU Max Operating Temp : N/A
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
    Power Readings
        Power Management : Supported
        Power Draw : 62.27 W
        Power Limit : 149.00 W
        Default Power Limit : 149.00 W
        Enforced Power Limit : 149.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 175.00 W
    Clocks
        Graphics : 771 MHz
        SM : 771 MHz
        Memory : 2505 MHz
        Video : 540 MHz
    Applications Clocks
        Graphics : 562 MHz
        Memory : 2505 MHz
    Default Applications Clocks
        Graphics : 562 MHz
        Memory : 2505 MHz
    Max Clocks
        Graphics : 875 MHz
        SM : 875 MHz
        Memory : 2505 MHz
        Video : 540 MHz
    Max Customer Boost Clocks
        Graphics : N/A
    Clock Policy
        Auto Boost : On
        Auto Boost Default : On
    Processes
        Process ID : 18340
            Type : C
            Name : test
            Used GPU Memory : 6312 MiB
Kubelet logs on the node: sudo journalctl -r -u kubelet
Additional information that might help better understand your environment and reproduce the bug:
docker version
Server: Docker Engine - Community
 Engine:
  Version:      18.09.2
  API version:  1.39 (minimum version 1.12)
  Go version:   go1.10.6
  Git commit:   6247962
  Built:        Sun Feb 10 03:42:13 2019
  OS/Arch:      linux/amd64
  Experimental: false
Linux prod-feb-slave-gpu-instance-0 4.15.0-1027-gcp #28-Ubuntu SMP Tue Jan 15 12:29:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
nvidia-container-cli -V