NVIDIA / kubevirt-gpu-device-plugin

NVIDIA k8s device plugin for Kubevirt
BSD 3-Clause "New" or "Revised" License

Fix errors due to mismatch of GLIBC version caused from Go 1.20+ #82

Closed shivamerla closed 7 months ago

shivamerla commented 8 months ago

Fix errors caused by a glibc version mismatch with Go 1.20+. They occur when the binary is built against a glibc version different from the one in the target image. This started happening when the Go version was updated in commit 24751abfcfe94ec46a549c63af0f7c85f2f728d9. To avoid it, we now install Go in the same base image as the target image.

Errors:

kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by kubevirt-gpu-device-plugin)
kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by kubevirt-gpu-device-plugin)
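
The two loader messages above name the glibc symbol versions the binary was linked against. As a rough diagnostic, and assuming the error format shown above, a sketch like the following pulls out the required versions and lists those newer than the glibc in the target image. The function names and the sample target version `2.28` are hypothetical and are not taken from this PR.

```python
import re

# Matches loader errors of the form: version `GLIBC_2.32' not found
_GLIBC_RE = re.compile(r"version `GLIBC_(\d+(?:\.\d+)*)' not found")


def required_glibc_versions(log_text):
    """Return the sorted, de-duplicated GLIBC versions named in loader errors."""
    found = {
        tuple(int(part) for part in match.group(1).split("."))
        for match in _GLIBC_RE.finditer(log_text)
    }
    return sorted(found)


def missing_versions(log_text, target_version):
    """Return the required versions that are newer than the target's glibc."""
    target = tuple(int(part) for part in target_version.split("."))
    return [v for v in required_glibc_versions(log_text) if v > target]


if __name__ == "__main__":
    errors = (
        "kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.32' "
        "not found (required by kubevirt-gpu-device-plugin)\n"
        "kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.34' "
        "not found (required by kubevirt-gpu-device-plugin)\n"
    )
    # "2.28" is an illustrative target glibc version, not a value from this PR.
    for version in missing_versions(errors, "2.28"):
        print("needs GLIBC_" + ".".join(map(str, version)))
```

Versions are compared as integer tuples rather than strings, so that `2.4` sorts below `2.32`.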

shivamerla commented 8 months ago

Tested with vGPU devices

Allocatable:
  cpu:                            80
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              189217404206
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         394654312Ki
  nvidia.com/A10-12Q:             0
  nvidia.com/GA102GL_A10:         0
  nvidia.com/NVIDIA_A10-12Q:      0
  nvidia.com/NVIDIA_A10-8Q:       6
  nvidia.com/gpu:                 0
  pods:                           110
cnt-dev@cnt-server-2:~$ kubectl logs -f nvidia-sandbox-device-plugin-daemonset-zxjj8 -n gpu-operator
Defaulted container "nvidia-sandbox-device-plugin-ctr" out of: nvidia-sandbox-device-plugin-ctr, vfio-pci-validation (init), vgpu-devices-validation (init)
2023/11/07 21:31:46 Not a device, continuing
2023/11/07 21:31:46 Nvidia device  0000:3b:00.0
2023/11/07 21:31:46 Nvidia device  0000:3b:00.4
2023/11/07 21:31:46 Nvidia device  0000:3b:00.5
2023/11/07 21:31:46 Nvidia device  0000:3b:00.6
2023/11/07 21:31:46 Nvidia device  0000:3b:00.7
2023/11/07 21:31:46 Nvidia device  0000:3b:01.0
2023/11/07 21:31:46 Nvidia device  0000:3b:01.1
2023/11/07 21:31:46 Nvidia device  0000:3b:01.2
2023/11/07 21:31:46 Nvidia device  0000:3b:01.3
2023/11/07 21:31:46 Nvidia device  0000:3b:01.4
2023/11/07 21:31:46 Nvidia device  0000:3b:01.5
2023/11/07 21:31:46 Nvidia device  0000:3b:01.6
2023/11/07 21:31:46 Nvidia device  0000:3b:01.7
2023/11/07 21:31:46 Nvidia device  0000:3b:02.0
2023/11/07 21:31:46 Nvidia device  0000:3b:02.1
2023/11/07 21:31:46 Nvidia device  0000:3b:02.2
2023/11/07 21:31:46 Nvidia device  0000:3b:02.3
2023/11/07 21:31:46 Nvidia device  0000:3b:02.4
2023/11/07 21:31:46 Nvidia device  0000:3b:02.5
2023/11/07 21:31:46 Nvidia device  0000:3b:02.6
2023/11/07 21:31:46 Nvidia device  0000:3b:02.7
2023/11/07 21:31:46 Nvidia device  0000:3b:03.0
2023/11/07 21:31:46 Nvidia device  0000:3b:03.1
2023/11/07 21:31:46 Nvidia device  0000:3b:03.2
2023/11/07 21:31:46 Nvidia device  0000:3b:03.3
2023/11/07 21:31:46 Nvidia device  0000:3b:03.4
2023/11/07 21:31:46 Nvidia device  0000:3b:03.5
2023/11/07 21:31:46 Nvidia device  0000:3b:03.6
2023/11/07 21:31:46 Nvidia device  0000:3b:03.7
2023/11/07 21:31:46 Nvidia device  0000:3b:04.0
2023/11/07 21:31:46 Nvidia device  0000:3b:04.1
2023/11/07 21:31:46 Nvidia device  0000:3b:04.2
2023/11/07 21:31:46 Nvidia device  0000:3b:04.3
2023/11/07 21:31:46 Nvidia device  0000:86:00.0
2023/11/07 21:31:46 Nvidia device  0000:86:00.4
2023/11/07 21:31:46 Nvidia device  0000:86:00.5
2023/11/07 21:31:46 Nvidia device  0000:86:00.6
2023/11/07 21:31:46 Nvidia device  0000:86:00.7
2023/11/07 21:31:46 Nvidia device  0000:86:01.0
2023/11/07 21:31:46 Nvidia device  0000:86:01.1
2023/11/07 21:31:46 Nvidia device  0000:86:01.2
2023/11/07 21:31:46 Nvidia device  0000:86:01.3
2023/11/07 21:31:46 Nvidia device  0000:86:01.4
2023/11/07 21:31:46 Nvidia device  0000:86:01.5
2023/11/07 21:31:46 Nvidia device  0000:86:01.6
2023/11/07 21:31:46 Nvidia device  0000:86:01.7
2023/11/07 21:31:46 Nvidia device  0000:86:02.0
2023/11/07 21:31:46 Nvidia device  0000:86:02.1
2023/11/07 21:31:46 Nvidia device  0000:86:02.2
2023/11/07 21:31:46 Nvidia device  0000:86:02.3
2023/11/07 21:31:46 Nvidia device  0000:86:02.4
2023/11/07 21:31:46 Nvidia device  0000:86:02.5
2023/11/07 21:31:46 Nvidia device  0000:86:02.6
2023/11/07 21:31:46 Nvidia device  0000:86:02.7
2023/11/07 21:31:46 Nvidia device  0000:86:03.0
2023/11/07 21:31:46 Nvidia device  0000:86:03.1
2023/11/07 21:31:46 Nvidia device  0000:86:03.2
2023/11/07 21:31:46 Nvidia device  0000:86:03.3
2023/11/07 21:31:46 Nvidia device  0000:86:03.4
2023/11/07 21:31:46 Nvidia device  0000:86:03.5
2023/11/07 21:31:46 Nvidia device  0000:86:03.6
2023/11/07 21:31:46 Nvidia device  0000:86:03.7
2023/11/07 21:31:46 Nvidia device  0000:86:04.0
2023/11/07 21:31:46 Nvidia device  0000:86:04.1
2023/11/07 21:31:46 Nvidia device  0000:86:04.2
2023/11/07 21:31:46 Nvidia device  0000:86:04.3
2023/11/07 21:31:46 Not a device, continuing
2023/11/07 21:31:46 Gpu id is 0000:86:00.6
2023/11/07 21:31:46 Vgpu id is NVIDIA_A10-8Q
2023/11/07 21:31:46 Gpu id is 0000:3b:00.6
2023/11/07 21:31:46 Vgpu id is NVIDIA_A10-8Q
2023/11/07 21:31:46 Gpu id is 0000:3b:00.4
2023/11/07 21:31:46 Vgpu id is NVIDIA_A10-8Q
2023/11/07 21:31:46 Gpu id is 0000:86:00.5
2023/11/07 21:31:46 Vgpu id is NVIDIA_A10-8Q
2023/11/07 21:31:46 Gpu id is 0000:3b:00.5
2023/11/07 21:31:46 Vgpu id is NVIDIA_A10-8Q
2023/11/07 21:31:46 Gpu id is 0000:86:00.4
2023/11/07 21:31:46 Vgpu id is NVIDIA_A10-8Q
2023/11/07 21:31:46 Iommu Map map[]
2023/11/07 21:31:46 Device Map map[]
2023/11/07 21:31:46 vGPU Map  map[NVIDIA_A10-8Q:[{32e52bb7-29d9-45e7-8aec-b1f15dbcf887} {3c0b19b0-2355-4026-9bc0-7bc2bfad5b79} {5e535b46-ca45-4cb7-b8b6-169791609fc6} {8437039c-1751-4c83-b3d7-32f45928186b} {9b3bda2c-631b-4de4-8aa7-3e5c8c0c505d} {d808afcf-b924-42fe-b29b-781507a3ba52}]]
2023/11/07 21:31:46 GPU vGPU Map  map[0000:3b:00.4:[5e535b46-ca45-4cb7-b8b6-169791609fc6] 0000:3b:00.5:[9b3bda2c-631b-4de4-8aa7-3e5c8c0c505d] 0000:3b:00.6:[3c0b19b0-2355-4026-9bc0-7bc2bfad5b79] 0000:86:00.4:[d808afcf-b924-42fe-b29b-781507a3ba52] 0000:86:00.5:[8437039c-1751-4c83-b3d7-32f45928186b] 0000:86:00.6:[32e52bb7-29d9-45e7-8aec-b1f15dbcf887]]
2023/11/07 21:31:46 Could not find NVIDIA device with id: NVIDIA_A10-8Q
2023/11/07 21:31:46 DP Name NVIDIA_A10-8Q
2023/11/07 21:31:46 Devicename NVIDIA_A10-8Q
2023/11/07 21:31:46 NVIDIA_A10-8Q Device plugin server ready
2023/11/07 21:31:46 healthCheck(NVIDIA_A10-8Q): invoked
2023/11/07 21:31:46 healthCheck(NVIDIA_A10-8Q): Loading NVML
2023/11/07 21:31:46 healthCheck(NVIDIA_A10-8Q): Failed to initialize NVML: could not load NVML library
2023/11/07 21:31:46 healthCheck(NVIDIA_A10-8Q): Adding watch for device path: /sys/bus/mdev/devices/32e52bb7-29d9-45e7-8aec-b1f15dbcf887
2023/11/07 21:31:46 healthCheck(NVIDIA_A10-8Q): Adding watch for device path: /sys/bus/mdev/devices/3c0b19b0-2355-4026-9bc0-7bc2bfad5b79
2023/11/07 21:31:46 healthCheck(NVIDIA_A10-8Q): Adding watch for device path: /sys/bus/mdev/devices/5e535b46-ca45-4cb7-b8b6-169791609fc6
2023/11/07 21:31:46 healthCheck(NVIDIA_A10-8Q): Adding watch for device path: /sys/bus/mdev/devices/8437039c-1751-4c83-b3d7-32f45928186b
2023/11/07 21:31:46 healthCheck(NVIDIA_A10-8Q): Adding watch for device path: /sys/bus/mdev/devices/9b3bda2c-631b-4de4-8aa7-3e5c8c0c505d
2023/11/07 21:31:46 healthCheck(NVIDIA_A10-8Q): Adding watch for device path: /sys/bus/mdev/devices/d808afcf-b924-42fe-b29b-781507a3ba52
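
The `vGPU Map` log line above lists six mdev UUIDs for `NVIDIA_A10-8Q`, which matches the `nvidia.com/NVIDIA_A10-8Q: 6` entry under `Allocatable`. A minimal sketch for making that cross-check from the logs, assuming the `map[TYPE:[{uuid} ...]]` layout shown above. The function name is hypothetical and is not part of the plugin.

```python
import re

# One "TYPE:[{uuid} {uuid} ...]" entry inside the printed map.
_TYPE_RE = re.compile(r"([\w.-]+):\[((?:\{[0-9a-f-]+\}\s*)+)\]")
# A single brace-wrapped 36-character UUID.
_UUID_RE = re.compile(r"\{([0-9a-f-]{36})\}")


def count_vgpus_by_type(log_line):
    """Count mdev UUIDs per vGPU type in a 'vGPU Map' log line."""
    # Only look at the text after the "vGPU Map" marker; other lines yield {}.
    _, _, payload = log_line.partition("vGPU Map")
    return {
        name: len(_UUID_RE.findall(body))
        for name, body in _TYPE_RE.findall(payload)
    }


if __name__ == "__main__":
    uuids = [
        "32e52bb7-29d9-45e7-8aec-b1f15dbcf887",
        "3c0b19b0-2355-4026-9bc0-7bc2bfad5b79",
        "5e535b46-ca45-4cb7-b8b6-169791609fc6",
        "8437039c-1751-4c83-b3d7-32f45928186b",
        "9b3bda2c-631b-4de4-8aa7-3e5c8c0c505d",
        "d808afcf-b924-42fe-b29b-781507a3ba52",
    ]
    line = (
        "2023/11/07 21:31:46 vGPU Map  map[NVIDIA_A10-8Q:["
        + " ".join("{" + u + "}" for u in uuids)
        + "]]"
    )
    print(count_vgpus_by_type(line))  # {'NVIDIA_A10-8Q': 6}
```

The `GPU vGPU Map` line uses a different layout (UUIDs without braces), so this sketch returns an empty result for it.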

cdesiniotis commented 8 months ago

Thanks @shivamerla. We are avoiding the potential glibc version mismatch by installing Go and building the Go binaries in the target base image (in this case, the CUDA base image). This aligns with how we build our other containers, for example k8s-device-plugin: https://gitlab.com/nvidia/kubernetes/device-plugin/-/blob/main/deployments/container/Dockerfile.ubi8

cc @rthallisey

rthallisey commented 7 months ago

lgtm