It suddenly recognized the GPU correctly, but it was gone again after rebooting the node, and an error occurred on the DP pod as below:
The device plugin sees the GPU, but I don't have enough info to determine if allocating the GPU is causing an error. It could be that the VM tried to start but failed. Maybe check the virt-launcher pod logs? What does lspci -nnks 0b:00 show after reboot?
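For reference, a quick way to pull those logs (a sketch; the namespace and pod name are placeholders for whatever VM was created):
# List virt-launcher pods in the namespace where the VM runs (names are hypothetical)
oc get pods -n my-vms | grep virt-launcher
oc logs -n my-vms virt-launcher-myvm-abcde --tail=100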
@rupang790, a few things:
ImagePullSecrets is optional and is typically needed only when using images from a private registry. In this case, the GPU DP uses an image from a public registry, so this isn't really needed. We will fix the manifests.
Feb 22 00:43:30 worker01.eluon.okd.com hyperkube[2083]: W0222 00:43:30.439326 2083 kubelet_pods.go:883] Unable to retrieve pull secret kube-system/regcred for kube-system/nvidia-kubevirt-gpu-dp-daemonset-wqm9g due to secret "regcred" not found. The image pull may not succeed.
Also, ensure the GPU is bound to the vfio-pci driver, so that even after a node reboot it is bound to the right driver. From your DP logs, I see that the device discovery has happened correctly; however, the DP is not able to establish a connection and register with the device manager. Is this happening on more than one node, and is it consistently reproducible?
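As a side note, a common way to keep that binding persistent across reboots is to hand the device IDs to vfio-pci at boot. A minimal sketch, assuming the 10de:1eb8 vendor:device ID from the lspci output below (on OKD this file would normally be laid down via a MachineConfig rather than edited by hand):
# /etc/modprobe.d/vfio.conf -- claim the T4 for vfio-pci before nouveau can grab it
options vfio-pci ids=10de:1eb8
softdep nouveau pre: vfio-pci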
@rthallisey, actually there is no VM; I just created the CR for KubeVirt and restarted the DP pods, but they show those logs.
@sseetharaman6, about ImagePullSecrets, I will wait for the manifest fix. To be sure, I checked all of the configuration related to VFIO and IOMMU, then rebooted the node and checked that the vfio-pci driver is correctly bound to the GPU, as below:
[root@worker01 ~]# lspci -nnks 0b:00
0b:00.0 3D controller [0302]: NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:12a2]
Kernel driver in use: vfio-pci
Kernel modules: nouveau
[root@worker01 ~]# dmesg | grep vfio
[ 40.961199] vfio_pci: add [10de:1eb8[ffffffff:ffffffff]] class 0x000000/00000000
According to these results, it seems that the vfio-pci driver is correctly bound to the GPU. However, the GPU is still not available on the node, as shown by the node's allocatable and capacity resources:
[root@okd-bastion01 gpu]# oc get node worker01.eluon.okd.com -o json | jq '.status.allocatable'
{
"cpu": "47500m",
"devices.kubevirt.io/kvm": "110",
"devices.kubevirt.io/tun": "110",
"devices.kubevirt.io/vhost-net": "110",
"ephemeral-storage": "1078600448671",
"hugepages-1Gi": "10Gi",
"hugepages-2Mi": "0",
"memory": "21153860Ki",
"nvidia.com/TU104GL_Tesla_T4": "0",
"nvidia.com/gpu": "0",
"openshift.io/ens3f0netdev": "4",
"openshift.io/ens3f0vfio": "4",
"openshift.io/ens3f1netdev": "0",
"openshift.io/ens3f1vfio": "0",
"pods": "250"
}
[root@okd-bastion01 gpu]# oc get node worker01.eluon.okd.com -o json | jq '.status.capacity'
{
"cpu": "48",
"devices.kubevirt.io/kvm": "110",
"devices.kubevirt.io/tun": "110",
"devices.kubevirt.io/vhost-net": "110",
"ephemeral-storage": "1171521476Ki",
"hugepages-1Gi": "10Gi",
"hugepages-2Mi": "0",
"memory": "32790596Ki",
"nvidia.com/TU104GL_Tesla_T4": "0",
"nvidia.com/gpu": "0",
"openshift.io/ens3f0netdev": "4",
"openshift.io/ens3f0vfio": "4",
"openshift.io/ens3f1netdev": "0",
"openshift.io/ens3f1vfio": "0",
"pods": "250"
}
I have only one GPU node, worker01.eluon.okd.com, so I could not check whether this happens on other nodes. Sorry about that.
The device manager and device plugin communicate over a unix socket. Can you check if /var/lib/kubelet/device-plugins/kubelet.sock exists? Maybe there's something else in the kubelet logs that can point to what's going on, because clearly the gpu-device-plugin sees the GPU, but Kube doesn't.
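For reference, something along these lines may surface a registration failure (a sketch; on OKD the kubelet logs through the hyperkube binary, so adjust the unit name if needed):
# Look for device-plugin registration activity around DP startup
journalctl -u kubelet --no-pager | grep -iE 'device.?plugin|Registered'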
@rthallisey, /var/lib/kubelet/device-plugins/kubelet.sock exists, as below:
[root@worker01 ~]# cd /var/lib/kubelet/device-plugins/
[root@worker01 device-plugins]# ll
total 8
-rw-r--r--. 1 root root 0 Feb 24 01:09 DEPRECATION
srwxr-xr-x. 1 root root 0 Feb 24 01:09 kubelet.sock
-rw-------. 1 root root 4466 Feb 24 22:29 kubelet_internal_checkpoint
srwxr-xr-x. 1 root root 0 Feb 24 05:58 kubevirt-TU104GL_Tesla_T4.sock
srwxr-xr-x. 1 root root 0 Feb 24 05:57 kubevirt-kvm.sock
srwxr-xr-x. 1 root root 0 Feb 24 05:57 kubevirt-tun.sock
srwxr-xr-x. 1 root root 0 Feb 24 05:57 kubevirt-vhost-net.sock
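One more thing worth checking, since the DP socket is clearly there: whether the kubelet ever recorded the resource in its checkpoint file. A minimal sketch, assuming the default checkpoint path (it is plain JSON whose RegisteredDevices section lists every resource that completed registration):
# If registration succeeded, the Tesla resource name should appear here
grep -i tesla /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint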
I tried to check the kubelet logs related to the GPU, and these seem to be the relevant lines:
Feb 19 01:24:59 worker01.eluon.okd.com hyperkube[2822]: I0219 01:24:59.370515 2822 nvidia.go:95] Found device with vendorID "0x10de"
Feb 19 01:24:59 worker01.eluon.okd.com hyperkube[2822]: W0219 01:24:59.370643 2822 nvidia.go:61] NVIDIA GPU metrics will not be available: Could not initialize NVML: could not load NVML library
Feb 19 07:17:52 worker01.eluon.okd.com hyperkube[2083]: I0219 07:17:52.585075 2083 nvidia.go:95] Found device with vendorID "0x10de"
Feb 19 07:17:52 worker01.eluon.okd.com hyperkube[2083]: W0219 07:17:52.585317 2083 nvidia.go:61] NVIDIA GPU metrics will not be available: Could not initialize NVML: could not load NVML library
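As an aside, those nvidia.go warnings come from the kubelet's cAdvisor GPU-metrics code trying to load the host NVML library. On a passthrough node where the GPU is bound to vfio-pci and no host NVIDIA driver is installed, that load is expected to fail, so the warning by itself is probably harmless. A quick check, assuming the usual Fedora library path:
# No host driver means no NVML; the kubelet warning is then expected
ls /usr/lib64/libnvidia-ml.so* 2>/dev/null || echo "no host NVML library"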
However, is it possible that this issue is related to the hugepage or cpu-manager settings? I separated the GPU node from the cpu-manager-enabled nodes by MachineConfig, but the hugepage MachineConfig is not separated.
Could it be a kernel mismatch? I'm curious whether you are able to use the gpu-operator to attach the GPU to a pod, or whether it will result in the same error.
@rthallisey, it could be related to the kernel. Actually, I also had a problem with the gpu-operator after updating the cluster to OKD 4.6, so I filed an issue on NVIDIA/gpu-operator: https://github.com/NVIDIA/gpu-operator/issues/144#issue-805089722.
With the gpu-operator on OKD 4.6, there is an error about the kernel-headers RPM package. On an OKD 4.6 cluster, the worker nodes use Fedora CoreOS 33 with kernel 5.9.16, but there is no kernel-headers-5.9.16 RPM available anywhere. Even if I change the configuration file before installing the gpu-operator, it keeps checking the kernel version and trying to download that package.
So I think it could be related to the kernel version. Do you have any idea how to solve this?
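A quick way to see the mismatch from the node itself (a sketch; paths are the usual Fedora ones):
# The driver build needs headers matching the running kernel
uname -r
ls /usr/src/kernels/ 2>/dev/null || echo "no kernel headers installed"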
The NVML library comes from https://github.com/NVIDIA/gpu-monitoring-tools, specifically here. It could be that the kubevirt-dp needs to update the NVML version to work with newer kernels. If you have access to your nodes, there is a Docker container you can run from gpu-monitoring-tools to get info on the GPU. That might be a way to confirm this.
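Something along these lines should do it on the host (a sketch; the image name and tag are assumptions, so check the gpu-monitoring-tools README for the current ones, and note that it needs a working host NVIDIA driver to report anything):
# Probe the GPU with the DCGM exporter container (image/tag are assumptions)
podman run --rm --privileged nvidia/dcgm-exporter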
@rthallisey, I will try to run gpu-monitoring-tools on my node and share the results.
For that, do I just run the monitoring tools on the host node with docker (or podman), or install them on the cluster? I ask because I tried 'podman run ~~' on the host node (not in the cluster), but it shows Failed to initialize NVML errors.
Also, for the gpu-operator, do I need to install the GPU driver on my host node? As I understand it, the drivers will run as pods on the OKD cluster (Kubernetes) after installing the gpu-operator or kubevirt-gpu-device-plugin; is that correct?
I tried 'podman run ~~' on the host node (not in the cluster), but it shows Failed to initialize NVML errors.
I don't think you need to run it inside the cluster.
It's starting to look like the issue is gpu-monitoring-tools not working with your kernel version. It's worth filing an issue on gpu-monitoring-tools to see what they say.
Referring to #17, I have been trying to use the kubevirt-gpu-device-plugin on my OKD 4.6 cluster with an NVIDIA T4 GPU, but it does not seem to work for me. @kklemon, I already cleared the kubelet-config issue, and the GPU uses the correct kernel driver, vfio-pci, as the results of lspci -nnks 0b:00 show. However, I still cannot see the GPU on the cluster:
For @sseetharaman6, the result of systemctl status kubelet:
And I found some weird logs from journalctl | grep Kubelet:
It said the image pull may not succeed, but the pods are running.