It suddenly recognized the GPU correctly, but it was gone again after rebooting the node, and an error occurred on the DP pod as below:
The device plugin sees the GPU, but I don't have enough info to determine if allocating the GPU is causing an error. It could be that the VM tried to start but failed. Maybe check the virt-launcher pod logs? What does lspci -nnks 0b:00 show after reboot?
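For reference, a quick way to pull those logs (a sketch; the namespace and pod name are placeholders for whatever VM was created):
# List virt-launcher pods in the namespace where the VM runs (names are hypothetical)
oc get pods -n my-vms | grep virt-launcher
oc logs -n my-vms virt-launcher-myvm-abcde --tail=100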
@rupang790, a few things:
ImagePullSecrets is optional and is typically needed only when using images from a private registry. In this case, the GPU DP uses an image from a public registry, so this isn't really needed. We will fix the manifests.
Feb 22 00:43:30 worker01.eluon.okd.com hyperkube[2083]: W0222 00:43:30.439326 2083 kubelet_pods.go:883] Unable to retrieve pull secret kube-system/regcred for kube-system/nvidia-kubevirt-gpu-dp-daemonset-wqm9g due to secret "regcred" not found. The image pull may not succeed.
Also, ensure the GPU is bound to the vfio-pci driver, so that even after a node reboot it is bound to the right driver. From your DP logs, I see that the device discovery has happened correctly; however, the DP is not able to establish a connection and register with the device manager. Is this happening on more than one node, and is it consistently reproducible?
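As a side note, a common way to keep that binding persistent across reboots is to hand the device IDs to vfio-pci at boot. A minimal sketch, assuming the 10de:1eb8 vendor:device ID from the lspci output below (on OKD this file would normally be laid down via a MachineConfig rather than edited by hand):
# /etc/modprobe.d/vfio.conf -- claim the T4 for vfio-pci before nouveau can grab it
options vfio-pci ids=10de:1eb8
softdep nouveau pre: vfio-pci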
@rthallisey, actually there is no VM; I just created the CR for KubeVirt and restarted the DP pods, but they show those logs.
@sseetharaman6, about ImagePullSecrets, I will wait for the manifest fix. To be sure, I checked all of the configuration related to VFIO and IOMMU, then rebooted the node and checked that the vfio-pci driver is correctly bound to the GPU, as below:
[root@worker01 ~]# lspci -nnks 0b:00
0b:00.0 3D controller [0302]: NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:12a2]
Kernel driver in use: vfio-pci
Kernel modules: nouveau
[root@worker01 ~]# dmesg | grep vfio
[ 40.961199] vfio_pci: add [10de:1eb8[ffffffff:ffffffff]] class 0x000000/00000000
According to these results, it seems that the vfio-pci driver is correctly bound to the GPU. However, the GPU is still not available on the node, as shown by the node's allocatable and capacity resources:
[root@okd-bastion01 gpu]# oc get node worker01.eluon.okd.com -o json | jq '.status.allocatable'
{
"cpu": "47500m",
"devices.kubevirt.io/kvm": "110",
"devices.kubevirt.io/tun": "110",
"devices.kubevirt.io/vhost-net": "110",
"ephemeral-storage": "1078600448671",
"hugepages-1Gi": "10Gi",
"hugepages-2Mi": "0",
"memory": "21153860Ki",
"nvidia.com/TU104GL_Tesla_T4": "0",
"nvidia.com/gpu": "0",
"openshift.io/ens3f0netdev": "4",
"openshift.io/ens3f0vfio": "4",
"openshift.io/ens3f1netdev": "0",
"openshift.io/ens3f1vfio": "0",
"pods": "250"
}
[root@okd-bastion01 gpu]# oc get node worker01.eluon.okd.com -o json | jq '.status.capacity'
{
"cpu": "48",
"devices.kubevirt.io/kvm": "110",
"devices.kubevirt.io/tun": "110",
"devices.kubevirt.io/vhost-net": "110",
"ephemeral-storage": "1171521476Ki",
"hugepages-1Gi": "10Gi",
"hugepages-2Mi": "0",
"memory": "32790596Ki",
"nvidia.com/TU104GL_Tesla_T4": "0",
"nvidia.com/gpu": "0",
"openshift.io/ens3f0netdev": "4",
"openshift.io/ens3f0vfio": "4",
"openshift.io/ens3f1netdev": "0",
"openshift.io/ens3f1vfio": "0",
"pods": "250"
}
I have only one GPU node, worker01.eluon.okd.com, so I could not check whether this happens on other nodes. Sorry about that.
The device manager and device plugin communicate over a unix socket. Can you check if /var/lib/kubelet/device-plugins/kubelet.sock exists? Maybe there's something else in the kubelet logs that can point to what's going on, because clearly the gpu-device-plugin sees the GPU, but Kube doesn't.
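For reference, something along these lines may surface a registration failure (a sketch; on OKD the kubelet logs through the hyperkube binary, so adjust the unit name if needed):
# Look for device-plugin registration activity around DP startup
journalctl -u kubelet --no-pager | grep -iE 'device.?plugin|Registered'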
@rthallisey, /var/lib/kubelet/device-plugins/kubelet.sock exists, as below:
[root@worker01 ~]# cd /var/lib/kubelet/device-plugins/
[root@worker01 device-plugins]# ll
total 8
-rw-r--r--. 1 root root 0 Feb 24 01:09 DEPRECATION
srwxr-xr-x. 1 root root 0 Feb 24 01:09 kubelet.sock
-rw-------. 1 root root 4466 Feb 24 22:29 kubelet_internal_checkpoint
srwxr-xr-x. 1 root root 0 Feb 24 05:58 kubevirt-TU104GL_Tesla_T4.sock
srwxr-xr-x. 1 root root 0 Feb 24 05:57 kubevirt-kvm.sock
srwxr-xr-x. 1 root root 0 Feb 24 05:57 kubevirt-tun.sock
srwxr-xr-x. 1 root root 0 Feb 24 05:57 kubevirt-vhost-net.sock
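One more thing worth checking, since the DP socket is clearly there: whether the kubelet ever recorded the resource in its checkpoint file. A minimal sketch, assuming the default checkpoint path (it is plain JSON whose RegisteredDevices section lists every resource that completed registration):
# If registration succeeded, the Tesla resource name should appear here
grep -i tesla /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint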
I tried to check the kubelet logs related to the GPU, and these seem to be the relevant lines:
Feb 19 01:24:59 worker01.eluon.okd.com hyperkube[2822]: I0219 01:24:59.370515 2822 nvidia.go:95] Found device with vendorID "0x10de"
Feb 19 01:24:59 worker01.eluon.okd.com hyperkube[2822]: W0219 01:24:59.370643 2822 nvidia.go:61] NVIDIA GPU metrics will not be available: Could not initialize NVML: could not load NVML library
Feb 19 07:17:52 worker01.eluon.okd.com hyperkube[2083]: I0219 07:17:52.585075 2083 nvidia.go:95] Found device with vendorID "0x10de"
Feb 19 07:17:52 worker01.eluon.okd.com hyperkube[2083]: W0219 07:17:52.585317 2083 nvidia.go:61] NVIDIA GPU metrics will not be available: Could not initialize NVML: could not load NVML library
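As an aside, those nvidia.go warnings come from the kubelet's cAdvisor GPU-metrics code trying to load the host NVML library. On a passthrough node where the GPU is bound to vfio-pci and no host NVIDIA driver is installed, that load is expected to fail, so the warning by itself is probably harmless. A quick check, assuming the usual Fedora library path:
# No host driver means no NVML; the kubelet warning is then expected
ls /usr/lib64/libnvidia-ml.so* 2>/dev/null || echo "no host NVML library"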
However, is it possible that this issue is related to the hugepage or cpu-manager settings? I separated the GPU node from the cpu-manager-enabled nodes by MachineConfig, but the hugepage MachineConfig is not separated.
Could it be a kernel mismatch? I'm curious whether you are able to use the gpu-operator to attach the GPU to a pod, or whether it will result in the same error.
@rthallisey, it could be related to the kernel. Actually, I also had a problem with the gpu-operator after updating the cluster to OKD 4.6, so I filed an issue on NVIDIA/gpu-operator: https://github.com/NVIDIA/gpu-operator/issues/144#issue-805089722.
With the gpu-operator on OKD 4.6, there is an error about the kernel-headers RPM package. On an OKD 4.6 cluster, the worker nodes use Fedora CoreOS 33 with kernel 5.9.16, but there is no kernel-headers-5.9.16 RPM available anywhere. Even if I change the configuration file before installing the gpu-operator, it keeps checking the kernel version and trying to download that package.
So I think it could be related to the kernel version. Do you have any idea how to solve this?
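A quick way to see the mismatch from the node itself (a sketch; paths are the usual Fedora ones):
# The driver build needs headers matching the running kernel
uname -r
ls /usr/src/kernels/ 2>/dev/null || echo "no kernel headers installed"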
The NVML library comes from https://github.com/NVIDIA/gpu-monitoring-tools, specifically here. It could be that the kubevirt-dp needs to update the NVML version to work with newer kernels. If you have access to your nodes, there is a Docker container you can run from gpu-monitoring-tools to get info on the GPU. That might be a way to confirm this.
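Something along these lines should do it on the host (a sketch; the image name and tag are assumptions, so check the gpu-monitoring-tools README for the current ones, and note that it needs a working host NVIDIA driver to report anything):
# Probe the GPU with the DCGM exporter container (image/tag are assumptions)
podman run --rm --privileged nvidia/dcgm-exporter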
@rthallisey, I will try to run gpu-monitoring-tools on my node and share the results.
For that, do I just run the monitoring tools on the host node with docker (or podman), or install them on the cluster? I ask because I tried 'podman run ~~' on the host node (not in the cluster), but it shows Failed to initialize NVML errors.
Also, for the gpu-operator, do I need to install the GPU driver on my host node? As I understand it, the drivers will run as pods on the OKD cluster (Kubernetes) after installing the gpu-operator or kubevirt-gpu-device-plugin; is that correct?
I tried 'podman run ~~' on the host node (not in the cluster), but it shows Failed to initialize NVML errors.
I don't think you need to run it inside the cluster.
It's starting to look like the issue is gpu-monitoring-tools not working with your kernel version. It's worth filing an issue on gpu-monitoring-tools to see what they say.
Referring to #17, I have been trying to use the kubevirt-gpu-device-plugin on my OKD 4.6 cluster with an NVIDIA T4 GPU, but it does not seem to work for me. @kklemon, I already cleared the kubelet-config issue, and the GPU uses the correct kernel driver, vfio-pci, as the results of lspci -nnks 0b:00 show. However, I still cannot see the GPU on the cluster:
For @sseetharaman6, the result of systemctl status kubelet:
And I found some weird logs from journalctl | grep Kubelet:
It said the image pull may not succeed, but the pods are running.