NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Does operator support RHEL 7 Compute Nodes? #228

Open morristm opened 3 years ago

morristm commented 3 years ago

1. Installed the NVIDIA GPU Operator from OperatorHub in an OpenShift 4.6.12 cluster. After creating a ClusterPolicy, the nvidia-container-toolset and nvidia-driver-daemonset daemonsets deploy pods to the nodes that have GPUs installed. The nvidia-driver-daemonset pods all fail trying to use rhocp-4.6-for-rhel-8-x86_64-rpms. These nodes run RHEL 7, so I would expect this to fail. Is there any support in this process for RHEL 7 compute nodes? If not, are there steps I can follow to install the drivers manually on the node and then configure OpenShift and CRI-O so that containers can use the GPUs?

2. Install the NVIDIA GPU Operator, then create a ClusterPolicy.

3. The following is one of the log files from an nvidia-driver-daemonset pod

========== NVIDIA Software Installer ==========

Arjun-D7 commented 3 years ago

Hi Morristm,

Did you resolve this issue? I am facing the same issue in my OpenShift environment.

Please let us know if you know the fix for this.

morristm commented 3 years ago

Hi Arjun, I wasn’t able to get the operator to work, so I decided to install CUDA support on my workers manually. I’m fairly certain I started with this link: https://developer.nvidia.com/cuda-toolkit. Once I had CUDA working and could run standalone containers on each worker that access the GPUs, I took the next steps to deploy the OpenShift resources that allow containers to access and share the GPUs.

Here are the notes I took while doing this. I can’t guarantee they’re perfect, but they should help a little.

I had forgotten about this, and haven’t tested the NVIDIA Operator recently, but considering you’re asking I suspect it’s still intended for RHCOS workers.
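If it helps to see which workers are likely affected, here is a minimal Python sketch that lists the nodes whose OS image does not look CoreOS-based. It reads the JSON produced by `oc get nodes -o json` and looks at each node's `status.nodeInfo.osImage` field. The function name, the substring check, and the sample OS image strings are only assumptions for illustration, so check what your own cluster reports before relying on it.

```python
import json


def non_coreos_nodes(nodes_json):
    """Return (node name, osImage) pairs for nodes whose OS image
    string does not mention CoreOS.

    nodes_json is the text output of `oc get nodes -o json`.
    The "coreos" substring check is an assumption, not an official test.
    """
    items = json.loads(nodes_json).get("items", [])
    result = []
    for node in items:
        name = node["metadata"]["name"]
        os_image = node.get("status", {}).get("nodeInfo", {}).get("osImage", "")
        if "coreos" not in os_image.lower():
            result.append((name, os_image))
    return result


# Hypothetical sample data; real osImage strings may differ.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "worker-gpu-1"},
         "status": {"nodeInfo": {"osImage": "Red Hat Enterprise Linux Server 7.9 (Maipo)"}}},
        {"metadata": {"name": "worker-2"},
         "status": {"nodeInfo": {"osImage": "Red Hat Enterprise Linux CoreOS 46.82"}}},
    ]
})

for name, os_image in non_coreos_nodes(sample):
    print(name, os_image, sep="\t")
```

Against a real cluster you would save the output of `oc get nodes -o json` to a file and pass its contents to `non_coreos_nodes` in place of `sample`.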

Good Luck, Tom


github-actions[bot] commented 6 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.