NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

rhel 8.x support for GPU operator #291

Open prpaul opened 2 years ago

prpaul commented 2 years ago

I wanted to check whether RHEL 8.2 is supported by GPU Operator 1.9.0.

If it is not supported, in which version can we expect RHEL 8.2 support, and when?

shivamerla commented 2 years ago

@prpaul No, we don't support RHEL 8.x worker nodes, only CoreOS. There is no plan to support RHEL worker nodes in the short term.

tusharrobin commented 2 years ago

@shivamerla So if there is no planned support or roadmap, what is the alternative to the GPU Operator in the field?

Most of the production deployments we have seen run RHEL 8, so how would you suggest deploying on Kubernetes?

shivamerla commented 2 years ago

@tusharrobin are you referring to RHEL worker nodes in OCP or using upstream K8s?

On OCP, you can still use the GPU Operator, but you need to build a private driver container from here: https://gitlab.com/nvidia/container-images/driver/-/tree/master/rhel8 and reference it when installing the GPU Operator. Alternatively, the driver can be installed directly on the RHEL nodes, passing driver.enabled=false when installing the GPU Operator.

With upstream K8s, apart from the driver itself, you need to make sure the ubi8 variants of the GPU Operator component images are installed using Helm:

helm install gpu-operator nvidia/gpu-operator --version=1.9.0 --set operator.defaultRuntime=crio,toolkit.version=1.7.2-ubi8,dcgmExporter.version=2.3.1-2.6.0-ubi8,dcgm.version=2.3.1-ubi8,migManager.version=v0.2.0-ubi8

Also add --set driver.enabled=false when the driver is pre-installed on each RHEL node.

Note, however, that this configuration will not be officially qualified or supported by the GPU Operator.
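The options above can be assembled programmatically, which helps avoid mistakes in the long comma-separated --set list. The following is a minimal sketch, not an official install script: the function names build_set_values and build_helm_command are hypothetical, the image tags are simply the ones quoted in this comment, and the script only prints the command rather than running it. Passing an empty runtime omits operator.defaultRuntime so the chart default applies.

```shell
#!/bin/sh
# Sketch only: prints the helm command instead of executing it.
# Function names are hypothetical; version tags are copied from this thread.

# build_set_values RUNTIME
# Prints the comma-separated value list for a single --set flag.
build_set_values() {
  # Only emit operator.defaultRuntime when a runtime name is given.
  if [ -n "$1" ]; then
    printf 'operator.defaultRuntime=%s,' "$1"
  fi
  printf '%s,%s,%s,%s' \
    'toolkit.version=1.7.2-ubi8' \
    'dcgmExporter.version=2.3.1-2.6.0-ubi8' \
    'dcgm.version=2.3.1-ubi8' \
    'migManager.version=v0.2.0-ubi8'
}

# build_helm_command RUNTIME DRIVER_PREINSTALLED [CHART_VERSION]
# DRIVER_PREINSTALLED is "true" when the driver is already on each node.
build_helm_command() {
  printf 'helm install gpu-operator nvidia/gpu-operator --version=%s --set %s' \
    "${3:-1.9.0}" "$(build_set_values "$1")"
  if [ "$2" = "true" ]; then
    printf ' --set driver.enabled=false'
  fi
  printf '\n'
}

# Example: cri-o runtime, driver pre-installed on the RHEL nodes.
build_helm_command crio true
```

Review the printed command before running it; the third argument lets you substitute a different chart version string without editing the function.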

tusharrobin commented 2 years ago

@shivamerla Even with

helm install gpu-operator nvidia/gpu-operator --version=1.9.0 --set operator.defaultRuntime=containerd,toolkit.version=1.7.2-ubi8,dcgmExporter.version=2.3.1-2.6.0-ubi8,dcgm.version=2.3.1-ubi8,migManager.version=v0.2.0-ubi8 --set driver.enabled=false

I am still seeing

Warning FailedCreatePodSandBox 1s (x2 over 12s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = Exception calling application: ErrorUnknown:StatusCode.UNKNOWN:RuntimeHandler "nvidia" not supported

I see the runtime class is available, though:

[root@priyanko-bnp-mig1 gpu-operator]# kubectl get runtimeclass
NAME     HANDLER   AGE
nvidia   nvidia    71s

This is upstream Kubernetes.

shivamerla commented 2 years ago

@tusharrobin Can you show the status of all pods? The container toolkit pod has to be running for the nvidia runtime to be configured with containerd. Also, there was a typo in the version earlier: it should be v1.9.0 with helm install. Based on the command you mentioned, I am assuming the driver is pre-installed?
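One way to check whether the nvidia runtime has actually been registered with containerd is to look for its table in the containerd config on the node. The helper below is a rough sketch under stated assumptions: has_nvidia_runtime is a hypothetical name, /etc/containerd/config.toml is assumed as the default path (some distributions keep the file elsewhere), and the runtime is assumed to appear as a table whose header ends in runtimes.nvidia.

```shell
#!/bin/sh
# Sketch only. To list pod status across namespaces (run manually):
#   kubectl get pods --all-namespaces

# has_nvidia_runtime [CONFIG_PATH]
# Returns 0 if the containerd config has a table header ending in
# "runtimes.nvidia]", non-zero if it is absent or the file is unreadable.
has_nvidia_runtime() {
  grep -Eqs '^[[:space:]]*\[.*runtimes\.nvidia\]' "${1:-/etc/containerd/config.toml}"
}

if has_nvidia_runtime; then
  echo "nvidia runtime is present in the containerd config"
else
  echo "nvidia runtime not found; check that the container toolkit pod is running"
fi
```

If the table is missing, the RuntimeClass object can still exist in the cluster while the node's runtime rejects the "nvidia" handler, which matches the sandbox error reported above.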

tusharrobin commented 2 years ago

@shivamerla I was able to install with these options after removing the defaultRuntime option, since I am using Docker. Thanks for all your help!

Is there a reason that RHEL 8 is not in GPU operator's roadmap? Since most of the deployments are moving to RHEL8/Rocky8, why would you not consider that as one of the supported platforms?

yug0slav commented 2 years ago

FYI NVAIE says it supports rhel8.4 on 1.9.1 operator, huh? :)

MrBoJo84 commented 2 years ago

@tusharrobin we are looking into support for additional operating systems. Do I understand correctly that you use k8s with RHEL 8 and containerd?

@yug0slav Please note that NVIDIA AI Enterprise supports RHEL 8.4 to run containers but without k8s. NVIDIA AI Enterprise doesn't support RHEL worker nodes with the GPU Operator.

tusharrobin commented 2 years ago

Yes, we need GPU operator support for RHEL 8 and Rocky 8.

relyt0925 commented 2 years ago

IBM Cloud OpenShift also needs support for RHEL 8.

KodieGlosserIBM commented 2 years ago

@MrBoJo84 To add, IBM Cloud OpenShift uses CRI-O as its container runtime.

snirkatriel commented 2 years ago

Hi. Most corporations are using RHEL 8.x, and will move to 9.x in the near future. We're currently struggling to install nvidia-driver in air-gapped environments, and gpu-operator would be the complete solution for us. I think adding RHEL to the support matrix would be very useful and necessary.

MrBoJo84 commented 2 years ago

@snirkatriel It would help if you could share the exact stack that you are looking for support on. Is it with OpenShift or Kubernetes? Is it with containerd or CRI-O? Which versions?

snirkatriel commented 2 years ago

Sure. We're using Kubernetes (k3s) with the containerd runtime, and we're looking at RHEL 8.3, 8.4, 8.6, and so on.