NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

K8s cluster with two GPU nodes with CentOS 7, CentOS 8 #311

Open sricharanrobinsystems opened 2 years ago

sricharanrobinsystems commented 2 years ago

I have a 6-node Kubernetes cluster with GPU Operator 1.9 installed. Two of the nodes are GPU servers on AWS: one EC2 p2 instance and one EC2 p3 instance. I installed CentOS 7 on the p2 instance and CentOS 8 on the p3 instance.

Can the GPU Operator work with both OS versions?

[root@ip-172-31-15-125 ~]# helm list -A
NAME                     NAMESPACE     REVISION  UPDATED                                  STATUS    CHART                APP VERSION
gpu-operator-1642901166  gpu-operator  1         2022-01-23 01:26:09.453530643 +0000 UTC  deployed  gpu-operator-v1.9.1  v1.9.1
[root@ip-172-31-15-125 ~]#

For CentOS 8, I used the Helm command below:

`helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set toolkit.version=1.7.1-centos8`

Running the above command on a single-node CentOS 8 Kubernetes cluster gives the following result:

[root@ip-172-31-15-125 ~]# kubectl get pods -A |grep gpu
gpu-operator  gpu-feature-discovery-mxv6k                                       0/1  Init:0/1  0  36m
gpu-operator  gpu-operator-1642901166-node-feature-discovery-master-6d66bkp5j  1/1  Running   0  41m
gpu-operator  gpu-operator-1642901166-node-feature-discovery-worker-jdhms      1/1  Running   0  41m
gpu-operator  gpu-operator-84b88fc49c-dhx8l                                     1/1  Running   0  41m
gpu-operator  nvidia-container-toolkit-daemonset-jmr6t                          1/1  Running   0  36m
gpu-operator  nvidia-dcgm-exporter-nt6ng                                        0/1  Init:0/1  0  36m
gpu-operator  nvidia-device-plugin-daemonset-xt2d4                              0/1  Init:0/1  0  36m
gpu-operator  nvidia-driver-daemonset-5tdfx                                     1/1  Running   0  36m
gpu-operator  nvidia-operator-validator-dh95t                                   0/1  Init:0/4  0  36m
[root@ip-172-31-15-125 ~]#
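For anyone triaging a similar listing, here is a minimal Python sketch that picks out the pods that are not fully ready. It assumes the usual `kubectl get pods -A` column order (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE). The names `PodStatus`, `parse_pods`, and `not_ready` are made up for this example and are not part of any NVIDIA or Kubernetes tooling.

```python
from typing import List, NamedTuple


class PodStatus(NamedTuple):
    namespace: str
    name: str
    ready: int    # containers currently ready
    desired: int  # containers expected to be ready
    status: str


def parse_pods(output: str) -> List[PodStatus]:
    """Parse text shaped like `kubectl get pods -A` output (assumed column order)."""
    pods = []
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, shell prompts, and the header row if present.
        if len(fields) < 6 or fields[0] == "NAMESPACE":
            continue
        namespace, name, ready_col, status = fields[:4]
        ready, desired = (int(part) for part in ready_col.split("/"))
        pods.append(PodStatus(namespace, name, ready, desired, status))
    return pods


def not_ready(pods: List[PodStatus]) -> List[PodStatus]:
    """Return pods whose ready count is below the desired count."""
    return [pod for pod in pods if pod.ready < pod.desired]


# Pod listing copied from the output posted above.
SAMPLE = """\
gpu-operator gpu-feature-discovery-mxv6k 0/1 Init:0/1 0 36m
gpu-operator gpu-operator-1642901166-node-feature-discovery-master-6d66bkp5j 1/1 Running 0 41m
gpu-operator gpu-operator-1642901166-node-feature-discovery-worker-jdhms 1/1 Running 0 41m
gpu-operator gpu-operator-84b88fc49c-dhx8l 1/1 Running 0 41m
gpu-operator nvidia-container-toolkit-daemonset-jmr6t 1/1 Running 0 36m
gpu-operator nvidia-dcgm-exporter-nt6ng 0/1 Init:0/1 0 36m
gpu-operator nvidia-device-plugin-daemonset-xt2d4 0/1 Init:0/1 0 36m
gpu-operator nvidia-driver-daemonset-5tdfx 1/1 Running 0 36m
gpu-operator nvidia-operator-validator-dh95t 0/1 Init:0/4 0 36m
"""

if __name__ == "__main__":
    for pod in not_ready(parse_pods(SAMPLE)):
        print(f"{pod.namespace}/{pod.name}: {pod.ready}/{pod.desired} ({pod.status})")
```

On the listing above, this flags the four pods that are still in an `Init` state (feature discovery, DCGM exporter, device plugin, and the validator), while the driver and container toolkit daemonsets report as running.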

sricharanrobinsystems commented 2 years ago

image

shivamerla commented 2 years ago

@sricharanrobinsystems We currently don't support clusters with mixed OS distributions/versions. Also, we support only CentOS 7. Is this output from the CentOS 8 node?
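As a rough way to check whether a cluster falls into the mixed-OS case, the sketch below groups nodes by the OS labels that Node Feature Discovery normally publishes. It assumes the input is the JSON from `kubectl get nodes -o json` and that the nodes carry the `feature.node.kubernetes.io/system-os_release.ID` and `feature.node.kubernetes.io/system-os_release.VERSION_ID` labels. The function names and the node names in the sample are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple

# Labels normally set by Node Feature Discovery (assumed to be present on the nodes).
OS_ID_LABEL = "feature.node.kubernetes.io/system-os_release.ID"
OS_VERSION_LABEL = "feature.node.kubernetes.io/system-os_release.VERSION_ID"


def group_nodes_by_os(node_list_json: str) -> Dict[Tuple[str, str], List[str]]:
    """Group node names by (OS ID, OS version) from `kubectl get nodes -o json` output."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for item in json.loads(node_list_json).get("items", []):
        metadata = item.get("metadata", {})
        labels = metadata.get("labels") or {}
        key = (labels.get(OS_ID_LABEL, "unknown"), labels.get(OS_VERSION_LABEL, "unknown"))
        groups[key].append(metadata.get("name", "<unnamed>"))
    return dict(groups)


def is_mixed(groups: Dict[Tuple[str, str], List[str]]) -> bool:
    """True when the nodes report more than one OS distribution/version."""
    return len(groups) > 1


# Hypothetical node list mirroring the setup in this issue: one CentOS 7 and one CentOS 8 node.
SAMPLE_NODES = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node-p2",
                      "labels": {OS_ID_LABEL: "centos", OS_VERSION_LABEL: "7"}}},
        {"metadata": {"name": "gpu-node-p3",
                      "labels": {OS_ID_LABEL: "centos", OS_VERSION_LABEL: "8"}}},
    ]
})

if __name__ == "__main__":
    groups = group_nodes_by_os(SAMPLE_NODES)
    for (os_id, version), names in sorted(groups.items()):
        print(f"{os_id} {version}: {', '.join(names)}")
    print("mixed OS cluster" if is_mixed(groups) else "single OS cluster")
```

In practice the JSON could be piped in from `kubectl get nodes -o json`. Nodes that lack the labels are grouped under `unknown` rather than being skipped.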

yug0slav commented 2 years ago

Meaning the GPU Operator requires taking the entire cluster down for an OS upgrade/migration from 7 to 8? Sweeeet...