dell / omnia

An open-source toolkit for deploying and managing high performance clusters for HPC, AI, and data analytics workloads.
https://omnia-doc.readthedocs.io/en/latest/index.html
Apache License 2.0
216 stars, 112 forks

k8s-device-plugin not deployed #2151

Open j0hnL opened 1 year ago

j0hnL commented 1 year ago

Describe the bug: When a k8s-manager does not have a GPU, Omnia will not deploy the k8s-device-plugin. We need to inspect the entire inventory for GPUs before deploying the plugin. I suggest we also taint or label any compute nodes that do not have GPUs, because NVIDIA's plugin does not check. The AMD plugin seems to deploy just fine whether there are AMD accelerators or not.
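The inventory-wide check suggested above could be sketched roughly as follows. This is not Omnia's implementation: the `inventory` mapping (hostname to detected GPU count) and both function names are hypothetical, and how the GPU counts are gathered is left out.

```python
from typing import Dict, List


def should_deploy_gpu_plugin(inventory: Dict[str, int]) -> bool:
    """Return True if any node in the inventory reports at least one GPU.

    `inventory` is a hypothetical mapping of hostname -> GPU count. The
    decision deliberately ignores which node is the k8s manager.
    """
    return any(count > 0 for count in inventory.values())


def nodes_without_gpus(inventory: Dict[str, int]) -> List[str]:
    """Return the sorted hostnames that report no GPUs."""
    return sorted(host for host, count in inventory.items() if count <= 0)


if __name__ == "__main__":
    # Example data only: the manager has no GPU, one compute node does.
    example = {"k8s-manager": 0, "compute-01": 2, "compute-02": 0}
    print(should_deploy_gpu_plugin(example))
    print(nodes_without_gpus(example))
```

The point of the sketch is that the deploy decision depends on the whole inventory, so a GPU-less manager no longer blocks the plugin, and the second function yields the candidates for labeling or tainting.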

naresh3774 commented 7 months ago

This is what I think:

Identify Nodes without GPUs: You need a mechanism to determine which compute nodes in your Kubernetes cluster do not have GPUs available. This can be done through manual inspection or automated scripts that query node specifications.

Node Labeling: Once you identify nodes without GPUs, apply labels to them using kubectl label nodes <node-name> <label-key>=<label-value>. For example, you can label nodes without GPUs as gpu-enabled=false.

Node Tainting: Apply taints to nodes without GPUs to repel workloads that require GPUs. A taint keeps any pod that lacks a matching toleration from being scheduled on these nodes. Use kubectl taint nodes <node-name> <key>=<value>:<effect> to apply taints. For instance, you can use a taint like gpu-accelerator=false:NoSchedule.

Configure Workloads: Ensure that GPU-dependent workloads have node selectors that consider GPU availability and do not tolerate these taints. Workloads that should still run on nodes without GPUs need tolerations for the taints in their Pod specification.
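The labeling and tainting steps above could be scripted along these lines. It is a sketch only: the function names are hypothetical, the label and taint keys are the examples from this comment, and the commands are built as argument lists rather than executed.

```python
from typing import Dict, Iterable, List


def label_command(node: str, key: str = "gpu-enabled", value: str = "false") -> List[str]:
    """Build `kubectl label nodes <node> <key>=<value>`; --overwrite allows re-runs."""
    return ["kubectl", "label", "nodes", node, f"{key}={value}", "--overwrite"]


def taint_command(
    node: str,
    key: str = "gpu-accelerator",
    value: str = "false",
    effect: str = "NoSchedule",
) -> List[str]:
    """Build `kubectl taint nodes <node> <key>=<value>:<effect>`."""
    return ["kubectl", "taint", "nodes", node, f"{key}={value}:{effect}", "--overwrite"]


def toleration(
    key: str = "gpu-accelerator",
    value: str = "false",
    effect: str = "NoSchedule",
) -> Dict[str, str]:
    """Entry for a Pod spec's `tolerations` list that matches the taint above.

    Pods need this toleration to be scheduled onto the tainted nodes.
    """
    return {"key": key, "operator": "Equal", "value": value, "effect": effect}


def plan_commands(nodes: Iterable[str]) -> List[List[str]]:
    """Return one label command and one taint command per GPU-less node."""
    commands: List[List[str]] = []
    for node in nodes:
        commands.append(label_command(node))
        commands.append(taint_command(node))
    return commands


if __name__ == "__main__":
    # Hypothetical node names; print the commands instead of running them.
    for cmd in plan_commands(["compute-02", "compute-03"]):
        print(" ".join(cmd))
```

To apply the plan, each argument list could be passed to `subprocess.run(cmd, check=True)` on a host with cluster-admin access to kubectl.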

abhishek-sa1 commented 4 months ago

This issue is fixed with PR #2238.

@sujit-jadhav @j0hnL can we close this issue?