NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

Nvidia operator doesn't schedule required nvidia driver pods on the new GPU machines #397

Open arpitsharma-vw opened 2 years ago

arpitsharma-vw commented 2 years ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

We have an OpenShift cluster with NFD and the NVIDIA GPU Operator installed.
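
As a quick sanity check, it may help to confirm that NFD has labeled the new node; this is a minimal sketch, and `<new-gpu-node>` is a placeholder for the node created by the MachineSet:

```sh
# Check that NFD labeled the new node with the NVIDIA PCI vendor (0x10de).
oc get node <new-gpu-node> --show-labels | tr ',' '\n' | grep -i -e 10de -e nvidia
```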

1. Issue or feature description

When I add a new type of GPU node to the cluster (via MachineSets), the NVIDIA operator doesn't schedule the required NVIDIA driver pods on the new machines. I have to manually edit the daemonsets to add a toleration for the new GPU instance type. Are there any specific settings I need to configure for this? If I manually add the tolerations to the daemonsets, the pods do get scheduled on the new GPU nodes.
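
For reference, the manual workaround described above looks roughly like the sketch below. The taint key `gpu-instance-type`, the DaemonSet name, and the namespace are examples and may differ per cluster; also note that edits made directly to operator-managed DaemonSets are typically reverted when the operator reconciles them.

```sh
# Append a toleration matching the taint on the new GPU nodes to the driver
# DaemonSet (names and the taint key below are examples, not operator defaults).
oc -n nvidia-gpu-operator patch daemonset nvidia-driver-daemonset --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/tolerations/-",
   "value": {"key": "gpu-instance-type", "operator": "Exists", "effect": "NoSchedule"}}
]'
```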

2. Steps to reproduce the issue

- Add new GPU nodes via a MachineSet.
- Check which operator pods are scheduled on the new GPU machines (a quick check is sketched below).
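
A minimal check, assuming the operands run in the `nvidia-gpu-operator` namespace (adjust the namespace and node name to your install):

```sh
# List pods placed on one of the new GPU nodes; with the default tolerations,
# no driver pods appear there, which is the behavior reported above.
oc get pods -n nvidia-gpu-operator -o wide --field-selector spec.nodeName=<new-gpu-node>
```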

3. Information to attach (optional if deemed irrelevant)

shivamerla commented 2 years ago

@arpitsharma-hexad You can deploy the operator with custom tolerations using the `daemonsets.tolerations` parameter. Defaults here
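
A minimal sketch of that suggestion, assuming the ClusterPolicy is named `gpu-cluster-policy` and the taint key on the new nodes is `gpu-instance-type` (both placeholders). The merge patch replaces the whole list, so the default `nvidia.com/gpu` toleration from the chart defaults is repeated here:

```sh
# Add a custom toleration through the operator itself so it is propagated to
# all operand DaemonSets instead of being patched onto each one by hand.
oc patch clusterpolicy gpu-cluster-policy --type=merge -p '{
  "spec": {
    "daemonsets": {
      "tolerations": [
        {"key": "nvidia.com/gpu",    "operator": "Exists", "effect": "NoSchedule"},
        {"key": "gpu-instance-type", "operator": "Exists", "effect": "NoSchedule"}
      ]
    }
  }
}'
```

On Helm-managed installs, the equivalent is setting `daemonsets.tolerations` in the chart values.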