Open · guyst16 opened 1 year ago
@guyst16 can you attach the gpu-operator pod logs to confirm whether gpu-operator is triggering the un-cordon of the node? Also, can you try with `driver.upgradePolicy.autoUpgrade=false` in ClusterPolicy and verify whether the same behavior still occurs?
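A hedged sketch of both suggestions, assuming a default GPU operator install: the `gpu-operator` namespace, the `app=gpu-operator` pod label, and the ClusterPolicy instance name `cluster-policy` are assumptions, not details taken from this issue:

```sh
# Collect gpu-operator pod logs (assumed namespace and label selector)
kubectl logs -n gpu-operator -l app=gpu-operator --tail=500 > gpu-operator.log

# Turn off the driver auto-upgrade controller in the ClusterPolicy
# ("cluster-policy" is the assumed default instance name)
kubectl patch clusterpolicies.nvidia.com cluster-policy --type merge \
  -p '{"spec": {"driver": {"upgradePolicy": {"autoUpgrade": false}}}}'
```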
1. Quick Debug Checklist

- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes? (see the check after this list)
- [ ] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
- [x] GPU operator v22.9.1
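One possible way to verify the kernel-module item from the checklist above, run directly on each GPU node (plain `lsmod`, nothing specific to the GPU operator):

```sh
# Confirm that i2c_core and ipmi_msghandler are loaded
lsmod | grep -E 'i2c_core|ipmi_msghandler'
```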
1. Issue or feature description
I tried to cordon a GPU worker node, but it was immediately un-cordoned again. I followed this closed issue and relabeled the node with `nvidia.com/gpu.deploy.driver: 'false'`; the driver pod was terminated, but the node was still un-cordoned every time.

Cluster policy:
Node info:
Nvidia pods:
2. Steps to reproduce the issue
Mark the node as unschedulable:
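A minimal sketch of this step, assuming a node named `gpu-worker-1` (placeholder name, not from the original report):

```sh
# Cordon the node so no new pods get scheduled on it
kubectl cordon gpu-worker-1

# Watch the node; per the report above, it returns to schedulable
# shortly afterwards instead of staying in SchedulingDisabled
kubectl get node gpu-worker-1 --watch
```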