Closed age9990 closed 4 months ago
@age9990 thanks for reporting this issue.
I have a fix out for this here: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/1083
GPU Operator 24.6.0 has been released and contains the fix for this issue.
1. Quick Debug Information
2. Issue or feature description
When GDRCopy is enabled in the NVIDIA driver CR, the driver daemonset is not updated and the following error appears in the GPU Operator pod log:
{"level":"error","ts":"2024-05-03T06:29:33.398Z","msg":"Error while syncing state","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"a902d530-65d4-480e-8157-0e0c21d0a332","error":"failed to create k8s objects from manifests: failed to render kubernetes manifests: error rendering file /opt/gpu-operator/manifests/state-driver/0500_daemonset.yaml: failed to unmarshal manifest /opt/gpu-operator/manifests/state-driver/0500_daemonset.yaml: error converting YAML to JSON: yaml: line 195: did not find expected key"}
Looking at this file, the indentation is incorrect: lines L493 to L496 are missing two spaces. https://github.com/NVIDIA/gpu-operator/blob/0fe1e8db32b05ddab8bbd4d5bcc3f492b75cfee4/manifests/state-driver/0500_daemonset.yaml#L478-L498
Once I fixed the indentation and rebuilt the image, GDRCopy could be enabled without errors.
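
For anyone who wants to apply the same kind of fix locally, here is a minimal sketch of the re-indentation step in Python. The helper name `indent_lines` and the YAML fragment are hypothetical and only illustrate the idea; the fragment is not the real content of `0500_daemonset.yaml`, and the actual fix is the one in the merge request linked above.

```python
def indent_lines(text: str, first: int, last: int, spaces: int = 2) -> str:
    """Prefix lines first..last (1-indexed, inclusive) with `spaces` spaces."""
    lines = text.split("\n")
    if not (1 <= first <= last <= len(lines)):
        raise ValueError("line range is outside the text")
    pad = " " * spaces
    for i in range(first - 1, last):
        if lines[i].strip():  # leave blank lines untouched
            lines[i] = pad + lines[i]
    return "\n".join(lines)


# Hypothetical fragment (NOT the real manifest): the last line is
# under-indented by two spaces relative to its sibling key "name".
BROKEN = "\n".join([
    "containers:",
    "  - name: example",
    "    env:",
    "    - name: A",
    "    value: x",
])

# Shift only line 5 right by two spaces so "value" lines up with "name".
FIXED = indent_lines(BROKEN, 5, 5)
print(FIXED)
```

For the file in this issue the call would be along the lines of `indent_lines(manifest_text, 493, 496)`, assuming the line numbers still match the commit linked above. The image still has to be rebuilt afterwards, since the operator reads the manifest from `/opt/gpu-operator/manifests/` inside the container.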