NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

Cannot enable GDRcopy using Nvidia driver CRD due to wrong indentation in 0500_daemonset.yaml #713

Closed age9990 closed 4 months ago

age9990 commented 5 months ago

1. Quick Debug Information

2. Issue or feature description

When GDRCopy is enabled in the NVIDIA driver CR, the driver DaemonSet is not updated and the following error appears in the GPU Operator pod log:

{"level":"error","ts":"2024-05-03T06:29:33.398Z","msg":"Error while syncing state","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"a902d530-65d4-480e-8157-0e0c21d0a332","error":"failed to create k8s objects from manifests: failed to render kubernetes manifests: error rendering file /opt/gpu-operator/manifests/state-driver/0500_daemonset.yaml: failed to unmarshal manifest /opt/gpu-operator/manifests/state-driver/0500_daemonset.yaml: error converting YAML to JSON: yaml: line 195: did not find expected key"}

Looking at this file, the indentation is incorrect: L493 to L496 are each missing two spaces. https://github.com/NVIDIA/gpu-operator/blob/0fe1e8db32b05ddab8bbd4d5bcc3f492b75cfee4/manifests/state-driver/0500_daemonset.yaml#L478-L498

Once I fixed the indentation and rebuilt the image, GDRCopy could be enabled without error.
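
For anyone hitting a similar parse failure, here is a minimal sketch of how a list item that is missing two spaces of indentation breaks YAML parsing. It assumes the third-party PyYAML package is installed. The `GOOD` and `BAD` snippets and the `first_yaml_error_line` helper are hypothetical illustrations, not the actual contents of 0500_daemonset.yaml. The real manifest is a Go template, so it has to be rendered before it can be parsed, and the operator's Go YAML library words its error differently from PyYAML.

```python
from typing import Optional

import yaml  # third-party PyYAML, assumed to be installed

# Hypothetical, correctly indented fragment: both list items sit under the key.
GOOD = """\
volumeMounts:
  - name: run-nvidia
    mountPath: /run/nvidia
  - name: extra
    mountPath: /extra
"""

# Same fragment with the last item shifted two spaces to the left.
# The parser closes the list and then expects a mapping key, not "-".
BAD = """\
volumeMounts:
  - name: run-nvidia
    mountPath: /run/nvidia
- name: extra
  mountPath: /extra
"""


def first_yaml_error_line(text: str) -> Optional[int]:
    """Return the 1-based line of the first YAML syntax error, or None if the text parses."""
    try:
        yaml.safe_load(text)
    except yaml.MarkedYAMLError as exc:
        mark = exc.problem_mark
        # PyYAML marks are 0-based; fall back to 0 if no position was recorded.
        return mark.line + 1 if mark is not None else 0
    return None


if __name__ == "__main__":
    print("good:", first_yaml_error_line(GOOD))
    print("bad: ", first_yaml_error_line(BAD))
```

Running the same kind of check against a rendered manifest before building the image is one way to catch this class of mistake early.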

cdesiniotis commented 4 months ago

@age9990 thanks for reporting this issue.

cdesiniotis commented 4 months ago

I have a fix out for this here: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/1083

cdesiniotis commented 2 months ago

GPU Operator 24.6.0 has been released and contains the fix for this issue.