NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Kubernetes roles are continuously created #354

Open lemaral opened 2 years ago

lemaral commented 2 years ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

1. Issue or feature description

According to the audit log, Kubernetes Roles (nvidia-driver, nvidia-mig-manager, nvidia-operator-validator, etc.) seem to be continuously created; see the audit-log sketch below.
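A minimal sketch of how the churn shows up in the API server audit log; the log path and jq filter are assumptions and depend on the cluster's audit configuration:

```sh
# Hypothetical audit log path; adjust to the API server's --audit-log-path.
# Lists create/update events against Role objects whose names contain "nvidia".
jq -r 'select(.objectRef.resource == "roles" and (.verb == "create" or .verb == "update"))
       | "\(.stageTimestamp) \(.verb) \(.objectRef.namespace)/\(.objectRef.name)"' \
  /var/log/kubernetes/audit/audit.log | grep nvidia | tail -n 20
```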

2. Steps to reproduce the issue

Install gpu-operator with the Helm chart (into the kube-system namespace); a sketch of the commands follows.
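For reference, an install along these lines (commands per the NVIDIA docs; the chart version and any custom values used here are not given in the report):

```sh
# Add the NVIDIA Helm repository and install the GPU Operator chart.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
# Installed into kube-system to match this report; the documented default is a
# dedicated namespace (e.g. -n gpu-operator --create-namespace).
helm install --wait gpu-operator nvidia/gpu-operator -n kube-system
```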

3. Information to attach (optional if deemed irrelevant)

shivamerla commented 2 years ago

@lemaral we had to update each resource to support upgrade use cases. This usually happens when reconciliation is triggered frequently, and it should settle once the drivers are loaded and all pods are running. Are you seeing this even after all pods are in a good state?
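One way to verify (a sketch; the namespace depends on where the chart was installed) is to watch whether the Roles' resourceVersion keeps climbing while the pods stay healthy:

```sh
# If resourceVersion keeps increasing while all pods are Running,
# the operator is still rewriting the Roles on every reconcile.
kubectl get roles -n kube-system \
  -o custom-columns=NAME:.metadata.name,RESOURCE_VERSION:.metadata.resourceVersion -w
```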

lemaral commented 2 years ago

@shivamerla thank you for your reply. Yes, it never stops, although the GPU Operator itself is working perfectly. I noticed it through Falco and had to disable the relevant default rule to stop the flood of alerts. I believe it also puts some load on etcd.
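For anyone hitting the same alert flood, a sketch of a local rules override, assuming Falco's Kubernetes audit ruleset is in use; the exact rule name and file path vary by Falco version:

```sh
# Append an override to the local Falco rules file (path and rule name are
# assumptions; check your Falco version's k8s audit ruleset for the exact name).
cat >> /etc/falco/falco_rules.local.yaml <<'EOF'
- rule: K8s Role/Clusterrole Created
  enabled: false
EOF
```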

shivamerla commented 2 years ago

Can you paste the output of `kubectl get pods -n gpu-operator` and the latest logs of the operator pod?
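Something along these lines collects both (the namespace and label selector are assumptions; this install went into kube-system, and the operator pod's labels can vary by chart version):

```sh
# Pod status for the operator and its operands (use -n kube-system for this install).
kubectl get pods -n gpu-operator
# Recent operator logs; the app=gpu-operator selector is an assumption,
# targeting the operator deployment's pod directly also works.
kubectl logs -n gpu-operator -l app=gpu-operator --tail=200
```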