lemaral opened this issue 2 years ago (status: Open)
@lemaral we had to update each resource to support upgrade use cases. This usually happens when reconciliation is triggered frequently, and it should settle once the drivers are loaded and all pods are running. Do you see this happening even after all pods are in a good state?
@shivamerla thank you for your reply. Yes, it never stops, although the gpu-operator itself is working perfectly. I am seeing this through Falco and had to disable the relevant default rule to stop the flood. I believe it puts some load on etcd as well.
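For reference, if the Falco noise needs to be suppressed while this is investigated, one option is a local rule override. This is only a sketch: it assumes the flood comes from the k8s audit rule named K8s Role/Clusterrole Created and that a drop-in directory such as /etc/falco/rules.d/ is loaded after the default rules; verify both against your Falco install.

```sh
# Sketch: disable the assumed noisy rule via a drop-in override file.
# The rule name and the rules.d path are assumptions; check your ruleset first.
cat <<'EOF' | sudo tee /etc/falco/rules.d/disable-k8s-role-created.yaml
- rule: K8s Role/Clusterrole Created
  enabled: false
EOF
# Restart Falco (or the Falco DaemonSet pods) to pick up the override.
```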
Can you paste the output of kubectl get pods -n gpu-operator and the last logs of the operator pod?
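Something along these lines should capture both (the namespace and label selector are assumptions; adjust them to where the operator is actually installed, which this report says is kube-system):

```sh
kubectl get pods -n gpu-operator
# Operator pod logs; the app=gpu-operator label is assumed from the Helm chart defaults.
kubectl logs -n gpu-operator -l app=gpu-operator --tail=200
```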
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
[ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes? Yes
[ ] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)? Yes

1. Issue or feature description
According to the audit log, Kubernetes Roles seem to be continuously created: nvidia-driver, nvidia-mig-manager, nvidia-operator-validator, etc.
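One way to confirm the churn independently of the audit log is to watch the Role objects themselves; a sketch, assuming the operator resources live in the install namespace (kube-system in this report):

```sh
# Each change to a watched Role prints a new line; constant output suggests continuous reconciliation.
kubectl get roles -n kube-system -w | grep -E 'nvidia-(driver|mig-manager|operator-validator)'
# Or check whether resourceVersion keeps increasing between runs.
kubectl get role nvidia-driver -n kube-system -o jsonpath='{.metadata.resourceVersion}{"\n"}'
```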
2. Steps to reproduce the issue
Install gpu-operator with the Helm chart (in the kube-system namespace)
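For completeness, a sketch of that install step using NVIDIA's public Helm repository (the release name and any chart values are assumptions; the namespace follows this report):

```sh
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install into kube-system, as described in this issue.
helm install gpu-operator nvidia/gpu-operator -n kube-system
```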
3. Information to attach (optional if deemed irrelevant)
[ ] kubernetes pods status:
kubectl get pods --all-namespaces
[ ] kubernetes daemonset status:
kubectl get ds --all-namespaces
[ ] If a pod/ds is in an error state or pending state
kubectl describe pod -n NAMESPACE POD_NAME
[ ] If a pod/ds is in an error state or pending state
kubectl logs -n NAMESPACE POD_NAME
[ ] Output of running a container on the GPU machine:
docker run -it alpine echo foo
[ ] Docker configuration file:
cat /etc/docker/daemon.json
[ ] Docker runtime configuration:
docker info | grep runtime
[ ] NVIDIA shared directory:
ls -la /run/nvidia
[ ] NVIDIA packages directory:
ls -la /usr/local/nvidia/toolkit
[ ] NVIDIA driver directory:
ls -la /run/nvidia/driver
[ ] kubelet logs
journalctl -u kubelet > kubelet.logs
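A convenience sketch that gathers the items above in one pass (paths and commands are taken from the checklist; run it on a GPU node with kubectl access, and drop anything that does not apply):

```sh
#!/bin/sh
# Collect the diagnostics listed in the checklist into ./gpu-operator-debug/.
mkdir -p gpu-operator-debug && cd gpu-operator-debug || exit 1
kubectl get pods --all-namespaces        > pods.txt
kubectl get ds --all-namespaces          > daemonsets.txt
docker info | grep -i runtime            > docker-runtime.txt
cat /etc/docker/daemon.json              > docker-daemon.json 2>&1
ls -la /run/nvidia                       > run-nvidia.txt 2>&1
ls -la /usr/local/nvidia/toolkit         > nvidia-toolkit.txt 2>&1
ls -la /run/nvidia/driver                > nvidia-driver.txt 2>&1
journalctl -u kubelet                    > kubelet.logs
```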