enix / kube-image-keeper

kuik is a container image caching system for Kubernetes

Controller crashes as soon as it starts #371

Open AtharvaBapat-TomTom opened 2 months ago

AtharvaBapat-TomTom commented 2 months ago

I am trying to install kube-image-keeper using the Helm chart, but I am running into an error where the controller shuts down within a few minutes of starting. Here are the last few lines of the debug logs:

2024-07-26T06:27:32.614Z    ERROR   Reconciler error    {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"change-generation-workflow-trigger-map-matchers-parent-zsculf9d","namespace":"airflow"}, "namespace": "airflow", "name": "change-generation", "reconcileID": "e9b1e732-2f09-48ee-826b-2e7b811af12a", "error": "client rate limiter Wait returned an error: context canceled"}
2024-07-26T06:27:32.616Z    INFO    All workers finished    {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod"}
2024-07-26T06:27:32.616Z    INFO    Stopping and waiting for caches
2024-07-26T06:27:32.616Z    INFO    Stopping and waiting for webhooks
2024-07-26T06:27:32.624Z    INFO    controller-runtime.webhook  Shutting down webhook server with timeout of 1 minute
2024-07-26T06:27:32.624Z    INFO    Wait completed, proceeding to shutdown the manager
2024-07-26T06:27:32.624Z    ERROR   setup   problem running manager {"error": "Pod \"observability-solution-usage-28579039-h6hfd\" is invalid: metadata: Invalid value: \"Burstable\": Pod QoS is immutable"}

The Kubernetes version is 1.29. The Helm chart version for kube-image-keeper is 1.9.2.
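For context, the install was presumably something along these lines; the chart repository URL, release name, and namespace below are assumptions based on the project's published chart, not details from this report.

```sh
# Assumed installation commands; the repo URL, namespace, and release name
# are illustrative and not confirmed in this issue. The chart version
# matches the 1.9.2 reported above.
helm upgrade --install \
  --create-namespace --namespace kuik-system \
  kube-image-keeper kube-image-keeper \
  --repo https://charts.enix.io/ \
  --version 1.9.2
```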

paullaffitte commented 2 months ago

Hello, thank you for your submission.

Do you use any resource limits or requests on the pod observability-solution-usage-28579039-h6hfd? Do you use any mutating webhook or controller that could have injected a container into the pod configuration, or changed the resource limits/requests of any of its containers?
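To check both, standard kubectl queries like the following should work (the pod name is taken from the error message above; the namespace is a placeholder since the issue does not mention it):

```sh
# Show the pod's QoS class (Guaranteed, Burstable, or BestEffort); a mutating
# webhook that adds or changes requests/limits can flip this at admission time.
kubectl -n <namespace> get pod observability-solution-usage-28579039-h6hfd \
  -o jsonpath='{.status.qosClass}'

# List the mutating admission webhooks registered in the cluster.
kubectl get mutatingwebhookconfigurations
```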

kppullin commented 1 month ago

We have a similar startup failure with a potential workaround in #392 (and #397, due to my GitHub fail...).

The hint here about a mutating webhook was helpful and led us to uncover a Kyverno "mutate rule" that causes patch failures during the setup phase. Although the Kyverno mutation is effectively a no-op change, it is still detected as a change on the Kubernetes side, and the patch is rejected.

While we can likely ignore patch operations in this specific Kyverno rule, it still leaves open a potential denial of service should a future policy be created in a similar fashion.
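One mitigation is sketched below, assuming a Kyverno 1.x ClusterPolicy; the policy name, rule name, and mutation shown are hypothetical stand-ins, not our actual policy. The idea is to restrict the mutate rule to pod CREATE operations so that later UPDATE patches from controllers like kuik are left alone.

```yaml
# Hypothetical sketch: scope a Kyverno mutate rule to CREATE only, so
# subsequent UPDATE patches (e.g. from kuik's controller) are not re-mutated.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: example-mutate-pods        # hypothetical policy name
spec:
  rules:
    - name: add-default-requests   # hypothetical rule name
      match:
        any:
          - resources:
              kinds:
                - Pod
              operations:
                - CREATE           # skip UPDATE so patches to running pods pass through
      mutate:
        patchStrategicMerge:
          spec:
            containers:
              - (name): "*"        # Kyverno anchor: apply to every container
                resources:
                  requests:
                    memory: "64Mi" # example default request
```

Scoping to CREATE avoids the failure mode above: a request/limit added on UPDATE can change a running pod's QoS class, which the API server rejects as immutable.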