bcgov / DITP-DevOps

Digital Identity and Trust Program Team's DevOps Documentation Repository
Apache License 2.0

Investigate [Kubernetes] Init Container Terminated With Error alerts #131

Closed: i5okie closed this issue 9 months ago

i5okie commented 10 months ago

We're seeing a lot of [Kubernetes] Init Container Terminated With Error alerts from the bc0192-dev namespace. Specifically, these are coming from the CrunchyDB traction-database-dev Postgres cluster.

https://app.sysdigcloud.com/#/event-feed?last=1209600&scope=kube_namespace_label_name+in+%28%22bc0192%22%29+and+kube_namespace_label_environment+in+%28%22dev%22%29

WadeBarnes commented 10 months ago

The alert is specifically about the nss-wrapper-init container.

WadeBarnes commented 10 months ago

The state and reason of an init container can be seen using oc describe pod, for example oc -n bc0192-dev describe pod traction-database-dev-pg-tg79-0. The current deployments are reporting State: Terminated and Reason: Completed, but these deployments are not the ones that triggered the alerts.
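For example, assuming the same pod name as above, the init container status can also be pulled directly with jsonpath instead of scanning the full describe output (a minimal sketch; the fields print empty if the init container is not currently in a terminated state):

```sh
# Show the name, termination reason, and exit code for each init container
# of the example pod (pod name taken from the comment above).
oc -n bc0192-dev get pod traction-database-dev-pg-tg79-0 \
  -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{": reason="}{.state.terminated.reason}{", exitCode="}{.state.terminated.exitCode}{"\n"}{end}'
```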

WadeBarnes commented 10 months ago

Alert was set to repeat every 10 minutes until resolved. I turned that off to reduce the number of alerts.

i5okie commented 10 months ago

> The state and reason of an init container can be seen using oc describe pod, for example oc -n bc0192-dev describe pod traction-database-dev-pg-tg79-0. The current deployments are reporting State: Terminated and Reason: Completed, but these deployments are not the ones that triggered the alerts.

I could not find any history of pods that may have been replaced. I replaced the Crunchy cluster earlier today and haven't seen any more of these alerts being triggered, so it looks like the resource requests/limits I updated previously have resolved this issue.

i5okie commented 10 months ago

Do we really need to be alerted about Init Containers that terminate with errors? I see the value in having the metric, as it can aid in troubleshooting an issue with a pod. However, I do not understand the value of having an alert triggered from this query alone.

As per: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#understanding-init-containers

> If a Pod's init container fails, the kubelet repeatedly restarts that init container until it succeeds. However, if the Pod has a restartPolicy of Never, and an init container fails during startup of that Pod, Kubernetes treats the overall Pod as failed.

If the init container continues to fail, the pod will have a status of Init:CrashLoopBackOff. Would it not be more valuable to be alerted if the pod itself is in CrashLoopBackOff, or if we see an Init:CrashLoopBackOff status?
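If the alert were based on pod status instead, pods stuck in a failing init phase could be spotted with something like the following (a minimal sketch; Init:Error and Init:CrashLoopBackOff are the statuses oc get pods reports for failing init containers):

```sh
# Flag pods whose STATUS column reports a failing init container.
oc -n bc0192-dev get pods --no-headers | grep -E 'Init:(Error|CrashLoopBackOff)'
```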

rajpalc7 commented 10 months ago

Hi @i5okie, we already have an alert set up for the [Kubernetes] Pod Container Creating For a Long Time (Crashloopbackoff) status, which triggered 43 times in the past 20 days.

[attached image]
i5okie commented 10 months ago

> Hi @i5okie, we already have an alert set up for the [Kubernetes] Pod Container Creating For a Long Time (Crashloopbackoff) status, which triggered 43 times in the past 20 days.
>
> [attached image]

Right, so my question is: if an issue with an init container leads to the pod being in CrashLoopBackOff, resulting in an alert, do we still need to be alerted for the init container terminating with an error as well?

rajpalc7 commented 10 months ago

Good point. If @WadeBarnes agrees, I can go ahead and delete the init container alert.

WadeBarnes commented 10 months ago

It helps narrow down the problem, if there is one. It's also good to know if init containers are having issues even if the pod eventually starts. It indicates inconsistencies in the initialization that may need to be addressed.
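One hedged way to surface that kind of intermittent init trouble is to look at init container restart counts across the namespace (a minimal sketch using jsonpath; a non-zero count on an otherwise healthy pod suggests the init phase failed at least once before eventually succeeding):

```sh
# Print each pod's name followed by the restart counts of its init containers.
oc -n bc0192-dev get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.initContainerStatuses[*].restartCount}{"\n"}{end}'
```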

rajpalc7 commented 10 months ago

Just got a response back from the Sysdig support team regarding this:

Hello Rajpal & Dustin,

I took a deeper look at these alerts and they are indeed legitimate alerts. Looking at your metrics dashboard and filtering by the traction-database-dev deployment, I do see quite a bit of activity starting around 2pm on Oct 10th. It does appear that this deployment ran into some resource limits, which caused the pod/containers to restart.

Can you check the AGE of the pods within your bc0192-dev namespace and whether you see any restarts?

Thanks,

-Gilberto Rosario
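For reference, the pod age and restart counts the support team asks about show up in the standard RESTARTS and AGE columns (a minimal sketch; --sort-by orders the list by creation time so recently replaced pods stand out):

```sh
# List pods in the namespace, oldest first; check the RESTARTS and AGE columns.
oc -n bc0192-dev get pods --sort-by=.metadata.creationTimestamp
```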