Pebble killing pods when juju api is unresponsive

canonical-is commented 1 year ago

Hi, pebble checks fails when juju api is unresponsive and restarts the pod. However, if/when this check fails because juju itself is indeed unresponsive (and not because of an issue with a k8s resource, eg pebble pod) all of the pods get restarted, possibly at the same time, in which case the live/running replica is left with 0 live pods to serve traffic.

This behavior outlines when juju api is unresponsive which is good, but this also brings services down that could have continued working just fine. Could we have a setting to inhibit this behavior or to be able to quickly adjust a timeout for this behavior ?

Pebble is killing working pod because of an ongoing juju load issue we are having, see https://bugs.launchpad.net/juju/+bug/1934524 (also there is historic load issue in this bug). Investigating the controllers or bringing the load down usually means bringing the controllers temporarily down, which effectively restarts some pods (eg : wordpress ones)

jnsgruk commented 1 year ago

This is a little confusing - Pebble cannot "kill a pod" since it is running in the pod.

If the defined health checks fail for long enough, Kubernetes will reap the pod. Can we see the health checks in question? It would be useful to see the Pebble plan and some logs.

canonical-is commented 1 year ago

Indeed, k8s reaps the pod when then health checks fail for long enough. I believe this is what happens because of the juju controller being unresponsive in this (internal) example https://pastebin.canonical.com/p/q4Zkt5MsJP/

benhoyt commented 1 year ago

Yes, it seems like this can't be a Pebble issue, so closing this. Please reopen as a Juju issue or discuss on the https://chat.charmhub.io/charmhub/channels/juju for further advice.

hloeung commented 1 year ago

For those following along, this bug was filed against Juju - https://bugs.launchpad.net/juju/+bug/2036594

canonical / pebble

Pebble killing pods when juju api is unresponsive #301