buildkite / agent-stack-k8s

Spin up an autoscaling stack of Buildkite Agents on Kubernetes
MIT License

PS-67: stop image pull backoff error handling for sidecars #344

Closed zhming0 closed 3 months ago

zhming0 commented 4 months ago

Currently, our backoff error watcher monitors all containers in a pod, which is problematic for customers who heavily rely on sidecars.

Since sidecar errors theoretically do not impact the health of pipeline jobs, canceling an entire job based on the status of a sidecar only adds unnecessary trouble.

At the moment, we don't provide governance support for sidecars, so customers can't see sidecar logs. When we kill a job because of a sidecar problem, customers aren't given a proper reason. Sometimes their CI workload is functioning correctly but a sidecar is slow to start, and the job is killed anyway. From the customer's perspective, everything appears to be working until something terminates the job for no visible reason, which is frustrating.

Customers can debug sidecar issues themselves through their Kubernetes platform.

This PR reduces the scope of the image pull backoff error watcher so it only monitors containers that we actively govern.
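
For illustration, the narrowed check could look roughly like the sketch below. This is not the code in this PR: `ContainerStatus` is a simplified, hypothetical stand-in for the Kubernetes container status type, the governed container names are assumptions, and treating both `ImagePullBackOff` and `ErrImagePull` as failures is an assumption about which waiting reasons the watcher cares about.

```go
package main

import "fmt"

// ContainerStatus is a simplified, hypothetical stand-in for the Kubernetes
// container status type. Only the fields needed for this sketch are modeled.
type ContainerStatus struct {
	Name          string
	WaitingReason string // empty when the container is not in a waiting state
}

// isImagePullFailure reports whether a waiting reason indicates an image pull
// problem. Which reasons the real watcher checks is an assumption here.
func isImagePullFailure(reason string) bool {
	switch reason {
	case "ImagePullBackOff", "ErrImagePull":
		return true
	}
	return false
}

// failingGovernedContainers returns the names of governed containers that are
// stuck on an image pull failure. Containers missing from the governed set,
// such as sidecars, are skipped, so their failures never cancel the job.
func failingGovernedContainers(statuses []ContainerStatus, governed map[string]bool) []string {
	var failing []string
	for _, s := range statuses {
		if !governed[s.Name] {
			continue // sidecar or other ungoverned container: ignore
		}
		if isImagePullFailure(s.WaitingReason) {
			failing = append(failing, s.Name)
		}
	}
	return failing
}

func main() {
	// Hypothetical set of container names the controller governs.
	governed := map[string]bool{"agent": true, "checkout": true, "container-0": true}

	statuses := []ContainerStatus{
		{Name: "container-0"},
		{Name: "my-sidecar", WaitingReason: "ImagePullBackOff"},
	}

	// Only the sidecar is failing, so nothing is reported for cancellation.
	fmt.Println(failingGovernedContainers(statuses, governed))
}
```

With this shape, a sidecar stuck in `ImagePullBackOff` is ignored, while the same failure on a governed container is still reported.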

NOTE: This PR is a temporary fix. A longer-term solution that addresses the observability issues comprehensively is being planned.