Currently, our image pull backoff error watcher monitors every container in a pod, which is a problem for customers who rely heavily on sidecars.
Since sidecar errors should not, in principle, affect the health of pipeline jobs, canceling an entire job because of a sidecar's status causes unnecessary disruption.
We don't currently provide governance support for sidecars, so customers can't see sidecar logs. When we kill a job because of a sidecar problem, customers aren't given a proper reason. Sometimes their CI workload is functioning correctly, but a sidecar starts late and the job is killed. From the customer's perspective, everything appears to be working until the job is terminated for no apparent reason, which is frustrating.
Customers can debug sidecar issues themselves through their Kubernetes platform.
This PR reduces the scope of the image pull backoff error watcher so it only monitors containers that we actively govern.
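
As a rough illustration of the narrowed scope, here is a minimal sketch rather than the actual watcher code. It assumes the pod is available as a plain dict shaped like the Kubernetes pod status JSON (`status.containerStatuses[].state.waiting.reason`), and that the names of the containers we govern are known up front. The function name `governed_backoff_errors` and the container names in the usage example are hypothetical.

```python
from typing import Iterable, List

# Waiting reasons the kubelet reports when an image cannot be pulled.
BACKOFF_REASONS = {"ImagePullBackOff", "ErrImagePull"}


def governed_backoff_errors(pod: dict, governed: Iterable[str]) -> List[str]:
    """Return the names of governed containers stuck on an image pull error.

    Containers not listed in `governed` (for example sidecars) are skipped,
    so their status can never cause the job to be canceled.
    """
    governed_names = set(governed)
    failing = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        if status.get("name") not in governed_names:
            continue  # sidecar or other ungoverned container: ignore it
        waiting = (status.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") in BACKOFF_REASONS:
            failing.append(status["name"])
    return failing


# Usage: the sidecar is in ImagePullBackOff, but only "build" is governed.
example_pod = {
    "status": {
        "containerStatuses": [
            {"name": "build", "state": {"running": {}}},
            {
                "name": "proxy-sidecar",
                "state": {"waiting": {"reason": "ImagePullBackOff"}},
            },
        ]
    }
}
errors = governed_backoff_errors(example_pod, governed=["build"])
```

With this filter in place, `errors` is empty for the pod above, so the job keeps running. If `build` itself were waiting on an image pull, it would still be reported.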
NOTE: This PR is a temporary fix. A longer-term solution that addresses the observability issues comprehensively is being planned.