buildkite / agent-stack-k8s

Spin up an autoscaling stack of Buildkite Agents on Kubernetes
MIT License
79 stars, 30 forks

Underlying parse error does not print if using gitEnvFrom for Git credentials #233

Closed mbarrien closed 7 months ago

mbarrien commented 10 months ago

While testing agent-stack-k8s, I wrote some podSpecs that turned out to be malformed. When this happens, agent-stack-k8s tries to replace the container with a build failure job that just echoes the underlying error to the console, as defined in https://github.com/buildkite/agent-stack-k8s/blob/main/internal/controller/scheduler/scheduler.go#L433-L447.

However, when gitEnvFrom is used to supply SSH credentials for the Git checkout, the underlying error gets masked: the console only shows a failure to check out the Git repo (failing after 3 retries), and the underlying error is never printed. This is because neither the checkout container nor any other container gets the gitEnvFrom attached to it when there is a build failure. The checkout therefore fails, and the container that echoes the error never prints it.

The only workaround is to run the job, find the pod it creates while it is in flight, and look through the pod manifest for the error text in the BUILDKITE_COMMAND environment variable. This is obviously not optimal.
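For anyone who needs the workaround in the meantime, the manual inspection can be scripted. The following is a rough sketch only, assuming the pod manifest has been saved as JSON (for example from `kubectl get pod <pod-name> -o json`). The `podManifest` and `container` types, the `buildkiteCommands` helper, and the sample container name and command text are all made up for illustration and are not part of agent-stack-k8s.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// container models only the Pod container fields this sketch needs.
type container struct {
	Name string `json:"name"`
	Env  []struct {
		Name  string `json:"name"`
		Value string `json:"value"`
	} `json:"env"`
}

// podManifest models only the parts of a Kubernetes Pod read below.
type podManifest struct {
	Spec struct {
		InitContainers []container `json:"initContainers"`
		Containers     []container `json:"containers"`
	} `json:"spec"`
}

// buildkiteCommands (hypothetical helper) returns the BUILDKITE_COMMAND
// value of every container that sets one, keyed by container name.
func buildkiteCommands(manifest []byte) (map[string]string, error) {
	var pod podManifest
	if err := json.Unmarshal(manifest, &pod); err != nil {
		return nil, fmt.Errorf("parsing pod manifest: %w", err)
	}

	// Copy into a fresh slice so the decoded manifest is left untouched.
	all := append([]container{}, pod.Spec.InitContainers...)
	all = append(all, pod.Spec.Containers...)

	cmds := make(map[string]string)
	for _, c := range all {
		for _, e := range c.Env {
			if e.Name == "BUILDKITE_COMMAND" {
				cmds[c.Name] = e.Value
			}
		}
	}
	return cmds, nil
}

func main() {
	// Made-up sample manifest; a real one would come from kubectl.
	sample := []byte(`{
	  "spec": {
	    "containers": [
	      {"name": "container-0",
	       "env": [{"name": "BUILDKITE_COMMAND", "value": "echo \"failed parsing podSpec\""}]}
	    ]
	  }
	}`)

	cmds, err := buildkiteCommands(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for name, cmd := range cmds {
		fmt.Printf("%s: %s\n", name, cmd)
	}
}
```

The sketch deliberately decodes only the handful of fields it reads instead of pulling in the Kubernetes API types, so it stays a single stdlib-only file. Swapping the embedded sample for input read from `os.Stdin` would let it sit at the end of a `kubectl` pipeline.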

DrJosh9000 commented 10 months ago

Thanks @mbarrien - you found the code, so you understand what's going on here. The goal of BuildFailureJob was a "best effort last ditch" attempt to surface validation problems, but it clearly has issues.

We have ideas for a more comprehensive solution, but it won't be quick. In the meantime we'll accept PRs that would skip the checkout (see also #227) so that BuildFailureJob actually works.