Retries of failed jobs should ideally use a different agent

buildkite / feedback

Retries of failed jobs should ideally use a different agent #438

Open DazWorrall opened 5 years ago

DazWorrall commented 5 years ago

The job here was configured with automatic retries, but because the failures were caused by an infrastructure issue, the retries all failed: they were scheduled on the same agent. Preferring 'warm' agents when scheduling has benefits of course, but it would be useful if automatic (or all?) retries blacklisted the previous agent when the result was a failure.

cc @lox

lox commented 5 years ago

We'll discuss internally @DazWorrall!

djrodgerspryor commented 5 years ago

We've run into this a lot too. Most of our jobs involve docker, and if the docker daemon on the box gets into a bad state and stops responding (which happens annoyingly often), then that agent will very quickly accept and fail many jobs in a row.

We've actually built an external monitor service to kill any agent that fails too many jobs in a row, but it's often not quick enough, and jobs will run out of retries before the zombie agent is terminated.
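
For illustration, the core of a monitor like that could look roughly like the sketch below. This is not the actual service described above: the class name, the `threshold` default, and the `record` method are all hypothetical, and how job results reach the monitor (polling, webhooks, and so on) is left out.

```python
from collections import defaultdict


class FailureStreakMonitor:
    """Track consecutive job failures per agent (hypothetical sketch)."""

    def __init__(self, threshold=3):
        # threshold is an assumed value: failures in a row before acting.
        self.threshold = threshold
        self.streaks = defaultdict(int)

    def record(self, agent_id, passed):
        """Record one finished job; return True if the agent should be killed."""
        if passed:
            # Any passing job resets the streak for that agent.
            self.streaks[agent_id] = 0
            return False
        self.streaks[agent_id] += 1
        return self.streaks[agent_id] >= self.threshold
```

A lower threshold reacts faster but is more likely to kill a healthy agent that happened to pick up a few genuinely broken builds, which is the same timing trade-off mentioned above.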

lox commented 5 years ago

We hear you, and agree that we need a mechanism for controlling agent selection on retries. We've discussed a few different mechanisms, either via changes to our scheduler or a way for agents to run checks before accepting a job. Unfortunately neither is trivial, but both are on our radar.

In the meantime, perhaps a cron job that checks the state of a host and terminates the agent (and host) if docker has gotten into a bad state would help?
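
A minimal sketch of such a check, written as a script that cron could run every minute or so. Everything here is an assumption rather than an official recipe: it treats `docker info` failing or hanging as "bad state", and it assumes the agent runs as a systemd unit named `buildkite-agent`. Swap in whatever stop or terminate command fits your setup.

```python
import subprocess
import sys

# Assumed commands; adjust both for your hosts.
CHECK_CMD = ("docker", "info")
STOP_CMD = ("systemctl", "stop", "buildkite-agent")


def docker_is_healthy(check_cmd=CHECK_CMD, timeout_s=10.0):
    """Return True only if the check command exits 0 within the timeout."""
    try:
        result = subprocess.run(
            check_cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung daemon or a missing binary both count as unhealthy.
        return False
    return result.returncode == 0


def check_and_stop(check_cmd=CHECK_CMD, stop_cmd=STOP_CMD, timeout_s=10.0):
    """Stop the agent if the check fails; return True if a stop was attempted."""
    if docker_is_healthy(check_cmd, timeout_s):
        return False
    print("docker check failed, stopping agent", file=sys.stderr)
    try:
        subprocess.run(stop_cmd, check=False)
    except OSError:
        # The stop command itself could not be launched.
        pass
    return True
```

Calling `check_and_stop()` from a small wrapper script on a cron schedule would take the agent out of rotation once the check fails. Terminating the host as well would need an extra step that depends on where the host runs, so it is not shown.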