clean up fails in case of job timeouts

scriptnull commented 1 year ago

Hi, we recently hit a case where a job using this plugin had a timeout. When the timeout was hit, the job ended without running this plugin's cleanup steps. This in turn caused trouble for subsequent runs of the same job on the same node (because of the residue left by the failed cleanup).

I wonder if we can tackle this problem at the plugin level. If so, we could introduce a new flag like pre-cleanup that removes all containers and volumes on the host before the docker-compose up operation.

Do you have any suggestions on how we can handle timeouts gracefully?

SamirTalwar commented 1 year ago

It appears that cancelling a job means the cleanup steps are cancelled too.

Docker cleanup is one instance of this kind of failure, but a cancelled job could leave other things running on the machine too.

Perhaps we could just shut down the instance immediately (accepting no more jobs) if the job failed, was cancelled, etc. and only keep it around if it passed?

toote commented 1 year ago

Hi @scriptnull! What you describe is very odd!

To start with, when a job is cancelled, the pre-exit hook (where this plugin's cleanup code lives) should run anyway. Moreover, every container this plugin runs belongs to a project named something like buildkite${BUILDKITE_JOB_ID}, so subsequent runs of a pipeline on the same agent should create their containers in different projects and there shouldn't be a conflict. The exceptions would be containers that don't handle the signals sent to them gracefully, or in time to avoid being killed outright, or dependencies outside those defined in docker compose that cause the contention you describe.
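
To illustrate the per-job project naming described above, here is a minimal sketch. The helper name compose_project_name and the job IDs are made up for illustration, and stripping the dashes from the job ID is an assumption; the plugin's actual naming may differ in detail.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: derive a Compose project name from a job ID.
# Assumption: dashes are stripped from the ID; the real plugin may differ.
compose_project_name() {
  local job_id="${1:?job id required}"
  echo "buildkite${job_id//-/}"
}

# Two made-up job IDs standing in for consecutive runs on the same agent.
first="$(compose_project_name "0190-aaaa-1111")"
second="$(compose_project_name "0190-aaaa-2222")"

echo "first run:  ${first}"
echo "second run: ${second}"
```

Because each run gets its own project name, containers left behind by one job should not collide by name with those of the next, although shared host resources such as fixed ports or bind-mounted paths could still clash.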

That said, what you are suggesting may be dangerous for anyone running several agents on the same machine, as a pipeline may try to remove elements belonging to other builds that are still running. Without more specific information about the pipeline and the exact errors, it is hard to understand what is going on and how this plugin could help, because the scenario doesn't even sound possible :p
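
As a rough sketch of how a cleanup could avoid touching other builds, the commands can be scoped to a single Compose project instead of the whole host. The function name scoped_cleanup_commands and the project name are hypothetical; the sketch only prints the commands rather than running them, and it would only help if the project name of the earlier, timed-out job were known.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print (not run) cleanup commands limited to one
# Compose project, so containers from other agents are left alone.
scoped_cleanup_commands() {
  local project="${1:?project name required}"
  # Compose labels the containers it creates with their project name,
  # so this lists only that project's containers.
  echo "docker ps -aq --filter label=com.docker.compose.project=${project}"
  # Remove that project's containers, networks and volumes.
  echo "docker compose -p ${project} down --volumes --remove-orphans"
}

# Made-up project name for illustration.
scoped_cleanup_commands "buildkite0190aaaa1111"
```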

buildkite-plugins / docker-compose-buildkite-plugin

clean up fails in case of job timeouts #357