buildkite-plugins / docker-buildkite-plugin

🐳📦 Run any build step in a Docker container

improve handling of SIGTERM grace period #261

Closed BenjiHansell closed 9 months ago

BenjiHansell commented 10 months ago

This PR makes this plugin behave more like non-Docker jobs with regard to cancellation and timeouts.

Consider the following scenario: you have a job which can time out (or be cancelled by the user), but you want it to shut down gracefully and conditionally mark the job as passed if the key parts of the job succeeded before the cancellation.

A typical example would be a benchmark job. When the timeout is hit, it is possible that enough iterations of the benchmark have already run to give a pass.


When creating a normal (non-Docker) Buildkite job, this is easy. In your test process you add a signal handler for SIGTERM (the signal Buildkite sends on timeout or user cancellation) that conditionally returns exit code 0 instead of the typical 143. If 0 is returned, Buildkite interprets that as a pass and marks the job with a green tick.
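
For illustration, a minimal such handler in bash might look like this (MIN_ITERATIONS and run_benchmark_iteration are made-up stand-ins for the real benchmark logic):

```bash
#!/bin/bash
# On SIGTERM, pass if enough iterations have completed, fail otherwise.
MIN_ITERATIONS=100
iterations=0

on_sigterm() {
  if (( iterations >= MIN_ITERATIONS )); then
    echo "Cancelled after ${iterations} iterations: enough data, passing"
    exit 0  # Buildkite marks the job with a green tick
  fi
  echo "Cancelled after only ${iterations} iterations: failing"
  exit 1
}
trap on_sigterm TERM

run_benchmark_iteration() {
  sleep 1  # stand-in for one iteration of real benchmark work
}

while true; do
  run_benchmark_iteration
  iterations=$((iterations + 1))
done
```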

Unfortunately, this doesn't currently work with the Docker plugin. So I've made the following changes:

  1. I've added trap calls to the script to ensure that SIGTERM only affects the Docker container, not the wrapper bash script. Without this, SIGTERM caused the bash script itself to terminate and return 143, which meant the exit code from the Docker container was ignored. Now, if the process inside the Docker container chooses to intercept the SIGTERM and, for example, return 0, that exit code is propagated all the way up to the Buildkite UI (which displays a green tick if the exit code is 0).
  2. BUILDKITE_CANCEL_GRACE_PERIOD is now passed to the Docker container: when a user presses cancel (or a job hits its timeout), Buildkite sends SIGTERM to the job and gives the process this grace period to terminate gracefully before resorting to SIGKILL. Docker has a similar flag, --stop-timeout, which was previously left at its 10-second default. With that default taking effect, increasing BUILDKITE_CANCEL_GRACE_PERIOD beyond 10 seconds had no effect. (A simplified sketch of how the two changes fit together follows this list.)
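
To make the mechanics concrete, here is a minimal sketch of the wrapper behaviour after both changes. This is not the plugin's actual hook code: the image and command are placeholders, and it assumes the agent signals the whole process group (so the docker CLI receives SIGTERM directly and proxies it to the container):

```bash
#!/bin/bash
# Sketch only: placeholder image/command, not the plugin's real hook.

# Change 1: give the wrapper a no-op SIGTERM handler so the agent's
# SIGTERM no longer kills this shell (which used to exit 143 and mask
# the container's exit code). The docker CLI still receives the signal
# and forwards it to the process inside the container.
trap ':' TERM

args=(run --rm)

# Change 2: mirror the agent's grace period onto docker's stop timeout,
# so raising BUILDKITE_CANCEL_GRACE_PERIOD actually buys the container
# more time before SIGKILL (docker's own default is 10 seconds).
if [[ -n "${BUILDKITE_CANCEL_GRACE_PERIOD:-}" ]]; then
  args+=(--stop-timeout "${BUILDKITE_CANCEL_GRACE_PERIOD}")
fi

docker "${args[@]}" my-image my-benchmark-command
exit $?  # the container's exit code, e.g. 0 from a graceful handler
```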
toote commented 9 months ago

@BenjiHansell great idea!

I believe that using the exact same grace period as the agent uses for the job may not be the best idea, as it may lead to a race between the agent sending the kill signal and Docker doing the same thing to the container (so you may still see the old behaviour sometimes). The PR is also missing a test for when BUILDKITE_CANCEL_GRACE_PERIOD is set to something other than 10... and, on top of that, the fact that --stop-timeout is added unconditionally to all executions broke all existing tests.

An easy way to remediate all of those would be to add a new option to the plugin (maybe called STOP_TIMEOUT?) so that the flag is only added when the option is specified. With that, you would still have to add a single test, but all existing ones would not break :)
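
Something along these lines, say (the option and variable names here are hypothetical, and the args array stands in for however the plugin builds its docker arguments):

```bash
# Hypothetical: only pass --stop-timeout when the user opts in, so
# default invocations (and the docker stubs in the existing tests)
# are unchanged.
if [[ -n "${BUILDKITE_PLUGIN_DOCKER_STOP_TIMEOUT:-}" ]]; then
  args+=(--stop-timeout "${BUILDKITE_PLUGIN_DOCKER_STOP_TIMEOUT}")
fi
```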

BenjiHansell commented 9 months ago

Hi @toote

Thanks for your response 😊

I've fixed the tests (they pass locally now, at least) and added a test for this feature.
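
The new test is roughly along these lines (a condensed sketch in the repo's bats style; the stubbed docker arguments are abbreviated with wildcards rather than spelling out the plugin's full invocation):

```bash
@test "Passes BUILDKITE_CANCEL_GRACE_PERIOD as --stop-timeout" {
  export BUILDKITE_PLUGIN_DOCKER_IMAGE=image:tag
  export BUILDKITE_COMMAND='echo hello'
  export BUILDKITE_CANCEL_GRACE_PERIOD=60

  # Abbreviated stub: the real test matches the full docker run
  # argument list, including --stop-timeout 60.
  stub docker \
    "run \* --stop-timeout 60 \* : echo ran docker with --stop-timeout 60"

  run "$PWD"/hooks/command

  assert_success
  assert_output --partial "ran docker with --stop-timeout 60"

  unstub docker
}
```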

Regarding making --stop-timeout user-configurable: I don't think we should, for three reasons:

  1. This race condition isn't new: both grace periods have always defaulted to 10 seconds. So, if this were an issue, I think it would have been noticed by now. I suspect it is indeed racy, but the two possible outcomes are effectively the same as far as the user is concerned.
  2. If users do have problems with this race condition and need to set a config option manually to work around it, that wouldn't be a great user experience. We should ideally provide an automatic solution.
  3. Other than working around this race condition, I can't think of a scenario where the user would want Docker's grace period to differ from the agent's grace period.

Adding this config option would be easy, but if it turns out not to be useful, removing it will not be so easy. So, I suggest we don't introduce a new config option unless a user requests it with a compelling use case. Unless you already have another use case in mind?