Closed BenjiHansell closed 9 months ago
@BenjiHansell great idea!
I believe that using the exact same grace period as the agent has for the job may not be the best idea as it may lead to race conditions of the agent sending the kill signal and docker doing the same thing to the container (thus you may still see the same behaviour sometimes). The PR is also missing a test for when the BUILDKITE_CANCEL_GRACE_PERIOD
is set to something other than 10... and, on top of that, the fact that the --stop-timeout
is added unconditionally to all executions broke all existing tests.
An easy way to remediate all of those would be to add a new option to the plugin (maybe called STOP_TIMEOUT
?) that is only added when the option in the plugin is specified. With that, you would still have to add a single test, but all existing ones would not break :)
@BenjiHansell great idea!
I believe that using the exact same grace period as the agent has for the job may not be the best idea as it may lead to race conditions of the agent sending the kill signal and docker doing the same thing to the container (thus you may still see the same behaviour sometimes). The PR is also missing a test for when the
BUILDKITE_CANCEL_GRACE_PERIOD
is set to something other than 10... and, on top of that, the fact that the--stop-timeout
is added unconditionally to all executions broke all existing tests.An easy way to remediate all of those would be to add a new option to the plugin (maybe called
STOP_TIMEOUT
?) that is only added when the option in the plugin is specified. With that, you would still have to add a single test, but all existing ones would not break :)
Hi @toote
Thanks for your response 😊
I've fixed the tests (they pass locally now at least) and added a test for this feature.
Regarding making --stop-timeout
user-configurable; I don't think we should:
Adding this config option would be easy. But if it turns out to not be useful, removing it will not be so easy. So, I suggest that we don't introduce a new config option unless a user requests it, with a compelling use case. Unless you already have another use case in mind?
This PR makes this plugin behave more similarly to non-Docker jobs with regards to cancellation and timing out.
Consider the following scenario: You have a job which can timeout (or be cancelled by the user), but you want to shutdown gracefully, and conditionally mark the job as passed, if the key parts of the job succeeded before the cancellation.
A typical example would be a benchmark job. When the timeout is hit, it is possible that enough iterations of the benchmark have already run to give a pass.
When creating a normal (non-Docker) Buildkite job, this is easy. In your test process you add a signal handler for SIGTERM (which is what Buildkite sends on timeout or user cancellation) which conditionally returns exit code 0 (instead of the typical 143). If 0 is returned, Buildkite interprets that as a pass and marks the job with a green tick.
Unfortunately, this doesn't currently work with the Docker plugin. So I've made the following changes:
trap
calls to the script to ensure thatSIGTERM
only affects the Docker container, not the wrapper bash script. Without this,SIGTERM
was causing the bash script to terminate and return 143, which meant the exit code from the Docker container was being ignored. Now, if the process inside the Docker container chooses to intercept theSIGTERM
and e.g. return 0, then the exit code is propagated all the way up to Buildkite UI (which displays a green tick if the exit code is 0).BUILDKITE_CANCEL_GRACE_PERIOD
is now passed to the Docker container: When a user presses cancel (or a job hits its timeout), Buildkite sendsSIGTERM
to the job. The process is given this grace period to gracefully terminate, before Buildkite resorts to sending aSIGKILL
. Docker itself has a similar flag option--stop-timeout
, which was previously defaulting to 10 seconds. So, with that Docker default taking effect, increasingBUILDKITE_CANCEL_GRACE_PERIOD
had no effect.