buildkite-plugins / docker-compose-buildkite-plugin

🐳⚡️ Run build scripts, and build + push images, w/ Docker Compose
MIT License

cancelled jobs do not stop main/run container #389

Closed raylu closed 1 year ago

raylu commented 1 year ago

opening this issue as recommended by @toote in https://github.com/buildkite-plugins/docker-compose-buildkite-plugin/pull/386#issuecomment-1536929277

some explanation about what is going on

First, when the job is cancelled, the agent running the job will send the whole process stack a stop signal and then a termination signal.

the process group receives a SIGTERM immediately (SIGTSTP is never used): https://github.com/buildkite/agent/blob/b9d4efd1/bootstrap/bootstrap.go#L118 https://github.com/buildkite/agent/blob/b9d4efd1/process/signal.go#L36

The main container should have received that and stopped itself then.

unfortunately, the process group that is signalled is bash's, because bash is what is running here https://github.com/buildkite-plugins/docker-compose-buildkite-plugin/blob/7d2363cb0c228af195374e0c81b4f9dee649ab62/commands/run.sh#L426. bash creates a new process group for each command it runs, so the docker compose run ... process never receives any signal and never stops.

$ bash
$ echo $$
33366
$ sleep 1000
$ pstree -p 33366
bash(33366)───sleep(34280)
$ kill 33366 # nothing happens
$ kill -- -33366 # nothing happens; this is equivalent to the above because sleep is in its own process group
$ kill -s KILL -- -33366 # bash exits
$ ps x | grep 34280
34280 pts/2    S      0:00 sleep 1000

you can see here that even though the bash process group was sent a SIGKILL, sleep is still running. you can see the process group behavior of bash with ps o pid,pgid,sid | grep 34280
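[editor's note] The job-control behaviour described above can be reproduced without an interactive terminal. This is a small standalone sketch (not plugin code) that turns on job control with set -m, the mode an interactive bash runs in:

```shell
#!/bin/bash
# With job control enabled (set -m), bash puts each job in its own
# process group, just like an interactive shell does. A signal sent to
# bash's own process group therefore never reaches the background child.
set -m
sleep 5 &
child=$!
my_pgid=$(ps -o pgid= -p $$ | tr -d ' ')
child_pgid=$(ps -o pgid= -p "$child" | tr -d ' ')
if [ "$my_pgid" != "$child_pgid" ]; then
  echo "child is in its own process group"
fi
kill "$child" 2>/dev/null
wait "$child" 2>/dev/null || true
```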

so the agent is never able to stop the main/run container when using docker-compose-buildkite-plugin

If that was not enough, the code just above the lines you added should have also taken care of it by killing all containers associated to the project itself (the main container is part of the project as well).

this I don't have a good understanding of. I agree that that is what should be happening, but that is not what we are seeing, which is why running against my PR causes docker ps to output a container ID and why you see docker stop in the logs (in my PR's description)

toote commented 1 year ago

@raylu thanks a lot for all the context!

How bash is run and the signals being used indeed have a lot to do with expected and actual behaviours. According to bash documentation:

When bash is interactive, in the absence of any traps, it ignores SIGTERM (so that kill 0 does not kill an interactive shell), and SIGINT is caught and handled (so that the wait builtin is interruptible).

As you correctly pointed out, SIGTERM is sent by the agent when a job is cancelled and the comment you pointed out does indicate that SIGINT would be a better one to use. If you want to test that behaviour, the signal to use can also be changed as an agent configuration.

you can see here that even though the bash process group was sent a SIGKILL, sleep is still running. you can see the process group behavior of bash with ps o pid,pgid,sid | grep 34280

This is quite interesting, but I am not sure what can be done about it, as SIGKILL can not be trapped or ignored. Behaviour for SIGKILL is completely outside the purview of this plugin.

Combining the previous two comments, an alternative that could actually be implemented in this plugin would be to add a trap on SIGTERM and do the cleanup, or at least print out a message. Unfortunately, not only am I not sure what else could be done that the pre-exit hook cleanup is not doing already; the cleanup itself would also be highly dependent on how the processes inside the docker containers being run (and their configurations) handle signals.
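[editor's note] A toy illustration of the trap idea — this is not the plugin's actual hook code, and the docker stop call is left as a comment because the real container name would come from the plugin's environment:

```shell
#!/bin/bash
# Sketch: trap SIGTERM in the command hook so cancellation can be
# reported and cleanup attempted instead of exiting silently.
caught=0
cleanup() {
  echo "caught SIGTERM; a real hook could docker stop the run container here"
  caught=1
  kill "$child" 2>/dev/null
}
trap cleanup TERM

# Simulate the agent cancelling the job shortly after it starts:
( sleep 0.2; kill -TERM $$ ) &

sleep 10 &            # stand-in for the long-running docker compose run
child=$!
wait "$child" 2>/dev/null
echo "job interrupted, caught=$caught"
```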

Which leads me to...

If that was not enough, the code just above the lines you added should have also taken care of it by killing all containers associated to the project itself (the main container is part of the project as well).

this I don't have a good understanding of. I agree that that is what should be happening, but that is not what we are seeing, which is why running against my PR causes docker ps to output a container ID and why you see docker stop in the logs (in my PR's description)

The pre-exit hook runs docker stop or docker kill (depending on the graceful-shutdown plugin configuration) on all containers in the project corresponding to the current build. From the docker documentation on stop and kill they do what we expect them to do: send signals to the processes running inside the containers to handle and do what they see fit. What the related PR does is just send the signals a second time to the containers that are still left running around. If that is having an effect that the first round did not have, it points to an issue with your main container and how it is handling signals in the first place; I don't think that sending signals multiple times should be an expected behaviour.

In conclusion, it is my opinion that:

raylu commented 1 year ago

When bash is interactive, in the absence of any traps, it ignores SIGTERM (so that kill 0 does not kill an interactive shell), and SIGINT is caught and handled (so that the wait builtin is interruptible).

SIGINT would be a better one to use. If you want to test that behaviour, the signal to use can also be changed as an agent configuration.

ah! though I don't think SIGINT would change anything if we're not stopping on a SIGTERM

you can see here that even though the bash process group was sent a SIGKILL, sleep is still running. you can see the process group behavior of bash with ps o pid,pgid,sid | grep 34280

This is quite interesting, but I am not sure what can be done about it, as SIGKILL can not be trapped or ignored. Behaviour for SIGKILL is completely outside the purview of this plugin.

oh, the SIGKILL is just for demonstration. I don't think either the agent or this plugin is managing to get a SIGKILL through to run.sh right now. but anyway, my testing in an interactive shell was wrong and the SIGTERM might be forwarded to docker compose run
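[editor's note] The corrected (non-interactive) behaviour is easy to check locally. A quick sketch showing that without job control a child shares the script's process group, so a group-wide signal would reach it:

```shell
#!/bin/bash
# A non-interactive bash (the mode run.sh executes in) does not enable
# job control, so children stay in the script's process group and a
# signal sent to that group reaches them too.
sleep 5 &
child=$!
my_pgid=$(ps -o pgid= -p $$ | tr -d ' ')
child_pgid=$(ps -o pgid= -p "$child" | tr -d ' ')
if [ "$my_pgid" = "$child_pgid" ]; then
  echo "child shares the script's process group"
fi
kill "$child"
wait "$child" 2>/dev/null || true
```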

The pre-exit hook runs docker stop or docker kill (depending on the graceful-shutdown plugin configuration) on all containers in the project corresponding to the current build. From the docker documentation on stop and kill they do what we expect them to do: send signals to the processes running inside the containers to handle and do what they see fit. What the related PR does is just send the signals a second time to the containers that are still left running around. If that is having an effect that the first round did not have, it points to an issue with your main container and how it is handling signals in the first place; I don't think that sending signals multiple times should be an expected behaviour.

we're not setting GRACEFUL_SHUTDOWN, so it defaults to false and we go straight to SIGKILL. SIGKILL can't be handled, so the issue isn't the signal handling of our container

the problem is that our main container isn't even being picked up by docker compose kill. we define a docker-compose.yml with 6 containers and tell the agent to run one called "tests" that depends on the other 5. in cleanup, we see the 5 getting killed but not "tests". only with my change do we see the 6th "tests" container being sent anything

toote commented 1 year ago

@raylu I tested with the following docker compose file:

services:
  app:
    image: bash:5.2.15
    depends_on:
      - dep1
      - dep2
      - dep3
      - dep4
      - dep5
  dep1:
    image: bash:5.2.15
    command: sleep 10000
  dep2:
    image: bash:5.2.15
    command: sleep 10000
  dep3:
    image: bash:5.2.15
    command: sleep 10000
  dep4:
    image: bash:5.2.15
    command: sleep 10000
  dep5:
    image: bash:5.2.15
    command: sleep 10000

and a very simple pipeline:

steps:
  - label: test
    command: sleep 1000
    plugins:
      - docker-compose#v4.12.0:
          run: app
          config: compose.yaml

The sleep commands all over the place are there to make sure that containers stay up and running while I debug, and bash is used throughout. And, after playing around for quite a while, I found that we were both right.

When I cancel the job running I always get the following line in my logs:

# Received cancellation signal, interrupting
--
  | ERROR: 2
  | 🚨 Error: The command exited with status -1

Which means that the process group received the signal and the docker-compose run command exited with an error, which caused the command hook to print out the message. The fact that you are not seeing that message is evidence that your container is swallowing whatever signal is being sent to the process group and not doing anything with it.
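[editor's note] The "status -1" here is most likely how the agent, a Go program, reports a process killed by a signal (an assumption about the agent's internals); at the shell level the same death shows up as 128 plus the signal number. A quick local check, unrelated to docker:

```shell
#!/bin/bash
# A process terminated by SIGTERM (signal 15) gives the shell exit
# status 128 + 15 = 143; runtimes such as Go report this as -1.
sleep 10 &
victim=$!
kill -TERM "$victim"
wait "$victim" 2>/dev/null
status=$?
echo "exit status: $status"   # prints "exit status: 143"
```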

Needless to say, a lot more investigation on this matter needs to happen, but it is probably outside the scope of this plugin.... except for:

Just as you are experiencing, if the main container does not stop correctly it will not be stopped nor killed by the cleanup script in the pre-exit hook. While this should not happen as the main container should exit when the job is cancelled, I do believe this to be an oversight and a bug with the current implementation of the plugin.

From my investigations this bug stems from the fact that docker compose does not consider a container started with docker compose run as part of the services so no command will interact with it whatsoever :( Thus explaining why it is not stopped nor killed in the hook as I assumed it would.

Your PR does work for your particular case, but I have a few concerns with it that I'll address in my review of it :)

raylu commented 1 year ago

ah, thanks for doing the test (and reviewing my PR)! I was thinking of doing a similar test, so this definitely saved me some time

I don't think your "app" container is handling SIGTERM. you're not setting init: true, so sleep is pid 1, which doesn't get any default signal handlers. you can see this by running it and then trying to docker stop the_container_id: it hangs for 10 seconds before exiting (because docker sends SIGTERM, waits 10 seconds, then sends SIGKILL)

I actually don't think sending a foregrounded compose process a SIGTERM does anything to the container, even if it handles SIGTERM:

$ docker compose version
Docker Compose version v2.3.3

$ cat test.yml 
services:
  app:
    image: bash:5.2.15
    init: true

$ docker compose -f test.yml run app sleep 1000
$ ps x | grep docker
81253 pts/2    Sl+    0:00 /home/aurelia/.docker/cli-plugins/docker-compose compose -f test.yml run app sleep 1000

$ kill 81253 # I get my prompt back in the other shell

$ docker ps -a
CONTAINER ID   IMAGE       COMMAND                  CREATED          STATUS                       
eb155123e696   bash:5.2.15 "docker-entrypoint.s…"   32 seconds ago   Up 31 seconds

so even after sending my compose run a SIGTERM and watching it exit, the container is still running. I'm not sure why this happens; my reading of the code is that compose run is just docker start without --detach by default https://github.com/docker/compose/blob/0c1a691fa5b9b4a68ae00bf9d4b5120a1e2b87b4/cmd/compose/run.go#L246 https://github.com/docker/compose/blob/0c1a691fa5b9b4a68ae00bf9d4b5120a1e2b87b4/pkg/compose/run.go#L38, so it should forward signals. maybe there's something wrong with my test setup

(I also tried this on another system running Docker Compose version v2.12.2; I had to kill -s INT -- -81253 to mimic hitting ctrl+c in bash, but the container still stayed running. then I upgraded all the way up to 2.15.1 in case it had anything to do with https://github.com/docker/compose/issues/9087#issuecomment-1387237234 and was still able to reproduce)

it also looked to us like our tests were stopping when we cancelled a pipeline because the buildkite web interface stopped showing output. but our code spends 10+ minutes running tests and then uploads artifacts (via buildkite-agent artifact upload <path>). we noticed this problem when we saw that halfway through tests, we could cancel and then, a few minutes later, we'd still see the artifacts get uploaded

(our test container uses dumb-init rather than tini from init: true and we aren't blocking SIGTERM, so I do think we're handling SIGTERM. I could be wrong, though)

toote commented 1 year ago

I will review what you are saying. Note, though, that with the pipeline and setup I mentioned, the way the plugin works means the sleep command is not pid 1, because it is run through a shell as /bin/sh -e -c 'sleep 1000'. The actual command in the log is docker-compose -f compose.yaml -p $PROJECT run --name $NAME --label $LOTSALABELS --rm app /bin/sh -e -c 'sleep 1000'. So /bin/sh is actually pid 1 and sleep a child of it, and the signal appears to be handled correctly because it is not an interactive session.

The exact details on the process tree are a moot point because they do not reflect your particular scenario, they are just a basis for my experiment to understand the whole situation. They do point out, though, that the agent is sending the signal to the docker compose run process as it should and when it is handled, the container exits and an error is printed out by the plugin hook. Which was my original point: in your original bug report there is no such message, which means that the signal sent to the docker compose run process is getting swallowed up and ignored.

raylu commented 1 year ago

So /bin/sh is actually pid 1 and sleep a subchild and the signal appears to be handled correctly

I don't think so: https://github.com/mirror/busybox/blob/1_35_0/shell/ash.c#L10532-L10534

$ docker compose -f test.yml run --rm app /bin/sh -e -c 'sleep 1000'
$ docker ps               
CONTAINER ID   IMAGE         COMMAND                  CREATED         STATUS
3de6714090a4   bash:5.2.15   "docker-entrypoint.s…"   3 seconds ago   Up 2 seconds

$  docker exec -it 3de6714090a4 ps x
PID   USER     TIME  COMMAND
    1 root      0:00 sleep 1000
[...]

$ docker stop 3de6714090a4 # hangs for 10 seconds before SIGKILL-ing
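[editor's note] The busybox ash behaviour linked above can be reproduced locally without a container: bash applies the same sole-command optimisation under -c, exec-ing the command in place of the shell (which is why, inside the container, sleep ends up as pid 1). A sketch:

```shell
#!/bin/bash
# bash -c with a single simple command execs it instead of forking, so
# the shell process is replaced by the command it runs.
bash -c 'sleep 5' &
child=$!
sleep 0.5                                  # give the shell a moment to exec
child_comm=$(ps -o comm= -p "$child" | tr -d ' ')
echo "pid $child is running: $child_comm"  # "sleep", not "bash"
kill "$child"
wait "$child" 2>/dev/null || true
```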

the agent is sending the signal to the docker compose run process as it should and when it is handled, the container exits and an error is printed out by the plugin hook.

the agent is signalling docker compose run, which exits with an error printed out by the plugin hook. however, no signal is ever sent to the container and it continues running. you can see this in my test above where directly signalling docker compose run leaves the container up. changing the test to some container with buildkite-agent installed and running sleep 30 && echo hello > /tmp/hello && buildkite-agent artifact upload /tmp/hello will hopefully show the same behavior we're seeing

toote commented 1 year ago

@raylu this issue was closed automatically due to merging the corresponding PR, but feel free to either re-open it or continue the discussion on the underlying issue :)

toote commented 1 year ago
  | ⠿ Container buildkite0188080e66054253ab6ce4b3e3877dec_tests_build_287311  Error while Removing 0.7s
  | Error response from daemon: removal of container 2cbf7301d12f1cac79ca2e69da1d494d7b2d1c2863222487b3b5470d1695d455 is already in progress

@raylu from my understanding this would point out that the docker compose command is still "processing" the original stopping signal, but I could be wrong and am really interested in whatever your investigation turns up

pzeballos commented 1 year ago

We have continued to run some tests here and it would appear that there is indeed an issue where the signal is not being passed on correctly. So I'll reopen this bug and we'll take another look into it

raylu commented 1 year ago

so I looked through old builds that got cancelled and found something very interesting

all of our cancelled builds before 2023-01-31 have the behavior toote saw, where ERROR: 2 gets logged and no artifacts are uploaded. all of the ones after that don't log that, and artifacts get uploaded

on 2023-01-31, we upgraded elastic-ci-stack-for-aws from v5.11.0 to v5.16.1. v5.12.0 added docker compose v2: https://github.com/buildkite/elastic-ci-stack-for-aws/blob/master/CHANGELOG.md#v5120-2022-11-08. the old builds that cancelled successfully run docker-compose -f ...; the new builds that don't cancel run docker compose -f ...

armed with this knowledge, I tried

$ cat test.yml
version: '3'
services:
  app:
    image: bash:5.2.15

$ docker-compose -f test.yml run --rm app sleep 1000

I had to add version and get rid of init since v1 doesn't support that. then, sending SIGTERM to the docker-compose run process does nothing for 10 seconds (since sleep isn't handling the signal) and then the container exits!

toote commented 1 year ago

@raylu sorry for the delay but I think I have finally found a satisfying solution to this issue. I have created PR #392 adding signal handling to the command hook so that the main container is stopped appropriately.

raylu commented 1 year ago

I don't think this changes anything, but upstream claims to have fixed the underlying issue https://github.com/docker/compose/issues/10586

older versions of docker compose v2 will still have the problem though, so you probably still want your fix

toote commented 1 year ago

actually, the PR I created changes 2 things:

1. it guarantees that the container will receive a stopping signal with a default plugin configuration. Without the change, the behaviour is that the container will still be running and gets killed in the pre-exit hook (as per our latest changes), or stopped if you enable the graceful-shutdown option instead.

2. as this happens during the command step, its output will be visible in the step's build output