Failed container SIGTERM + SIGKILL with `(HTTP code 409) unexpected - You cannot remove a running container`

cywang117 commented 3 years ago

Currently, when executing a service kill step, the Supervisor throws any daemon HTTP codes that aren't 304 Not Modified (container already stopped) or 404 Not Found (container removed). See here. There is a condition where the engine will fail on both the SIGTERM (docker stop) and the SIGKILL 10 seconds later (docker kill).

With this in mind, the Supervisor should force-terminate the old container so that the device isn't stuck in an invalid state of the Supervisor attempting to kill the container. Either that, or https://github.com/balena-os/balena-supervisor/issues/1743 would also avoid the invalid state by ensuring that the Supervisor wouldn't continue attempting to kill a container that doesn't want to be killed after a number of attempts and revert the state.

NOTE: For support agents, to see if kill -9 will have an effect, please follow the steps listed in one of my comments below to try to kill a container that refuses to stop.

jellyfish-bot commented 3 years ago

[cywang117] This issue has attached support thread https://jel.ly.fish/24ed7998-2398-46be-8653-100adff6eb90

cywang117 commented 3 years ago

@20k-ultra @pipex Do you guys think forcing a container kill with the commands below is reasonable in the advent of a 409?

Get the "ctr-pid" using docker inspect -f "{{.State.Pid}}"
Then force-kill the "ctr-pid" process, using kill -9 ; this should stop the container

Source: https://github.com/moby/moby/issues/32938#issuecomment-642205013

jellyfish-bot commented 3 years ago

[hades32] This issue has attached support thread https://jel.ly.fish/2c211add-d2ee-492b-8197-ab46b4f2a3cf

pipex commented 3 years ago

I think the engine already does that @cywang117, when telling a container to stop, it sends a SIGTERM, waits 10 seconds and then sends a SIGKILL. If that is not working I'm not sure the supervisor would succeed. If we see a device in that state, it might be worth trying those steps manually to see if they have an effect.

20k-ultra commented 3 years ago

I think the obvious issue is that the Supervisor is trying to rm a container which is still running so make the Supervisor handle containers which are refusing to exit better before trying to remove it.

jellyfish-bot commented 3 years ago

[cywang117] This issue has attached support thread https://jel.ly.fish/efd68e13-8d5b-4e4a-b25f-a231d519dd4a

cywang117 commented 3 years ago

^ I saw this again on a user's device from that ticket. Reporting that kill -9 $(balena inspect CONTAINER_NAME -f "{{.State.Pid}}") did not work, however, balena rm --force CONTAINER_NAME did. Maybe when we catch 2 classes of error codes here, we should also catch a 409 and execute a force removal with that catch -- what do you guys think @pipex @20k-ultra?

pipex commented 3 years ago

From the code you linked, the catch clause should be invoked either if the stop() call fails or the remove call fails. Given the error message, it looks like the stop call is returning misleading information, since the call to remove is the one to fail. I'd like to try to figure out the underlying cause for this before trying to catch the 409 which may have multiple causes. A couple of ideas to mind

Test the same operation with the engine, stop the container and see the result balena stop <container> && echo "OK". If we see "OK" and the container didn't really stop, then is a engine issue
Create a docker image with node and dockerode and try to stop the engine manually docker.getContainer(<containerId>).stop().then(() => console.log("OK")) if this returns OK, then the issue is with dockerode.
Otherwise the issue might be with the supervisor (although I don't think so)

pipex commented 3 years ago

One more thing. It might be also worth checking the result of

 curl -v  --unix-socket /var/run/balena-engine.sock -X POST http://localhost/containers/:containerId/stop

and

 curl -v  --unix-socket /var/run/balena-engine.sock -X POST http://localhost/containers/:containerId/kill

Under those conditions, to see the raw response from the engine

jellyfish-bot commented 3 years ago

[pipex] This issue has attached support thread https://jel.ly.fish/d8671fa6-81be-44ac-8e71-3f3376c3719a

jellyfish-bot commented 2 years ago

[pipex] This issue has attached support thread https://jel.ly.fish/b7fa70df-ad99-4deb-8f6a-2b78d2f47a44

pipex commented 2 years ago

Caught a device showing this issue, this is the response from the engine even though stop failed

root@0a4fb60:~#  curl -v  --unix-socket /var/run/balena-engine.sock -X POST http://localhost/containers/fb10333f5237/stop
*   Trying /var/run/balena-engine.sock:0...
* Connected to localhost (/var/run/balena-engine.sock) port 80 (#0)
> POST /containers/fb10333f5237/stop HTTP/1.1
> Host: localhost
> User-Agent: curl/7.69.1
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 204 No Content
< Api-Version: 1.40
< Docker-Experimental: true
< Ostype: linux
< Server: Docker/19.03.24 (linux)
< Date: Wed, 05 Jan 2022 14:33:49 GMT
< 
* Connection #0 to host localhost left intact

pipex commented 2 years ago

Same result from dockerode, no error thrown by the engine, which is why the supervisor proceeds with container removal

docker.getContainer('fb10333f52371bfdbc7b280f8f0de5d671f0857a2a27a2b7667a3fa9cb362070').stop().then(b => console.log(b.toString('utf8'))).catch(console.trace) 
Promise {
  <pending>,
  [Symbol(async_id_symbol)]: 1154,
  [Symbol(trigger_async_id_symbol)]: 1153,
  [Symbol(destroyed)]: { destroyed: false }
}

jellyfish-bot commented 2 years ago

[klutchell] This issue has attached support thread https://jel.ly.fish/88aadc7a-408c-44f7-8564-f18de52129a5

balena-os / balena-supervisor

Failed container SIGTERM + SIGKILL with `(HTTP code 409) unexpected - You cannot remove a running container` #1754