Open cywang117 opened 3 years ago
[cywang117] This issue has attached support thread https://jel.ly.fish/24ed7998-2398-46be-8653-100adff6eb90
@20k-ultra @pipex Do you guys think forcing a container kill with the commands below is reasonable in the event of a 409?
Source: https://github.com/moby/moby/issues/32938#issuecomment-642205013
[hades32] This issue has attached support thread https://jel.ly.fish/2c211add-d2ee-492b-8197-ab46b4f2a3cf
I think the engine already does that @cywang117, when telling a container to stop, it sends a SIGTERM, waits 10 seconds and then sends a SIGKILL. If that is not working I'm not sure the supervisor would succeed. If we see a device in that state, it might be worth trying those steps manually to see if they have an effect.
I think the obvious issue is that the Supervisor is trying to rm a container which is still running, so the Supervisor should handle containers that refuse to exit before trying to remove them.
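For reference, the engine's stop behavior described above (SIGTERM, a ten-second grace period, then SIGKILL) can be sketched in plain JavaScript. This is only an illustration of the sequence; the `proc` handle and helper names are assumptions, not engine internals:

```javascript
// Sketch of `docker stop` semantics: send SIGTERM, poll until the process
// exits or the grace period elapses, then escalate to SIGKILL.
// `proc` is a hypothetical handle with signal() and isAlive() methods.
async function stopLikeTheEngine(proc, graceSeconds = 10) {
  proc.signal('SIGTERM');
  const deadline = Date.now() + graceSeconds * 1000;
  while (proc.isAlive() && Date.now() < deadline) {
    // Poll every 100ms while waiting for a graceful exit.
    await new Promise((resolve) => setTimeout(resolve, 100));
  }
  if (proc.isAlive()) {
    // Grace period expired: the container ignored SIGTERM.
    proc.signal('SIGKILL');
  }
}
```

The failure mode discussed in this issue is the case where even the SIGKILL step does not make the container go away.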
[cywang117] This issue has attached support thread https://jel.ly.fish/efd68e13-8d5b-4e4a-b25f-a231d519dd4a
^ I saw this again on a user's device from that ticket. Reporting that kill -9 $(balena inspect CONTAINER_NAME -f "{{.State.Pid}}") did not work, however balena rm --force CONTAINER_NAME did. Maybe when we catch 2 classes of error codes here, we should also catch a 409 and execute a force removal in that catch -- what do you guys think @pipex @20k-ultra?
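A rough sketch of that proposal, assuming a dockerode-style container handle whose errors carry a `statusCode` property; the function name and option shapes here are assumptions for illustration, not the actual Supervisor code:

```javascript
// Hypothetical kill-step handler: tolerate 304/404 as the Supervisor
// already does, and treat a 409 Conflict as a signal to fall back to
// the equivalent of `balena rm --force <container>`.
async function killContainer(container) {
  try {
    await container.stop();
    await container.remove();
  } catch (err) {
    const code = err.statusCode;
    if (code === 304 || code === 404) {
      // 304: container already stopped; 404: container already removed.
      return;
    }
    if (code === 409) {
      // Conflict (e.g. removal already in progress, or the container is
      // still running): force the removal instead of throwing.
      await container.remove({ force: true });
      return;
    }
    throw err;
  }
}
```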
From the code you linked, the catch clause should be invoked if either the stop() call fails or the remove() call fails. Given the error message, it looks like the stop() call is returning misleading information, since the call to remove() is the one that fails. I'd like to try to figure out the underlying cause for this before trying to catch the 409, which may have multiple causes. A couple of ideas come to mind:

Run balena stop <container> && echo "OK". If we see "OK" and the container didn't really stop, then it is an engine issue.

Run docker.getContainer(<containerId>).stop().then(() => console.log("OK")). If this returns OK, then the issue is with dockerode.

One more thing: it might also be worth checking the result of

curl -v --unix-socket /var/run/balena-engine.sock -X POST http://localhost/containers/:containerId/stop

and

curl -v --unix-socket /var/run/balena-engine.sock -X POST http://localhost/containers/:containerId/kill

under those conditions, to see the raw response from the engine.
[pipex] This issue has attached support thread https://jel.ly.fish/d8671fa6-81be-44ac-8e71-3f3376c3719a
[pipex] This issue has attached support thread https://jel.ly.fish/b7fa70df-ad99-4deb-8f6a-2b78d2f47a44
Caught a device showing this issue. This is the response from the engine even though the stop failed:
root@0a4fb60:~# curl -v --unix-socket /var/run/balena-engine.sock -X POST http://localhost/containers/fb10333f5237/stop
* Trying /var/run/balena-engine.sock:0...
* Connected to localhost (/var/run/balena-engine.sock) port 80 (#0)
> POST /containers/fb10333f5237/stop HTTP/1.1
> Host: localhost
> User-Agent: curl/7.69.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 204 No Content
< Api-Version: 1.40
< Docker-Experimental: true
< Ostype: linux
< Server: Docker/19.03.24 (linux)
< Date: Wed, 05 Jan 2022 14:33:49 GMT
<
* Connection #0 to host localhost left intact
Same result from dockerode, no error thrown by the engine, which is why the supervisor proceeds with container removal
docker.getContainer('fb10333f52371bfdbc7b280f8f0de5d671f0857a2a27a2b7667a3fa9cb362070').stop().then(b => console.log(b.toString('utf8'))).catch(console.trace)
Promise {
<pending>,
[Symbol(async_id_symbol)]: 1154,
[Symbol(trigger_async_id_symbol)]: 1153,
[Symbol(destroyed)]: { destroyed: false }
}
[klutchell] This issue has attached support thread https://jel.ly.fish/88aadc7a-408c-44f7-8564-f18de52129a5
Currently, when executing a service kill step, the Supervisor throws any daemon HTTP codes that aren't 304 Not Modified (container already stopped) or 404 Not Found (container removed). See here. There is a condition where the engine will fail on both the SIGTERM (docker stop) and the SIGKILL 10 seconds later (docker kill). With this in mind, the Supervisor should force-terminate the old container so that the device isn't stuck in an invalid state with the Supervisor endlessly attempting to kill the container. Alternatively, https://github.com/balena-os/balena-supervisor/issues/1743 would also avoid the invalid state by ensuring that the Supervisor stops attempting to kill a container that refuses to die after a number of attempts, and reverts the state.
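The retry-limit idea from issue #1743 could look roughly like this; the counter, limit, and function name are hypothetical illustrations, not Supervisor internals:

```javascript
// Hypothetical retry limit for a kill step: track failed attempts per
// container and give up after MAX_KILL_ATTEMPTS so the state can be
// reverted instead of the device looping on the same step forever.
const MAX_KILL_ATTEMPTS = 3;

async function killWithRetryLimit(container, attempts = new Map()) {
  const tried = attempts.get(container.id) || 0;
  if (tried >= MAX_KILL_ATTEMPTS) {
    // Too many failures: surface an error so the caller can revert
    // rather than retrying indefinitely.
    throw new Error(
      `giving up on container ${container.id} after ${tried} attempts`
    );
  }
  attempts.set(container.id, tried + 1);
  try {
    await container.stop();
  } catch (err) {
    // 304/404 mean the container is already stopped/removed; anything
    // else counts as a failed attempt and propagates.
    if (err.statusCode !== 304 && err.statusCode !== 404) {
      throw err;
    }
  }
}
```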
NOTE: For support agents, to see if kill -9 will have an effect, please follow the steps listed in one of my comments below to try to kill a container that refuses to stop.