Closed jocado closed 2 years ago
Hi,
I don't seem to be able to reproduce this.
Here's what I'm doing:
dockerd
running and ecs agent stoppedsudo systemctl start ecs
sudo systemctl stop docker
Nov 16 00:01:58 ip-10-0-1-247.us-west-2.compute.internal sudo[32064]: ec2-user : TTY=pts/2 ; PWD=/home/ec2-user ; USER=root ; COMMAND=/bin/systemctl stop docker
Nov 16 00:01:58 ip-10-0-1-247.us-west-2.compute.internal sudo[32064]: pam_unix(sudo:session): session opened for user root by ec2-user(uid=0)
Nov 16 00:01:58 ip-10-0-1-247.us-west-2.compute.internal systemd[1]: Stopping Amazon Elastic Container Service - container agent...
Nov 16 00:01:58 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32068]: 2021-11-16T00:01:58Z [INFO] stop
Nov 16 00:01:58 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32068]: 2021-11-16T00:01:58Z [INFO] Stopping Amazon Elastic Container Service Agent
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32068]: 2021-11-16T00:01:59Z [INFO] Container name: /ecs-agent
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal dockerd[30841]: time="2021-11-16T00:01:59.266804960Z" level=info msg="ignoring event" container=d3861bf0983c424fdc81041a4f8c5f4db725b8fbc67c072c29fbe223f5869ca2 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal containerd[23767]: time="2021-11-16T00:01:59.266900919Z" level=info msg="shim disconnected" id=d3861bf0983c424fdc81041a4f8c5f4db725b8fbc67c072c29fbe223f5869ca2
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal containerd[23767]: time="2021-11-16T00:01:59.266959772Z" level=error msg="copy shim log" error="read /proc/self/fd/17: file already closed"
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[31821]: 2021-11-16T00:01:59Z [INFO] Agent exited with code 0
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32068]:
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[31821]:
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32127]: 2021-11-16T00:01:59Z [INFO] post-stop
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32127]: 2021-11-16T00:01:59Z [INFO] Cleaning up the credentials endpoint setup for Amazon Elastic Container Service Agent
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal amazon-ecs-init[32127]:
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal systemd[1]: Stopped Amazon Elastic Container Service - container agent.
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal systemd[1]: Stopping Docker Application Container Engine...
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal dockerd[30841]: time="2021-11-16T00:01:59.372893427Z" level=info msg="Processing signal 'terminated'"
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal dockerd[30841]: time="2021-11-16T00:01:59.374549610Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal dockerd[30841]: time="2021-11-16T00:01:59.374839252Z" level=info msg="Daemon shutdown complete"
Nov 16 00:01:59 ip-10-0-1-247.us-west-2.compute.internal systemd[1]: Stopped Docker Application Container Engine.
Could you share the exact repro steps please?
Hi.
Thanks for taking a look at this.
The reproduce it, you will need to replicate the service configuration on this system, because as I mentioned above, you cannot specify dependencies between arbitrary systemd service units. I also just found another piece of this puzzle, it's that the service responsible for docker on this system also removes the socket file after stopping docker.
Here are the reproduction steps:
# Update service config to remove inter service dependency
cp /lib/systemd/system/ecs.service /etc/systemd/system/
sed -i '/docker.service$/d' /etc/systemd/system/ecs.service
systemctl daemon-reload
# Stop docker
systemctl stop docker.socket
systemctl stop docker.service
# Remove socket
rm /var/run/docker.sock
Then you will see messages in the ecs service journal such as:
Nov 16 14:50:54 ip-10-0-1-119 systemd[1]: ecs.service: Control process exited, code=exited, status=255/EXCEPTION
Nov 16 14:50:54 ip-10-0-1-119 amazon-ecs-init[3497486]: 2021-11-16T14:50:54Z [INFO] Error connecting to docker, backing off for 1.14777941s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
Nov 16 14:50:55 ip-10-0-1-119 amazon-ecs-init[3497486]: 2021-11-16T14:50:55Z [INFO] Error connecting to docker, backing off for 2.282153551s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
Nov 16 14:50:57 ip-10-0-1-119 amazon-ecs-init[3497486]: 2021-11-16T14:50:57Z [INFO] Error connecting to docker, backing off for 4.466145821s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
Nov 16 14:51:02 ip-10-0-1-119 amazon-ecs-init[3497486]: 2021-11-16T14:51:02Z [INFO] Error connecting to docker, backing off for 5.235010051s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
Nov 16 14:51:07 ip-10-0-1-119 amazon-ecs-init[3497486]: 2021-11-16T14:51:07Z [INFO] Error connecting to docker, backing off for 5.287113937s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
Those are the same backoff messages I get from the ecs service when trying to run an ecs-init post-stop
on system shutdown if docker has already stopped running.
I was able to repro the issue, but I'm afraid the errors are not coming from post-stop
. Here is what happens:
StartSupervised
function then proceeds to end, returning a nil error heresystemd
restarts it automaticallyengine.New()
-> docker.NewClient()
-> newDockerClient
-> "Error connecting to docker...
Therefore, systemd is just trying to restart ecs-init, which in turn is trying to restart ecs agent. stop, and post-stop are never actually executed with the repro steps you provided.
Thoughts?
I think our situation is as a result of what you may currently consider a corner case. I think in our case docker and ecs are being stopped at the same time, but with the above dependency and socket caveats, which seems to result the unit running ecs-init stop
before the runing amazon-init process exists with an error.
We are running on a not officially supported OS, Ubuntu Core 20 [ based on 20.04 components ], so I don't want to waist your time trying to reproduce something that may be tricky.
The question remains though, should amazon-ecs-init post-stop
require the docker socket to exist ? Does the current code require it ?
Hmmm, ok, that's interesting. I will try and show why it appears like that from the situation I have, perhaps that will help shed some light on it.
As I mentioned above, when we do a system shutdown, the ecs-init service gets stopped when the docker service isn't running and the docker socket is removed. This causes significant delays on shutdown while there are backoff retries from ecs-init while the service unit runs the ecs-init stop
and post-stop
To prevent the delays, we added the following workarounds:
stop
was replaced with this:
#!/bin/bash
[ -S /var/run/docker.sock ] || echo "Not running ecs-init stop, as docker is not available"
[ -S /var/run/docker.sock ] && ${SNAP}/bin/amazon-ecs-init stop
post-stop
was replaced with this:
#!/bin/bash
[ -S /var/run/docker.sock ] || echo "Not running ecs-init post-stop, as docker is not available"
[ -S /var/run/docker.sock ] && ${SNAP}/bin/amazon-ecs-init post-stop
The above approach means that if docker is running then it works normally [ e.g. any normal service restart or stop ], but in system shutdown it doesn't timeout if the docker socket is not available.
It works as expected, and produces the following logs:
2021-11-18T08:36:40Z systemd[1]: Stopping Service for snap application ecs-anywhere.ecs-init...
2021-11-18T08:36:41Z ecs-anywhere.ecs-init[11803]: Not running ecs-init stop, as docker is not available
2021-11-18T08:36:41Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=1/FAILURE
2021-11-18T08:36:41Z ecs-anywhere.ecs-init[11856]: Not running ecs-init post-stop, as docker is not available
2021-11-18T08:36:41Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=1/FAILURE
2021-11-18T08:36:41Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Failed with result 'exit-code'.
2021-11-18T08:36:41Z systemd[1]: Stopped Service for snap application ecs-anywhere.ecs-init.
If I remove the socket check from the post-stop
, like this:
#!/bin/bash
[ -S /var/run/docker.sock ] || echo "Will run ecs-init post-stop, even though docker socket is not available"
${SNAP}/bin/amazon-ecs-init post-stop
I hangs on shutdown, which you can see in the following logs:
2021-11-18T08:51:04Z systemd[1]: Stopping Service for snap application ecs-anywhere.ecs-init...
2021-11-18T08:51:05Z ecs-anywhere.ecs-init[19141]: Not running ecs-init stop, as docker is not available
2021-11-18T08:51:05Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=1/FAILURE
2021-11-18T08:51:05Z ecs-anywhere.ecs-init[9973]:
2021-11-18T08:51:05Z ecs-anywhere.ecs-init[19187]: Will run ecs-init post-stop, even though docker socket is not available
2021-11-18T08:51:11Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:11Z [WARN] Could not retrieve the region from EC2 Instance Metadata. Error: %sEC2MetadataRequestError: failed to get EC2 instance identity document
2021-11-18T08:51:11Z ecs-anywhere.ecs-init[19222]: caused by: RequestError: send request failed
2021-11-18T08:51:11Z ecs-anywhere.ecs-init[19222]: caused by: Get http://169.254.169.254/latest/dynamic/instance-identity/document: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2021-11-18T08:51:11Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:11Z [INFO] Error connecting to docker, backing off for 1.14777941s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:12Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:12Z [INFO] Error connecting to docker, backing off for 2.282153551s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:14Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:14Z [INFO] Error connecting to docker, backing off for 4.466145821s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:19Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:19Z [INFO] Error connecting to docker, backing off for 5.235010051s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:24Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:24Z [INFO] Error connecting to docker, backing off for 5.287113937s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:29Z ecs-anywhere.ecs-init[19222]: 2021-11-18T08:51:29Z [ERROR] Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-18T08:51:29Z ecs-anywhere.ecs-init[19222]:
2021-11-18T08:51:29Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=255/EXCEPTION
2021-11-18T08:51:29Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Failed with result 'exit-code'.
2021-11-18T08:51:29Z systemd[1]: Stopped Service for snap application ecs-anywhere.ecs-init.
Some of the delay appears to be trying to reach the ec2 metadata service, but the bulk of the time seems to be trying to connect to docker.
Am I simply misinterpreting this ?
Hmm, that's very interesting. I still really don't see how post-stop
requires docker, but admittedly I can't explain the behavior observed in the logs above when post-stop
is executed.
To further investigate, could you repeat the scenario where you don't execute stop
; you DO execute post-stop
, but this time also add an echo
to both, pre-start
, and start
? this is to see if any of the latter is executed after stop
.
I added that debugging. We have both pre-start
and start
in the same wrapper, which now looks like this:
#!/bin/bash
set -a
source ${SNAP_DATA}/etc/ecs/ecs.config
source ${SNAP_DATA}/var/lib/ecs/ecs.config
set +a
echo "Will run ecs-init pre-start"
${SNAP}/bin/amazon-ecs-init pre-start
echo "Finished ecs-init pre-start"
echo "Will run ecs-init start"
${SNAP}/bin/amazon-ecs-init start
echo "Finished ecs-init start"
But, it doesn't seem to be executed during the system shutdown. Here are the logs:
2021-11-19T08:52:44Z systemd[1]: Stopping Service for snap application ecs-anywhere.ecs-init...
2021-11-19T08:52:44Z ecs-anywhere.ecs-init[16760]: Not running ecs-init stop, as docker is not available
2021-11-19T08:52:44Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=1/FAILURE
2021-11-19T08:52:44Z ecs-anywhere.ecs-init[14952]:
2021-11-19T08:52:44Z ecs-anywhere.ecs-init[16829]: Will run ecs-init post-stop, even though docker socket is not available
2021-11-19T08:52:50Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:52:50Z [WARN] Could not retrieve the region from EC2 Instance Metadata. Error: %sEC2MetadataRequestError: failed to get EC2 instance identity document
2021-11-19T08:52:50Z ecs-anywhere.ecs-init[16893]: caused by: RequestError: send request failed
2021-11-19T08:52:50Z ecs-anywhere.ecs-init[16893]: caused by: Get http://169.254.169.254/latest/dynamic/instance-identity/document: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2021-11-19T08:52:50Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:52:50Z [INFO] Error connecting to docker, backing off for 1.14777941s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:52:52Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:52:52Z [INFO] Error connecting to docker, backing off for 2.282153551s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:52:54Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:52:54Z [INFO] Error connecting to docker, backing off for 4.466145821s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:52:58Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:52:58Z [INFO] Error connecting to docker, backing off for 5.235010051s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:53:04Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:53:04Z [INFO] Error connecting to docker, backing off for 5.287113937s, error: Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:53:09Z ecs-anywhere.ecs-init[16893]: 2021-11-19T08:53:09Z [ERROR] Get http://unix.sock/v1.25/version: dial unix /var/run/docker.sock: connect: no such file or directory
2021-11-19T08:53:09Z ecs-anywhere.ecs-init[16893]:
2021-11-19T08:53:09Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Control process exited, code=exited, status=255/EXCEPTION
2021-11-19T08:53:09Z systemd[1]: snap.ecs-anywhere.ecs-init.service: Failed with result 'exit-code'.
2021-11-19T08:53:09Z systemd[1]: Stopped Service for snap application ecs-anywhere.ecs-init.
Seems to be a bit of a mystery!
Thanks for providing the additional info. I think I know what's happening. The problem is that the docker client is created on ecs-init's main function for all the actions, even if it's not going to be used (like in post-stop
).
I'll try and get a PR to address this and post an update here when the change is released.
@angelcar That's great - thank you.
Please let me know if you need any further information.
FYI @stuart-warren
Hi,
The change has been released as part of ECS agent 1.57.1.
I will close this issue. Kindly reopen if you face further problems.
Summary
When running ecs-init
post-stop
, it seems to require connecting to dockerd socket.Description
We are using a system that does not allow specifying dependencies between arbitrary systemd service units. Therefor we are not able to prevent ecs-init stop operations before dockerd is stopped in any graceful way.
To prevent delays on shutdown due to timeouts connecting to dockerd, we are forced implement a wrapper that check if dockerd socket exists before trying to run either
stop
orpost-stop
. This part is not a concern of ics-init :slightly_smiling_face: , but we would like to only implement this for the parts strictly required.However, according to docs and source code, the purpose of
post-stop
is:So, it's not clear why this should need to connect to dockerd socket. However we do see it timing out trying to connect in the ecs-init logs when stop-stop runs.
Is it possible that post-stop requiring connecting to dockerd is a bug, and if not could you perhaps give a brief explanation on why that is required.
Expected Behavior
post-stop does not require connecting to dockerd
Observed Behavior
post-stop triggers connection attempts to dockerd socket.
Environment Details
Ubuntu 20.04 ECS Init v1.52 Docker 19.03.13
Supporting Log Snippets