aws / amazon-ecs-init

ecs-init is now part of the amazon-ecs-agent repo https://github.com/aws/amazon-ecs-agent/tree/master/ecs-init
https://github.com/aws/amazon-ecs-agent
Apache License 2.0
200 stars 117 forks source link

ECS service doesn't always run at boot time due to systemd breaking service ordering cycles #377

Closed kjkuan closed 3 years ago

kjkuan commented 3 years ago

Summary

ECS service doesn't always run at boot time due to systemd breaking service ordering cycles

Description

Not sure if this is the right place to report this, but we've found that at least with amzn2-ami-ecs-hvm-2.0.20201119-x86_64-ebs, there appears to be some service ordering cycles, which systemd tries to break by removing a dependency (e.g., sometimes rpcbind.socket or docker.socket), which leads to Docker and the ECS services not being started due to the missing dependency.

You can see such behavior in action by running this command several times and observing the output:

systemd-analyze verify default.target

Below is the actual logs taken from journalctl -x run on an instance that failed to join an ECS cluster because no Docker nor ECS agent were running after boot:

Nov 26 02:42:50 localhost systemd[1]: systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN)
Nov 26 02:42:50 localhost systemd[1]: Detected virtualization kvm.
Nov 26 02:42:50 localhost systemd[1]: Detected architecture x86-64.
Nov 26 02:42:50 localhost systemd[1]: Cannot add dependency job for unit display-manager.service, ignoring: Unit not found.
Nov 26 02:42:50 localhost systemd[1]: Found ordering cycle on sockets.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on rpcbind.socket/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on sysinit.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on rngd.service/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on rngd-wake-threshold.service/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on basic.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on sockets.target/start
Nov 26 02:42:50 localhost systemd[1]: Breaking ordering cycle by deleting job rpcbind.socket/start
Nov 26 02:42:50 localhost systemd[1]: Job rpcbind.socket/start deleted to break ordering cycle starting with sockets.target/start
Nov 26 02:42:50 localhost systemd[1]: Found ordering cycle on sockets.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on amazon-ecs-volume-plugin.socket/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on sysinit.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on rngd.service/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on rngd-wake-threshold.service/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on basic.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on sockets.target/start
Nov 26 02:42:50 localhost systemd[1]: Breaking ordering cycle by deleting job amazon-ecs-volume-plugin.socket/start
Nov 26 02:42:50 localhost systemd[1]: Job amazon-ecs-volume-plugin.socket/start deleted to break ordering cycle starting with sockets.target/start
Nov 26 02:42:50 localhost systemd[1]: Found ordering cycle on sockets.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on dbus.socket/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on sysinit.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on rngd.service/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on rngd-wake-threshold.service/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on basic.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on sockets.target/start
Nov 26 02:42:50 localhost systemd[1]: Breaking ordering cycle by deleting job dbus.socket/start
Nov 26 02:42:50 localhost systemd[1]: Job dbus.socket/start deleted to break ordering cycle starting with sockets.target/start
Nov 26 02:42:50 localhost systemd[1]: Found ordering cycle on basic.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on paths.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on brandbot.path/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on sysinit.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on rngd.service/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on rngd-wake-threshold.service/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on basic.target/start
Nov 26 02:42:50 localhost systemd[1]: Breaking ordering cycle by deleting job paths.target/start
Nov 26 02:42:50 localhost systemd[1]: Job paths.target/start deleted to break ordering cycle starting with basic.target/start
Nov 26 02:42:50 localhost systemd[1]: Found ordering cycle on sockets.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on docker.socket/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on sysinit.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on rngd.service/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on rngd-wake-threshold.service/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on basic.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on sockets.target/start
Nov 26 02:42:50 localhost systemd[1]: Breaking ordering cycle by deleting job docker.socket/start
Nov 26 02:42:50 localhost systemd[1]: Job docker.socket/start deleted to break ordering cycle starting with sockets.target/start
Nov 26 02:42:50 localhost systemd[1]: Found ordering cycle on basic.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on sysinit.target/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on rngd.service/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on rngd-wake-threshold.service/start
Nov 26 02:42:50 localhost systemd[1]: Found dependency on basic.target/start
Nov 26 02:42:50 localhost systemd[1]: Breaking ordering cycle by deleting job rngd.service/start
Nov 26 02:42:50 localhost systemd[1]: Job rngd.service/start deleted to break ordering cycle starting with basic.target/start
Nov 26 02:42:51 localhost systemd[1]: Starting Flush Journal to Persistent Storage...

Expected Behavior

ECS agent should be running after instance boot.

Observed Behavior

See description.

Environment Details

EC2 instance running AMI: amzn2-ami-ecs-hvm-2.0.20201119-x86_64-ebs

Supporting Log Snippets

ellenthsu commented 3 years ago

Thanks for reporting this, we've released an updated AMI, amzn2-ami-ecs-hvm-2.0.20201125-x86_64-ebs, which contains the fix for this issue. Please let us know if you're still hitting this issue with the updated AMI.

kjkuan commented 3 years ago

Thanks @ellenthsu We willl give that a try. Just curious, is there a place where I can see the release notes with a summary of changes between the releases for the ECS optimized AMIs?

yhlee-aws commented 3 years ago

@kjkuan the release pages along with AWS doc should provide the information: https://github.com/aws/amazon-ecs-agent/releases https://github.com/aws/amazon-ecs-init/releases https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-ami-versions.html

kjkuan commented 3 years ago

The issue hasn't happened since we switched to the new AMI. Thanks! Closing this issue.