aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

ECS agent apparently stuck in infinite restart loop #1257

Closed: alexhall closed this issue 6 years ago

alexhall commented 6 years ago

Summary

An ECS container instance launched from an EC2 autoscaling group using the EC2-optimized AMI became disconnected from the cluster. The apparent reason was that the ECS agent on the instance was restarting every few seconds.

Description

I created an ECS cluster attached to an EC2 autoscaling group, along with an ECS service and task, using a CloudFormation template. After a couple of updates to the CFN template, the service failed to update with the message: service ronaldo-docker-staging-service-1MBJ5GTPZJ6WM was unable to place a task because no container instance met all of its requirements. The closest matching container-instance 99024b43-6851-4f3e-8867-cdae0bf0c8ec encountered error "AGENT". (This was a staging environment, so the autoscaling group had a desired count of 1.)

I ssh'ed into the ECS container instance and followed the troubleshooting instructions in https://aws.amazon.com/premiumsupport/knowledge-center/ecs-agent-disconnected/.

Running docker ps multiple times in succession showed new container IDs on each invocation, with an uptime of less than 10 seconds, which seems to indicate that the agent is getting constantly restarted and cannot establish a connection with the ECS cluster.
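
In case it helps anyone reproduce the observation, something like the following shows the behavior (a rough sketch; the cluster and container-instance IDs are the ones from this stack, so substitute your own):

```sh
# Watch the agent container get replaced every few seconds
# (a new container ID with an uptime under 10 seconds on each refresh).
watch -n 2 'docker ps --filter name=ecs-agent'

# From outside the instance, confirm that the agent shows as disconnected.
aws ecs describe-container-instances \
  --cluster ronaldo-docker-staging-ECSCluster-AZZXSF9FRNWJ \
  --container-instances 99024b43-6851-4f3e-8867-cdae0bf0c8ec \
  --query 'containerInstances[].agentConnected'
```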

Environment Details

ECS container instance launched via an EC2 autoscaling group using ami-5e414e24.

Supporting Log Snippets

From the ECS agent logfile:

2018-02-22T20:48:32Z [INFO] Task engine [arn:aws:ecs:us-east-1:110372068158:task/290e0ecf-87ec-4305-8b6a-4ba02484b451]: stopping container [datadog-agent]
2018-02-22T20:48:32Z [INFO] Task engine [arn:aws:ecs:us-east-1:110372068158:task/290e0ecf-87ec-4305-8b6a-4ba02484b451]: stopping container [ronaldo]
2018-02-22T20:48:32Z [INFO] Task [arn:aws:ecs:us-east-1:110372068158:task/290e0ecf-87ec-4305-8b6a-4ba02484b451]: recording execution stopped time. Essential container [ronaldo] stopped at: 2018-02-22 20:48:32.964483693 +0000 UTC m=+508.284369640
2018-02-22T20:48:32Z [INFO] Managed task [arn:aws:ecs:us-east-1:110372068158:task/290e0ecf-87ec-4305-8b6a-4ba02484b451]: sending container change event [ronaldo]: arn:aws:ecs:us-east-1:110372068158:task/290e0ecf-87ec-4305-8b6a-4ba02484b451 ronaldo -> STOPPED, Exit 143, , Ports [{8007 32768 0.0.0.0 0}], Known Sent: RUNNING
2018-02-22T20:48:32Z [INFO] Managed task [arn:aws:ecs:us-east-1:110372068158:task/290e0ecf-87ec-4305-8b6a-4ba02484b451]: sent container change event [ronaldo]: arn:aws:ecs:us-east-1:110372068158:task/290e0ecf-87ec-4305-8b6a-4ba02484b451 ronaldo -> STOPPED, Exit 143, , Ports [{8007 32768 0.0.0.0 0}], Known Sent: RUNNING
2018-02-22T20:48:32Z [INFO] TaskHandler: batching container event: arn:aws:ecs:us-east-1:110372068158:task/290e0ecf-87ec-4305-8b6a-4ba02484b451 ronaldo -> STOPPED, Exit 143, , Ports [{8007 32768 0.0.0.0 0}], Known Sent: RUNNING
2018-02-22T20:48:32Z [INFO] DockerGoClient: Unable to retrieve stats for container 09da36d0f636dfc6ffa5e3edf2a87a2039f4e4bde52f9cea4324b1919e0b8158: context canceled
2018-02-22T20:48:32Z [INFO] Container 09da36d0f636dfc6ffa5e3edf2a87a2039f4e4bde52f9cea4324b1919e0b8158 is terminal, stopping stats collection
2018-02-22T20:48:32Z [INFO] Managed task [arn:aws:ecs:us-east-1:110372068158:task/290e0ecf-87ec-4305-8b6a-4ba02484b451]: redundant container state change. ronaldo to STOPPED, but already STOPPED
2018-02-22T20:48:32Z [INFO] Managed task [arn:aws:ecs:us-east-1:110372068158:task/290e0ecf-87ec-4305-8b6a-4ba02484b451]: redundant container state change. ronaldo to STOPPED, but already STOPPED
2018-02-22T20:48:33Z [INFO] Loading configuration
2018-02-22T20:48:33Z [INFO] Loading state! module="statemanager"
2018-02-22T20:48:33Z [INFO] Event stream ContainerChange start listening...
2018-02-22T20:48:33Z [INFO] Restored cluster 'ronaldo-docker-staging-ECSCluster-AZZXSF9FRNWJ'
2018-02-22T20:48:33Z [INFO] Restored from checkpoint file. I am running as 'arn:aws:ecs:us-east-1:110372068158:container-instance/99024b43-6851-4f3e-8867-cdae0bf0c8ec' in cluster 'ronaldo-docker-staging-ECSCluster-AZZXSF9FRNWJ'
2018-02-22T20:48:34Z [INFO] Registered container instance with cluster!
2018-02-22T20:48:34Z [INFO] Updating container reference ronaldo in Image State - sha256:064867ec3eefbe4c257130f742bfefcb9af1f5b7e3243c0a92ed5d5f72b8adcf
2018-02-22T20:48:34Z [INFO] Saving state! module="statemanager"
2018-02-22T20:48:34Z [INFO] Task [arn:aws:ecs:us-east-1:110372068158:task/290e0ecf-87ec-4305-8b6a-4ba02484b451]: recording execution stopped time. Essential container [ronaldo] stopped at: 2018-02-22 20:48:34.047464366 +0000 UTC m=+0.455235052

The last group of lines, from "Loading configuration" to "Task [...]: recording execution stopped time", is repeated indefinitely from here on out.

I ran the log-collector, but GitHub won't let me upload a .tgz file, so let me know if there's an alternate way to send the file.

adnxn commented 6 years ago

@alexhall, you're right. I think digging into the logs will help. I'm not able to see much from the snippet here. Please send the .tgz file to adnkha at amazon dot com. Thanks.

alexhall commented 6 years ago

@adnxn I emailed you the logfiles. Unfortunately I wound up having to tear down the container instance and start a new one, so I won't be able to provide further info. But this wasn't the first time I encountered the agent error while placing tasks, so if there are further debugging steps let me know in case it happens again.

adnxn commented 6 years ago

> so if there are further debugging steps let me know in case it happens again.

If you can repro this with ECS_LOGLEVEL=debug for the agent and debug-level logging for Docker, it would be a lot of help and give us more visibility into what is triggering the restart loop. Definitely something strange going on from looking at the info-level logs. Thanks again.
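
In case it helps, something along these lines should turn both on (a rough sketch, assuming the ECS-optimized Amazon Linux 1 AMI where the agent runs under upstart; merge the daemon.json setting with any existing Docker daemon configuration):

```sh
# Debug logging for the ECS agent.
echo 'ECS_LOGLEVEL=debug' | sudo tee -a /etc/ecs/ecs.config
sudo stop ecs && sudo start ecs

# Debug logging for the Docker daemon.
echo '{"debug": true}' | sudo tee /etc/docker/daemon.json
sudo service docker restart
```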

anosulchik commented 6 years ago

Hello, I'd like to confirm that exactly the same situation happens on multiple container instances in our infrastructure -- the ECS agent restarts in an infinite loop (see the logs below). We noticed that it happens on container instances that have been running containers for a while, and that it might be triggered by a deployment to an ECS service.

In order to recover capacity for ECS tasks in the cluster there are two options: remove /var/lib/ecs and restart the ECS agent (or just wait for its next start, since it restarts constantly), or replace the container instance.
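
Roughly, the first option looks something like this on an ECS-optimized Amazon Linux 1 instance, where the agent runs under upstart (a sketch only):

```sh
# Stop the agent, wipe its saved state so it re-registers from scratch,
# then start it again. Clearing just /var/lib/ecs/data may also be enough.
sudo stop ecs || true
sudo rm -rf /var/lib/ecs
sudo start ecs
```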

Unfortunately we don't have ECS_LOGLEVEL=debug set for the existing ECS agents, and the problem cannot be easily reproduced.

We faced this issue with ECS agents 1.14.5 and 0.16.0.


2018-02-26T16:29:53Z [INFO] Updating container reference primary in Image State - sha256:198a28fb35d4fb5485a90834dd99a80e6af4e8e7fab9a49382b62abc849b0540
2018-02-26T16:29:53Z [INFO] Saving state! module="statemanager"
2018-02-26T16:29:54Z [INFO] Loading configuration
2018-02-26T16:29:54Z [INFO] Loading state! module="statemanager"
2018-02-26T16:29:54Z [INFO] Event stream ContainerChange start listening...
2018-02-26T16:29:54Z [INFO] Restored cluster 'tf-ecs-cluster-prod'
2018-02-26T16:29:54Z [INFO] Restored from checkpoint file. I am running as 'arn:aws:ecs:us-east-1:111:container-instance/74f57da3-1b8a-4c6f-8a12-517efdd83744' in cluster 'tf-ecs-cluster-prod'
2018-02-26T16:29:54Z [INFO] Registered container instance with cluster!
2018-02-26T16:29:54Z [INFO] Updating container reference primary in Image State - sha256:198a28fb35d4fb5485a90834dd99a80e6af4e8e7fab9a49382b62abc849b0540
2018-02-26T16:29:54Z [INFO] Saving state! module="statemanager"
2018-02-26T16:29:54Z [INFO] Loading configuration
2018-02-26T16:29:54Z [INFO] Loading state! module="statemanager"
2018-02-26T16:29:54Z [INFO] Event stream ContainerChange start listening...
2018-02-26T16:29:54Z [INFO] Restored cluster 'tf-ecs-cluster-prod'
2018-02-26T16:29:54Z [INFO] Restored from checkpoint file. I am running as 'arn:aws:ecs:us-east-1:111:container-instance/74f57da3-1b8a-4c6f-8a12-517efdd83744' in cluster 'tf-ecs-cluster-prod'

richardpen commented 6 years ago

@alexhall @anosulchik Thanks for providing the logs. Based on them, I suspect you are running into issue #1237, where the agent could panic if the container has a healthcheck configured. This has been fixed in ECS Agent v1.17.1; can you upgrade to the latest agent version and see if you still experience this issue?

If you are still experiencing this issue, could you share the output of docker logs -f ecs-agent, as the panic information won't be saved in the agent logs.
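
Something like the following should keep a copy of that output even after the container exits (a sketch; it assumes the agent container has the default ecs-agent name):

```sh
# Follow the agent container's output and keep a copy on disk so the
# panic trace survives the restart.
docker logs -f ecs-agent 2>&1 | tee /tmp/ecs-agent-crash.log
```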

Thanks, Peng

anosulchik commented 6 years ago

@richardpen Thanks for your input. We don't have Docker health checks enabled for the containers running in our ECS tasks; we use ECS services attached to an ALB target group to validate that containers are up and running, so I believe #1237 is not our case.
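
For what it's worth, a quick way to double-check that none of the running containers define a Docker health check is something like this (a sketch; printing null for every container means no HEALTHCHECK is configured):

```sh
# Print each running container's name alongside its health check
# configuration, if any.
docker ps -q | xargs docker inspect --format '{{.Name}} {{json .Config.Healthcheck}}'
```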

At the moment we don't have a container instance with a crashed/restarting ECS agent, so we can only investigate retrospectively. I can pull all of the ECS agent's logs since the container instance started, as well as the Docker logs, if you need them. Just let me know.

Thank you.

richardpen commented 6 years ago

@anosulchik The log file may not contain useful information for this issue because the panic stack trace isn't saved to it. If you see this issue again, please run docker logs -f ecs-agent until the agent stops; the panic information should be printed at the end.

Thanks, Peng

anosulchik commented 6 years ago

Thanks @richardpen. The problem here is that the systemd script for the ECS agent that comes with Amazon Linux restarts the ecs-agent container and deletes the one that failed or crashed: https://github.com/segmentio/stack/blob/master/packer/ecs/root/etc/systemd/system/ecs-agent.service. Since we run ECS agents in prod this setup is reasonable for us. But yes, when the problem happens again, I'm going to collect the container's logs for you before recovering the container instance. Thanks.
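
For reference, a unit of that kind is along these general lines (a simplified sketch, not the exact file linked above); the docker rm in ExecStartPre is what throws away the crashed container, and with it the panic output in its logs:

```ini
[Unit]
Description=ECS Agent
After=docker.service
Requires=docker.service

[Service]
Restart=always
# Removing the previous agent container before each start is what
# deletes the crashed container and its logs.
ExecStartPre=-/usr/bin/docker rm -f ecs-agent
ExecStart=/usr/bin/docker run --name ecs-agent \
  --env-file=/etc/ecs/ecs.config \
  --volume=/var/run/docker.sock:/var/run/docker.sock \
  --volume=/var/lib/ecs/data:/data \
  --net=host \
  amazon/amazon-ecs-agent:latest
ExecStop=/usr/bin/docker stop ecs-agent

[Install]
WantedBy=multi-user.target
```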

alexhall commented 6 years ago

@richardpen I'm not sure whether or not we had health checks enabled, but the behavior does match that described in #1237, and I can confirm that we have not seen these errors recur since upgrading to 1.17.1 a week ago.

fractos commented 6 years ago

Hi, not sure whether it helps, but we hit this issue this morning with 1.17.0, and no panic was seen in the ecs-agent Docker logs. It seemed to have the "Unable to retrieve stats for container" issue for about an hour before everything went dark; SSH stopped working too. However, in /var/log/messages the following started appearing around the same time, which makes me think that it was caused not by the health check but by an ambient memory issue. Anyway, thought someone might find it useful.

Mar  8 03:35:03 ip-172-31-6-227 kernel: [768394.589284] Threadpool work: page allocation stalls for 10004ms, order:1, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Mar  8 03:35:04 ip-172-31-6-227 kernel: [768394.597963] CPU: 2 PID: 18895 Comm: Threadpool work Tainted: G            E   4.9.75-25.55.amzn1.x86_64 #1
Mar  8 03:35:05 ip-172-31-6-227 kernel: [768394.601948] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
Mar  8 03:35:05 ip-172-31-6-227 kernel: [768394.607255]  ffffc90001ae3a00 ffffffff8130205c ffffffff817bbbd0 0000000000000001
Mar  8 03:35:06 ip-172-31-6-227 kernel: [768394.611060]  ffffc90001ae3a80 ffffffff8118585b 026040c0b7460000 ffffffff817bbbd0
Mar  8 03:35:07 ip-172-31-6-227 kernel: [768394.611060]  ffffc90001ae3a28 0000000000000010 ffffc90001ae3a90 ffffc90001ae3a40
Mar  8 03:35:08 ip-172-31-6-227 kernel: [768394.611060] Call Trace:
Mar  8 03:35:08 ip-172-31-6-227 kernel: [768394.611060]  [<ffffffff8130205c>] dump_stack+0x63/0x87
Mar  8 03:35:08 ip-172-31-6-227 kernel: [768394.611060]  [<ffffffff8118585b>] warn_alloc+0x11b/0x150
Mar  8 03:35:10 ip-172-31-6-227 kernel: [768394.611060]  [<ffffffff811855a2>] ? __alloc_pages_direct_compact+0x52/0x100
Mar  8 03:35:10 ip-172-31-6-227 kernel: [768394.611060]  [<ffffffff81186244>] __alloc_pages_slowpath+0x934/0xb80
Mar  8 03:35:11 ip-172-31-6-227 kernel: [768394.611060]  [<ffffffff81186673>] __alloc_pages_nodemask+0x1e3/0x250

The server has been replaced and is now running 1.17.1.
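
In case anyone wants to check a live instance for the same pattern, something like this should surface it (a rough sketch):

```sh
# Look for allocation stalls or OOM activity in the kernel log.
dmesg | grep -iE 'allocation stall|out of memory'

# Snapshot host memory and per-container usage.
free -m
docker stats --no-stream
```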

aaithal commented 6 years ago

Hi all, please let us know if you run into this again with version 1.17.2 of the ECS agent. Closing this issue for now.

Thanks, Anirudh

vvs82 commented 5 years ago

Hi all! We have run into this issue with version 1.29.0 of the ECS agent.