Closed philippefuentes closed 5 years ago
Hey,
Sorry that you are having trouble with your service. We will need a little more info to debug this problem. Do you still have the agent and Docker logs from this incident? You can either use our logs collector or manually collect them and email them to me (petderek at amazon dot com).
Hi, From what I understand, the log collector is used to collect the logs from an instance. Unfortunately, the instance the problem occurred on is long gone; it was replaced by a new one to reschedule the tasks, as stated in the issue. I only have access to the logs through CloudWatch. I saw that a log group with a specific time range can be exported to S3; should I do that?
That would also work! I'm guessing Docker became unresponsive at some point, but we will need to analyze the logs to validate that.
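For reference, a CloudWatch Logs export like the one discussed above can be started from the AWS CLI with `create-export-task`. This is only a sketch: the log group name, bucket name, and prefix below are placeholders, and the destination bucket's policy must allow CloudWatch Logs to write to it. The time window is expressed in epoch milliseconds; here it brackets the incident time (2019-02-21, around 11:47).

```shell
# Convert the incident window to epoch milliseconds (GNU date).
FROM_MS=$(( $(date -u -d '2019-02-21 10:47:00' +%s) * 1000 ))  # window start
TO_MS=$((   $(date -u -d '2019-02-21 12:47:00' +%s) * 1000 ))  # window end

# Echoed as a dry run; remove the leading 'echo' to actually start the export.
# /ecs/my-service and my-log-export-bucket are placeholder names.
echo aws logs create-export-task \
  --log-group-name /ecs/my-service \
  --from "$FROM_MS" --to "$TO_MS" \
  --destination my-log-export-bucket \
  --destination-prefix ecs-agent-logs
```

The export task runs asynchronously; its progress can be checked with `aws logs describe-export-tasks`.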
I did the export and sent you an email with the subject: "Logs for github.com/aws/amazon-ecs-agent/issues/1874". Thank you very much.
Thanks -- We will take a look and get back to you
In case it can be of any help, I also checked ecs-init.log and can see two ERROR entries around the time of our "crash" (11:47 on 21/02):
2019-02-20T15:07:46.000Z 2019-02-20T15:07:46Z [INFO] pre-start
2019-02-20T15:07:48.000Z 2019-02-20T15:07:48Z [INFO] start
2019-02-20T15:07:48.000Z 2019-02-20T15:07:48Z [INFO] No existing agent container to remove.
2019-02-20T15:07:48.000Z 2019-02-20T15:07:48Z [INFO] Starting Amazon Elastic Container Service Agent
2019-02-21T11:47:21.000Z 2019-02-21T11:47:21Z [ERROR] could not start Agent: Post http://unix.sock/v1.25/containers/56818a252c5dff672b5ecae952bcbcb5504319a60b9b97ff6ea7a7e5ba0c1ccf/wait: EOF
2019-02-21T11:47:21.000Z 2019-02-21T11:47:21Z [ERROR] cannot connect to Docker endpoint
ERROR entries in ecs-init.log seem to be quite rare: they are the only results in a search spanning 4 months, for example.
To add more context: we have instance auto scaling enabled, and we run between 7 and 15 containers on one instance most of the time during the day. But when we deploy a new version (our deployment strategy brings the new version up before removing the old tasks, so we need twice as many containers during a deployment), or when traffic is heavier (we also auto scale at the task level), another instance is started once we reach 16 containers. So in practice we scale up to 2 instances a couple of times per day, for a relatively short period (end of deployment / end of a traffic peak), then automatically scale down to 1 instance (we always keep the newest instance, after draining all tasks from the old one to avoid service interruption).
I took a look through the agent logs as well. It looks like the Docker daemon stopped abruptly: many in-flight requests started returning EOF, and it stopped responding to requests altogether afterwards. It's not clear from the logs I've looked at what would have caused it to go down.
This happened while you only had one instance active, so the service didn't have any option to place the task elsewhere. And since Docker stopped during the task launch, the service wasn't able to know about the problem in advance either.
Does anyone know what happened and how we can prevent this from happening again? We're counting on container orchestration to auto-heal from this kind of problem...
The problem here is that something at the infrastructure layer (Docker) quit working, which we don't have a good self-healing story for right now. We can open this as a feature request (e.g., 'better recovery from Docker crashes').
That said, there are some other solutions you can explore for now:
The best way to get a better availability guarantee is to always use multiple instances (preferably across Availability Zones) to isolate yourself from one-off problems stemming from individual hosts.
You could look into health checks for the instances in your Auto Scaling group, and have it replace or add hosts if Docker is down.
Another thing you could consider is Fargate, which abstracts the infrastructure away entirely.
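The health-check suggestion above can be sketched as a small script run periodically (e.g., from cron) on each ECS instance. This is an assumed design, not an official mechanism: if the Docker daemon stops responding, it reports the instance as Unhealthy so the Auto Scaling group replaces it. It assumes the AWS CLI is installed with permission to call `autoscaling:SetInstanceHealth`, and that instance metadata is reachable.

```shell
#!/bin/sh
# Probe the Docker daemon. A wedged daemon often hangs rather than refusing
# connections, so bound the wait with a timeout.
docker_responsive() {
  timeout 10 docker info >/dev/null 2>&1
}

# Tell the Auto Scaling group this instance is broken so it gets replaced.
report_unhealthy() {
  instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
  aws autoscaling set-instance-health \
    --instance-id "$instance_id" --health-status Unhealthy
}

# Intended to run from cron every minute or so:
# if ! docker_responsive; then report_unhealthy; fi
```

For this to take effect, the Auto Scaling group must be configured to act on custom health-check results rather than only EC2 status checks.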
Thank you for your feedback. From what you're saying, the least time-consuming solution for now would be to update our launch configuration to use at least 2 "less powerful" instances instead of only one.
Using Fargate would be the next step, I think, as we can't really afford to spend additional time improving our auto scaling processes (we already spent a lot of time setting up proper instance auto scaling using Lambda functions).
Is migrating from ECS to Fargate an easy process ?
Is migrating from ECS to Fargate an easy process ?
Depends on what you are doing. Usually it's pretty straightforward. However, you might be relying on features that aren't supported on Fargate (such as privileged containers).
We have some guides on the subject: https://containersonaws.com/introduction/ec2-or-aws-fargate/ https://aws.amazon.com/blogs/compute/migrating-your-amazon-ecs-containers-to-aws-fargate/
In addition to our docs: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ECS_GetStarted.html
Thanks a lot for the links, will take a look.
Closing the issue as no further action required.
Summary
Updating our task to a new image of our app failed, putting our service offline.
Description
For the first time since we started using ECS more than a year ago, some of our clients could not use our app anymore: updating our task resulted in 0 tasks being deployed. The event log of the related service showed:
Expected Behavior
The new task should have replaced the old one normally, as always, referencing the new image corresponding to the new version of our app. Instead, we ended up with 0 tasks for our service, putting our production app offline.
Observed Behavior
Number of tasks deployed: 0
Environment Details
Supporting Log Snippets
Observing the ecs-agent log at the time of the deployment showed a number of entries linked to the Docker daemon.
We managed to work around the problem by raising the auto scaling desired capacity to force an additional instance (going from 1 to 2), so the new tasks would start on a brand-new machine.
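The workaround above (bumping the desired capacity from 1 to 2) can be applied from the AWS CLI with `set-desired-capacity`. The group name below is a placeholder; the command is built into a variable and echoed as a dry run.

```shell
# Placeholder group name: substitute your actual Auto Scaling group.
# Remove the echo (run "$CMD" via sh, or paste the command) to apply.
CMD="aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-ecs-asg \
  --desired-capacity 2"
echo "$CMD"
```

Note that scaling policies or scheduled actions can later move the desired capacity back, so this is a stopgap, not a fix.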
Does anyone know what happened and how we can prevent this from happening again? We're counting on container orchestration to auto-heal from this kind of problem...
Thank you in advance