aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

The closest matching container-instance xxxxxxxxx doesn't have the agent connected #1874

Closed philippefuentes closed 5 years ago

philippefuentes commented 5 years ago

Summary

Updating our task to a new image of our app failed, taking our service offline

Description

For the first time since we started using ECS more than a year ago, some of our clients could no longer use our app: updating our task resulted in 0 tasks being deployed. The event log of the related service showed:

service vma-cluster-webapp-prod-service was unable to place a task because no container instance met all of its requirements. The closest matching container-instance 7c0066ce-597d-4a23-b36b-1bcea7b8ec46 doesn't have the agent connected. For more information, see the Troubleshooting section.

Expected Behavior

The new task should have replaced the old one as usual, referencing the image for the new version of our app. Instead, we ended up with 0 tasks for our service, taking our production app offline.

Observed Behavior

Number of tasks deployed: 0

Environment Details

Agent version: 1.25.2
Docker version: 18.06.1-ce

Supporting Log Snippets

The ecs-agent log at the time of the deployment showed a flood of entries indicating the Docker daemon was unreachable:

...
2019-02-21T11:47:21Z [WARN] DockerGoClient: Unable to retrieve stats for container bb48f32e6ef0ef10baf4e5dc8aae08039e27b111473b9e3c968d79832cad9884: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2019-02-21T11:47:21Z [WARN] DockerGoClient: Unable to retrieve stats for container bb48f32e6ef0ef10baf4e5dc8aae08039e27b111473b9e3c968d79832cad9884: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
(the same WARN line repeats many more times at the same timestamp)

...

We managed to fix the problem by raising the Auto Scaling desired capacity from 1 to 2, forcing the new tasks to start on a brand-new instance.
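
For reference, the equivalent call through the Auto Scaling API looks roughly like this (a boto3 sketch; the group name is hypothetical):

import boto3

autoscaling = boto3.client("autoscaling")

# Bump the desired capacity from 1 to 2 so the scheduler gets a fresh
# instance to place the stuck tasks on.
autoscaling.set_desired_capacity(
    AutoScalingGroupName="vma-cluster-webapp-prod-asg",  # hypothetical name
    DesiredCapacity=2,
    HonorCooldown=False,  # skip the cooldown for an emergency scale-out
)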

Does anyone know what happened and how we can prevent this from happening again? We're counting on container orchestration to auto-heal from this kind of problem...

Thank you in advance

petderek commented 5 years ago

Hey,

Sorry that you are having trouble with your service. We will need a little more info to debug this problem. Do you still have the agent and Docker logs from this incident? You can either use our logs collector or manually collect them and email them to me (petderek at amazon dot com).

philippefuentes commented 5 years ago

Hi, from what I understand the log collector is used to collect logs from an instance. Unfortunately, the instance the problem occurred on is long gone; it was replaced by a new one so the tasks could be rescheduled, as stated in the issue. I only have access to the logs through CloudWatch. I saw that a log group can be exported to S3 for a specific time range; should I do that?
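
For reference, that export can also be driven through the API, something like this (a boto3 sketch; the log group and bucket names are hypothetical):

import boto3
from datetime import datetime, timezone

logs = boto3.client("logs")

def epoch_ms(dt):
    # CreateExportTask takes timestamps in epoch milliseconds
    return int(dt.timestamp() * 1000)

logs.create_export_task(
    taskName="ecs-agent-incident-2019-02-21",
    logGroupName="/ecs/ecs-agent",  # hypothetical log group name
    fromTime=epoch_ms(datetime(2019, 2, 21, 11, 0, tzinfo=timezone.utc)),
    to=epoch_ms(datetime(2019, 2, 21, 13, 0, tzinfo=timezone.utc)),
    destination="my-incident-logs",  # hypothetical S3 bucket; its policy must allow CloudWatch Logs to write
)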

petderek commented 5 years ago

That would also work! I'm guessing Docker became unresponsive at some point, but I'll need to analyze the logs to validate that.

philippefuentes commented 5 years ago

I did the export and sent you an email with the subject: "Logs for github.com/aws/amazon-ecs-agent/issues/1874". Thank you very much.

petderek commented 5 years ago

Thanks, we will take a look and get back to you.

philippefuentes commented 5 years ago

In case it can be of any help, I also checked ecs-init.log and can see 2 ERROR entries around the time of our "crash" (11:47 on 21/02):

2019-02-20T15:07:46.000Z 2019-02-20T15:07:46Z [INFO] pre-start
2019-02-20T15:07:48.000Z 2019-02-20T15:07:48Z [INFO] start
2019-02-20T15:07:48.000Z 2019-02-20T15:07:48Z [INFO] No existing agent container to remove.
2019-02-20T15:07:48.000Z 2019-02-20T15:07:48Z [INFO] Starting Amazon Elastic Container Service Agent
2019-02-21T11:47:21.000Z 2019-02-21T11:47:21Z [ERROR] could not start Agent: Post http://unix.sock/v1.25/containers/56818a252c5dff672b5ecae952bcbcb5504319a60b9b97ff6ea7a7e5ba0c1ccf/wait: EOF
2019-02-21T11:47:21.000Z 2019-02-21T11:47:21Z [ERROR] cannot connect to Docker endpoint


ERROR entries in ecs-init.log seem to be quite rare: they were the only results in a search spanning 4 months, for example.


To add more context: we have instance auto scaling enabled, and we run between 7 and 15 containers on a single instance most of the day. When we deploy a new version (our deployment strategy brings the new version up before removing the old tasks, so we need twice as many containers during a deployment), or when traffic is heavier (we also have auto scaling at the task level), another instance is started once we reach 16 containers. So in practice we scale up to 2 instances a couple of times per day for a relatively short period (end of a deployment / end of a traffic peak), then automatically scale back down to 1 instance (we always keep the newest instance, after draining the old one of all tasks to avoid service interruption).

petderek commented 5 years ago

I took a look through the agent logs as well. It looks like the Docker daemon stopped abruptly: many in-flight requests started returning EOF, and afterwards it stopped responding to requests altogether. It's not clear from the logs I've looked at what caused it to go down.

This happened while you only had one instance active, so the service didn't have any other options for placing the task. And since Docker stopped during the task launch, the service had no way to know about the problem in advance, either.
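
For what it's worth, this condition is visible from outside the host: DescribeContainerInstances exposes an agentConnected flag per container instance. A minimal boto3 sketch (the cluster name is hypothetical):

import boto3

ecs = boto3.client("ecs")
cluster = "vma-cluster-webapp-prod"  # hypothetical cluster name

# List the cluster's container instances and flag any whose agent is down.
arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
if arns:
    described = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
    for ci in described["containerInstances"]:
        if not ci["agentConnected"]:
            print(f"{ci['ec2InstanceId']}: agent disconnected (status: {ci['status']})")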

Does anyone know what happened and how we can prevent this from happening again? We're counting on container orchestration to auto-heal from this kind of problem...

The problem here is that something at the infrastructure layer (Docker) quit working, which we don't have a good self-healing story for right now. We can open this as a feature request (e.g., "better recovery from Docker crashes").

That said, there are some other solutions you can explore for now:

The best way to get a stronger availability guarantee is to always use multiple instances (preferably across Availability Zones) to isolate yourself from one-off problems on individual hosts.

You could look into health checks for the instances in your Auto Scaling group, and have it add hosts when Docker is down (see the sketch after these suggestions).

Another thing you could consider is Fargate, which abstracts the infrastructure away entirely.
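
To make the second suggestion concrete: a scheduled watchdog running the agentConnected check sketched earlier could tell the Auto Scaling group to replace a broken host, roughly like this (the instance id is hypothetical):

import boto3

autoscaling = boto3.client("autoscaling")

# If the agentConnected check finds a dead host, mark it Unhealthy so the
# Auto Scaling group terminates it and launches a replacement.
autoscaling.set_instance_health(
    InstanceId="i-0123456789abcdef0",  # hypothetical instance id
    HealthStatus="Unhealthy",
)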

philippefuentes commented 5 years ago

Thank you for your feedback. From what you're saying, the least time-consuming solution for now would be to update our launch configuration to use at least 2 "less powerful" instances instead of only one.
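
That floor can be set with a single Auto Scaling call, roughly as below (a boto3 sketch; the group name is hypothetical, and moving to smaller instance types would additionally need a new launch configuration):

import boto3

autoscaling = boto3.client("autoscaling")

# Keep at least two instances so one host failure can't take the service down.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="vma-cluster-webapp-prod-asg",  # hypothetical name
    MinSize=2,
    DesiredCapacity=2,
)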

Using Fargate would be the next step I think, as we can't really afford to spend additional time on improving our auto scaling processes (we already spent much time setting up proper instance autoscaling using lambda functions).

Is migrating from ECS to Fargate an easy process?

petderek commented 5 years ago

Is migrating from ECS to Fargate an easy process?

Depends on what you are doing. Usually it's pretty straightforward. However, you might be relying on features that aren't supported on Fargate (such as privileged containers).

We have some guides on the subject:
https://containersonaws.com/introduction/ec2-or-aws-fargate/
https://aws.amazon.com/blogs/compute/migrating-your-amazon-ecs-containers-to-aws-fargate/

In addition to our docs: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ECS_GetStarted.html

philippefuentes commented 5 years ago

Thanks a lot for the links, will take a look.

shubham2892 commented 5 years ago

Closing the issue as no further action is required.