aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

Docker Container not reachable inside ec2 instance although its microservice is running #1346

Closed: dennis-ec closed this issue 6 years ago

dennis-ec commented 6 years ago

Summary

I deployed a microservice via ECS. When I start the service, everything is fine. After a seemingly random period the Docker containers won't leave the PENDING status in the AWS console. If I then log in to one of the EC2 instances, I see a long-running UNHEALTHY container. If I query that container with docker stats or docker logs, it does not respond. If I test the service with curl -X GET 0.0.0.0:33641/status, I get the expected answer from a healthy container.

Unfortunately, I haven't found a way to reproduce this error other than running the ECS service at 50 requests per second overnight.

Description

The container runs a Flask micro app that loads a TensorFlow session at startup and serves predictions via a REST API. The EC2 instance is a t2.micro and I deploy with a one-task-per-host placement policy. The IAM role is ecsInstanceRole.
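For context, a rough sketch of the kind of app described here (hypothetical: the /predict route, model path, and tensor names are assumptions; only /status, python app.py, and container port 80 appear in this report):

# Hypothetical sketch of the microservice; the /predict route, model path,
# and tensor names are assumptions and not taken from this report.
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model and create the TensorFlow session once, at startup.
graph = tf.Graph()
with graph.as_default():
    saver = tf.train.import_meta_graph("model/model.meta")
session = tf.Session(graph=graph)
saver.restore(session, "model/model")

@app.route("/status")
def status():
    # Health endpoint, reachable from the host via curl -X GET 0.0.0.0:33641/status
    return jsonify(status="ok")

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    outputs = session.run("output:0", feed_dict={"input:0": [features]})
    return jsonify(prediction=outputs[0].tolist())

if __name__ == "__main__":
    # Flask's built-in development server; container port 80 is mapped to a
    # dynamic host port (33641 in the docker ps output below).
    app.run(host="0.0.0.0", port=80)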

I use multiple container instances in ECS combined with an Application Load Balancer. Starting this setup works just fine, but after some time the described behaviour kicks in. The occurrence of the behaviour is very random. Here are some notes:

Environment Details

[ec2-user@ip-10-0-0-5 ~]$ docker ps
CONTAINER ID        IMAGE                                                                          COMMAND             CREATED             STATUS                    PORTS                   NAMES
d19a78a2eceb        317555189169.dkr.ecr.eu-central-1.amazonaws.com/docker_test_predicter:latest   "python app.py"     46 hours ago        Up 46 hours (unhealthy)   0.0.0.0:33641->80/tcp   ecs-tfpredicter-19-sm08predicter-a2e7bbc6e4aa8d991b00
047b4b0c5e83        amazon/amazon-ecs-agent:lates

The ECS agent logs are stuck in a loop:

2018-04-16T07:26:34Z [INFO] Managed task [arn:aws:ecs:eu-central-1:317555189169:task/50154201-5f96-4a10-b60b-422d625f260d]: 'DockerTimeoutError' error stopping container [sm08predicter]. Ignoring state change: Could not transition to stopped; timed out after waiting 1m0s
2018-04-16T07:26:34Z [INFO] Saving state! module="statemanager"
2018-04-16T07:26:34Z [INFO] Managed task [arn:aws:ecs:eu-central-1:317555189169:task/50154201-5f96-4a10-b60b-422d625f260d]: Cgroup resource set up for task complete
2018-04-16T07:26:34Z [INFO] Task engine [arn:aws:ecs:eu-central-1:317555189169:task/50154201-5f96-4a10-b60b-422d625f260d]: stopping container [sm08predicter]
2018-04-16T07:26:37Z [WARN] Error getting cpu stats, err: No data in the queue, container: d19a78a2eceb6a4ff7ac38107d3c6856855bffaf39e33444f3c17e10ec414da9
2018-04-16T07:26:37Z [WARN] Error publishing metrics: stats engine: no task metrics to report
2018-04-16T07:26:57Z [WARN] Error getting cpu stats, err: No data in the queue, container: d19a78a2eceb6a4ff7ac38107d3c6856855bffaf39e33444f3c17e10ec414da9
2018-04-16T07:26:57Z [WARN] Error publishing metrics: stats engine: no task metrics to report
2018-04-16T07:27:04Z [INFO] DockerGoClient: error stopping container d19a78a2eceb6a4ff7ac38107d3c6856855bffaf39e33444f3c17e10ec414da9: context deadline exceeded
2018-04-16T07:27:17Z [WARN] Error getting cpu stats, err: No data in the queue, container: d19a78a2eceb6a4ff7ac38107d3c6856855bffaf39e33444f3c17e10ec414da9
2018-04-16T07:27:17Z [WARN] Error publishing metrics: stats engine: no task metrics to report
2018-04-16T07:27:34Z [INFO] Task engine [arn:aws:ecs:eu-central-1:317555189169:task/50154201-5f96-4a10-b60b-422d625f260d]: error transitioning container [sm08predicter] to [STOPPED]: Could not transition to stopped; timed out after waiting 1m0s
2018-04-16T07:27:34Z [INFO] Managed task [arn:aws:ecs:eu-central-1:317555189169:task/50154201-5f96-4a10-b60b-422d625f260d]: 'DockerTimeoutError' error stopping container [sm08predicter]. Ignoring state change: Could not transition to stopped; timed out after waiting 1m0s
2018-04-16T07:27:34Z [INFO] Saving state! module="statemanager"
[ec2-user@ip-10-0-0-5 ~]$ docker info
Containers: 2
 Running: 2
 Paused: 0
 Stopped: 0
Images: 3
Server Version: 17.12.1-ce
Storage Driver: devicemapper
 Pool Name: docker-docker--pool
 Pool Blocksize: 524.3kB
 Base Device Size: 10.74GB
 Backing Filesystem: ext4
 Udev Sync Supported: true
 Data Space Used: 1.924GB
 Data Space Total: 23.33GB
 Data Space Available: 21.4GB
 Metadata Space Used: 507.9kB
 Metadata Space Total: 25.17MB
 Metadata Space Available: 24.66MB
 Thin Pool Minimum Free Space: 2.333GB
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.85-38.58.amzn1.x86_64
Operating System: Amazon Linux AMI 2017.09
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 993.2MiB
Name: ip-10-0-0-5
ID: 7NLU:WLTL:GNKE:EINC:NKOL:RPYS:A443:JVR6:UO53:RSZR:3YL7:GUFQ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Supporting Log Snippets

Please have a look at https://github.com/awslabs/ecs-logs-collector

This is not possible because the collector freezes at the step "Trying to inspect running Docker containers and gather Amazon ECS container agent data", similar to what happens when I run docker stats or docker logs on the container hosting the microservice.

adnxn commented 6 years ago

@dennis-ec, thanks for the detailed report. I have a few questions to clarify my understanding.

If I try to call the container with docker stats/logs the container is not responding.

Sounds like the docker daemon on this instance is hanging.

The ec2 instance is t2.micro

It occurs if I test the service with multiple requests per second for a long time

Given the symptoms you are describing, I suspect the problem is a lack of resources on your instance. Would you be able to try using a larger instance and see if the problem goes away?

dennis-ec commented 6 years ago

Thank you very much for your reply:

Sounds like the docker daemon on this instance is hanging.

But I can collect stats/logs from the ECS agent container just fine. Would that be possible with a hanging Docker daemon?

Would you be able to try using a larger instance and see if the problem goes away?

Yes, I will try this and comment later. If it's a resource issue, it might be something in the TensorFlow session building up over time.
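As an illustration of what "building up over time" could mean in a TensorFlow 1.x service (a purely hypothetical sketch, nothing here is taken from the actual app): if the request path adds new ops to the graph on every call, process memory grows steadily, and finalizing the graph turns that silent growth into an immediate error.

# Purely hypothetical illustration of per-request graph growth; not from the actual app.
import tensorflow as tf

session = tf.Session()

# Build the graph once at startup.
input_ph = tf.placeholder(tf.float32, shape=[None, 1])
output_op = input_ph * 2.0

def predict_ok(features):
    # Reuses the ops built above; no graph growth per request.
    return session.run(output_op, feed_dict={input_ph: [features]})

def predict_leaky(features):
    # Anti-pattern: adds new nodes to the default graph on every call, so memory
    # grows until a small instance like a t2.micro runs out of it.
    return session.run(tf.constant([features], dtype=tf.float32) * 2.0)

# After this, predict_leaky() raises RuntimeError instead of leaking silently.
tf.get_default_graph().finalize()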

adnxn commented 6 years ago

Would this be possible with a hanging docker daemon?

We've seen this behavior before, where docker stats/logs/inspect hangs only for certain containers.

Yes, I will try this and comment later. If it's a resource issue, it might be something in the TensorFlow session building up over time.

Okay great! Please let us know if that helps.

adnxn commented 6 years ago

@dennis-ec, closing this since we have not heard back. Feel free to update this if you have any new information. Thanks.

astericky commented 6 years ago

I'm having a similar issue.

dennis-ec commented 6 years ago

@adnxn Thank you. It took a while to test this properly. In the end it seems it was a combination of not enough resources and the Flask debug WSGI server. Flask's built-in WSGI server is not meant for a production environment, so I switched to waitress.
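For reference, a minimal sketch of that switch (assuming the Flask application object is called app and the container still listens on port 80):

# Hypothetical sketch: serve the existing Flask app with waitress instead of
# the built-in development server.
from flask import Flask
from waitress import serve

app = Flask(__name__)
# ... routes as before ...

if __name__ == "__main__":
    # Previously: app.run(host="0.0.0.0", port=80)  # Flask development server
    serve(app, host="0.0.0.0", port=80)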

While this was clearly a mistake on my side, it's strange that the Docker daemon on this particular instance wasn't able to restart itself, or at least kill the hanging container along with the task.