@dennis-ec, thanks for the detailed report. I have a few questions to clarify my understanding.
If I try to call the container with docker stats/logs the container is not responding.
Sounds like the docker daemon on this instance is hanging.
The ec2 instance is t2.micro
It occurs if I test the service with multiple requests per second for a long time
Given the symptoms you are describing, I suspect the problem is lack of resources on your instance. Would you be able to try using a larger instance and see if the problem goes away?
Thank you very much for your reply:
Sounds like the docker daemon on this instance is hanging.
But I can collect the stats/logs from the ECS agent just fine. Would this be possible with a hanging docker daemon?
Would you be able to try using a larger instance and see if the problem goes away?

Yes, I will try this and comment later. If it's a resource issue, it might be something in the TensorFlow session building up over time.
Would this be possible with a hanging docker daemon?
We've seen this behavior before where the stats/logs/inspect only hangs for certain containers.
Yes, I will try this and comment later. If it's a resource issue, it might be something in the TensorFlow session building up over time.
Okay great! Please let us know if that helps.
@dennis-ec, closing this since we have not heard back. Feel free to update this if you have any new information. Thanks.
I'm having a similar issue.
@adnxn Thank you. It did take a while to test this properly. In the end it turned out to be a combination of not enough resources and the Flask debug WSGI server. Flask's built-in WSGI server is not made to run in a production environment, so I switched to waitress instead.
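For anyone hitting the same thing, the switch is essentially serving the Flask app through waitress instead of the debug server. A minimal sketch, assuming the Flask application object is called app in a module named service (both names are illustrative):

```python
# Minimal sketch: serve the Flask app with waitress instead of app.run().
# The module name "service" and the app object name "app" are placeholders.
from waitress import serve

from service import app  # the Flask application object

if __name__ == "__main__":
    # waitress is a production-grade WSGI server; the thread count can be
    # tuned to what the instance's CPU and memory can actually sustain.
    serve(app, host="0.0.0.0", port=5000, threads=4)
```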
While this was clearly a mistake on my side, it's strange that the docker daemon on this particular instance wasn't able to restart itself or at least kill itself along with the task.
Summary
I deployed a microservice via ECS. If I start the service, everything is fine. After a seemingly random period, the docker containers won't leave the PENDING status in the AWS console. If I then log in to one of the EC2 instances, I see a long-running UNHEALTHY instance. If I try to call the container with docker stats/logs, the container is not responding. If I try to test the service via
curl -X GET 0.0.0.0:33641/status
I get the expected response from a healthy container. Unfortunately, I didn't find a way to reproduce this error other than running the ECS service at 50 requests per second overnight.
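A reproduction attempt along these lines could look roughly like the sketch below; the /status endpoint and the ~50 requests per second figure come from the description above, everything else (host, port, error handling) is illustrative:

```python
# Rough sketch of an overnight load test: ~50 GET requests per second
# against the service's /status endpoint. Host and port are placeholders.
import time
import requests

URL = "http://0.0.0.0:33641/status"
RATE = 50  # target requests per second

while True:
    start = time.time()
    for _ in range(RATE):
        try:
            requests.get(URL, timeout=5)
        except requests.RequestException as exc:
            print("request failed:", exc)
    # Sleep off the remainder of the second to approximate the target rate.
    time.sleep(max(0.0, 1.0 - (time.time() - start)))
```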
Description
The container runs a Flask micro app which loads a TensorFlow session at start and does prediction via a REST API. The EC2 instance type is t2.micro, and I deploy with a one-task-per-host policy. The IAM role is ecsInstanceRole.
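For context, the service is structured roughly like the sketch below; the model path, tensor names, and routes are illustrative placeholders, not the actual code:

```python
# Illustrative sketch of a Flask app that loads a TensorFlow (1.x) session
# once at startup and serves predictions over a REST API.
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)

# Build the graph and session once at startup; every request reuses them.
graph = tf.Graph()
with graph.as_default():
    saver = tf.train.import_meta_graph("model/model.meta")  # placeholder path
session = tf.Session(graph=graph)
saver.restore(session, "model/model")  # placeholder checkpoint prefix

@app.route("/status", methods=["GET"])
def status():
    return jsonify(status="ok")

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    # Tensor names depend entirely on how the graph was exported.
    prediction = session.run("output:0", feed_dict={"input:0": [features]})
    return jsonify(prediction=prediction.tolist())
```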
I use multiple container instances in ECS combined with an Application Load Balancer. Starting this setup works just fine, but after some time the described behaviour kicks in. The occurrence of the behaviour is very random. Here are some notes:
- It occurs if I test the service with multiple requests per second for a long time
- It seems like the behaviour is more likely to kick in overnight.
- I couldn't witness a live transition from working to the PENDING state, so it's very hard to pin down how to reproduce it.
Environment Details
The logs of the ECS agent are in a loop:
Supporting Log Snippets
This is not possible because the collector freezes at the step
Trying to inspect running Docker containers and gather Amazon ECS container agent data
This is similar to what happens when I call docker stats/docker logs on the container hosting the microservice.