aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

Problems starting tasks and updating services: Containers stall at PENDING Stage #537

Closed aaithal closed 7 years ago

aaithal commented 8 years ago

Creating on behalf of @Overbryd, from #494:

We are experiencing severe problems starting tasks and updating services.

The behaviour is really flaky, and I could not find a reliable reproduction, but it definitely happens after a certain lifetime of the EC2 container instance.

After we replace the whole EC2 container cluster, everything behaves quite well. But after some time, the agent does not properly start new tasks. Tasks stall in the PENDING state.

What can I do to help debug this problem?

For example, task "f997a595-ed15-4e5b-8bf5-da5adf01cb97" just waits in PENDING. When I check the EC2 instance, it has not even started a Docker process.

Tell me what I can do to help debug this issue. Currently it is halting all our deployments and thus endangers our live system.

aaithal commented 8 years ago

Response from #494:

@Overbryd I checked the volume statistics for the instance on which the task f997a595-ed15-4e5b-8bf5-da5adf01cb97 was scheduled to run. I can see that the IO throughput has dropped on this volume and the latencies have gone up. It seems like this gp2 volume has run out of IO credits and is unable to perform any IO. This, and your comment

But it definitely happens after a certain lifetime of the ec2 container instance.

leads me to believe that you're exhausting IO credits on the second 22 GiB volume attached to the instance. You could fix this issue by either choosing a bigger gp2 volume, so that you speed up the rate at which credits accumulate, or moving to a provisioned IOPS volume. More information on this can be found here.
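A minimal sketch (not the exact change made in this thread) of what the first option could look like when launching a container instance with boto3, using either a larger gp2 data volume or provisioned IOPS. The AMI ID and instance profile are placeholders, and /dev/xvdcz is assumed to be the Docker data volume device of the ECS-optimized AMI of that era:

```python
# Hypothetical launch of an ECS container instance with a larger gp2 data volume.
# All identifiers below are placeholders, not values from this issue.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",                 # placeholder: an ECS-optimized AMI
    InstanceType="m4.large",
    MinCount=1,
    MaxCount=1,
    IamInstanceProfile={"Name": "ecsInstanceRole"},
    BlockDeviceMappings=[
        {
            "DeviceName": "/dev/xvdcz",     # assumed Docker data volume device
            "Ebs": {
                # A bigger gp2 volume raises the IOPS baseline (3 IOPS per GiB),
                # e.g. 500 GiB gives a 1500 IOPS baseline.
                "VolumeSize": 500,
                "VolumeType": "gp2",
                # Or use provisioned IOPS instead:
                # "VolumeType": "io1", "Iops": 1000,
                "DeleteOnTermination": True,
            },
        }
    ],
)
print(response["Instances"][0]["InstanceId"])
```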

Overbryd commented 8 years ago

Thank you very much for having a look at the issue.

Regarding the IO credits on the gp2 volumes, I will investigate in that direction. That is indeed a very good hint that I would not have guessed. However, I cannot find any CloudWatch metric to monitor IO credits on gp2 volumes.

How do you deal with such issues: do you monitor and size volumes accordingly? Is there a recommended gp2 volume configuration for hosting ECS-optimized machines?

jhovell commented 8 years ago

@Overbryd you might be interested in my last comment on the linked issue. I think the ECS-optimized AMI is the "recommended" configuration, but I'm guessing that if you have a high-I/O app (or, in my case, a large number of erroneous container restarts) you may exceed that budget and need to look into customizing your EBS configuration to suit your I/O needs, as well as adding the appropriate monitoring and alerting. I don't think the ECS agent, or ECS in general, has much of a story today around ECS host health and monitoring, but I would reiterate to the ECS team that this would be a very welcome feature and area of improvement.

Also, you can monitor EBS credit usage on the EC2 dashboard under EBS by selecting a volume and going to the Monitoring tab. You'll want to sum "Read Throughput" and "Write Throughput" and compare that sum to the 66 IOPS budget (3 IOPS per GiB with the 22 GiB default volume size) to see whether you are running a deficit or a surplus. In my case I was definitely running a heavy deficit due to constant container restarts on a misbehaving service.

[Screenshot: EBS volume Monitoring tab showing Read Throughput and Write Throughput, 2016-10-05]
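A rough boto3 sketch of the deficit/surplus comparison described above, assuming the console's Read/Write Throughput graphs are backed by the standard VolumeReadOps and VolumeWriteOps EBS metrics; the volume ID and region are placeholders:

```python
# Compare a gp2 volume's average consumed IOPS over the last hour to its baseline
# (3 IOPS per GiB, i.e. 66 IOPS for the default 22 GiB volume).
from datetime import datetime, timedelta
import boto3

VOLUME_ID = "vol-0123456789abcdef0"   # placeholder
VOLUME_SIZE_GIB = 22
PERIOD = 300                          # seconds per datapoint

cw = boto3.client("cloudwatch", region_name="us-east-1")

def avg_ops_per_second(metric_name):
    stats = cw.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName=metric_name,       # "VolumeReadOps" or "VolumeWriteOps"
        Dimensions=[{"Name": "VolumeId", "Value": VOLUME_ID}],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=PERIOD,
        Statistics=["Sum"],
    )
    datapoints = stats["Datapoints"]
    if not datapoints:
        return 0.0
    # Total ops divided by the covered time gives average ops/second.
    return sum(dp["Sum"] for dp in datapoints) / (len(datapoints) * PERIOD)

consumed = avg_ops_per_second("VolumeReadOps") + avg_ops_per_second("VolumeWriteOps")
baseline = 3 * VOLUME_SIZE_GIB
print(f"avg IOPS: {consumed:.1f}, baseline: {baseline}, "
      f"{'deficit' if consumed > baseline else 'surplus'}")
```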

Overbryd commented 8 years ago

@jhovell thank you for the elaborate answer. That was really helpful. We have now switched to 500 GiB gp2 volumes and thus have a much better baseline performance.

I have to say though, a solution that goes in another direction, supporting --tmpfs on ECS tasks, would dramatically reduce IO pressure. Some applications are very IO heavy, but they don't need persistent storage, just a scratch location to read and write stuff from. --tmpfs would therefore be the perfect match, and would leave the IO credits to the applications that are IO heavy and do need persistent storage.
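For context, ECS has since added tmpfs support to task definitions through linuxParameters, which covers this scratch-space use case. A minimal sketch with boto3; the family name, image, and sizes are placeholders:

```python
# Register a task definition whose container gets a RAM-backed /scratch mount,
# so scratch IO never touches the EBS volume's IO credits.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

ecs.register_task_definition(
    family="scratch-heavy-app",             # placeholder
    containerDefinitions=[
        {
            "name": "app",
            "image": "example/app:latest",  # placeholder
            "memory": 512,
            "linuxParameters": {
                "tmpfs": [
                    {
                        "containerPath": "/scratch",
                        "size": 256,        # MiB of RAM-backed scratch space
                        "mountOptions": ["rw", "noexec", "nosuid"],
                    }
                ]
            },
        }
    ],
)
```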

jhovell commented 8 years ago

@Overbryd I don't know much about tmpfs, but doesn't that imply you would need RAM or virtual disk available to back the capacity you need? Might that just push the problem somewhere else, since you will either need to pay for more RAM or for another EBS-backed disk to host virtual memory? I don't know what instance class you are using, but most (if not all) of the current instance types have done away with ephemeral storage, so your only real option for disk-backed storage is EBS, and memory will cost more than disk, assuming you don't just have extra RAM going unused.

Interesting solution. Did you price that out against io1 or another volume type with higher IOPS than gp2? I would have suspected that a smaller, appropriately sized volume with higher IOPS would have been cheaper.

samuelkarp commented 7 years ago

Also I cannot find any CloudWatch metric to monitor IO credits on gp2 volumes.

There is now a new Burst Balance metric that can help provide visibility into the credit balance of gp2 volumes.
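For a single known volume, a minimal sketch of alarming on that metric with boto3; the alarm name, volume ID, SNS topic, and threshold are placeholders:

```python
# Alarm when a gp2 volume's BurstBalance (remaining IO credits, in percent)
# drops below a chosen threshold.
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

cw.put_metric_alarm(
    AlarmName="gp2-burst-balance-low",                    # placeholder
    Namespace="AWS/EBS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],
    Statistic="Minimum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=20.0,                                       # percent
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
)
```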

jhovell commented 7 years ago

@samuelkarp probably the wrong place to ask, but since it is ECS-centric: is there any practical way to use CloudWatch alarms to report on BurstBalance? The CloudWatch metric isn't aggregated by anything logical (e.g. Auto Scaling group or, ideally for our use case, ECS cluster), so it would be necessary to create and delete a separate alarm for each and every EC2 instance/volume as it enters and leaves the fleet. That doesn't seem very practical for proactive alerting or monitoring, right?

jhovell commented 7 years ago

@samuelkarp @Overbryd I created a CloudFormation template that uses a Lambda to aggregate gp2 BurstBalance metrics across an ECS cluster as a custom metric, creates a custom CloudWatch alarm, and marks hosts with low BurstBalance as unhealthy in Auto Scaling (which triggers termination and replacement).

It's for a single cluster, but it could easily be adapted to work across a group of clusters or all clusters found in an account:

https://gist.github.com/jhovell/e6639a0dceecf903193d37e181124110
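A rough sketch of the approach the gist takes (not the gist itself), with the cluster name and threshold as placeholders: aggregate BurstBalance across the cluster's container instances, publish the cluster-wide minimum as a custom metric, and mark exhausted hosts unhealthy in Auto Scaling:

```python
# Lambda handler: find the EBS volumes behind an ECS cluster's container instances,
# read their BurstBalance, publish the minimum as a custom metric, and mark hosts
# with a low balance as unhealthy so Auto Scaling replaces them.
from datetime import datetime, timedelta
import boto3

CLUSTER = "my-ecs-cluster"      # placeholder
THRESHOLD = 20.0                # percent BurstBalance considered unhealthy

ecs = boto3.client("ecs")
ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")
autoscaling = boto3.client("autoscaling")


def burst_balance(volume_id):
    resp = cw.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName="BurstBalance",
        Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
        StartTime=datetime.utcnow() - timedelta(minutes=15),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Minimum"],
    )
    points = resp["Datapoints"]
    return min(dp["Minimum"] for dp in points) if points else 100.0


def handler(event, context):
    # Map the cluster's container instances to their EC2 instance IDs.
    arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
    if not arns:
        return
    described = ecs.describe_container_instances(cluster=CLUSTER, containerInstances=arns)
    instance_ids = [ci["ec2InstanceId"] for ci in described["containerInstances"]]

    balances = []
    for reservation in ec2.describe_instances(InstanceIds=instance_ids)["Reservations"]:
        for instance in reservation["Instances"]:
            worst = min(
                (burst_balance(bdm["Ebs"]["VolumeId"])
                 for bdm in instance["BlockDeviceMappings"] if "Ebs" in bdm),
                default=100.0,
            )
            balances.append(worst)
            if worst < THRESHOLD:
                # Let the Auto Scaling group terminate and replace the exhausted host.
                autoscaling.set_instance_health(
                    InstanceId=instance["InstanceId"],
                    HealthStatus="Unhealthy",
                )

    if balances:
        # Publish the cluster-wide minimum so a single alarm can watch the whole cluster.
        cw.put_metric_data(
            Namespace="Custom/ECS",
            MetricData=[{
                "MetricName": "MinBurstBalance",
                "Dimensions": [{"Name": "ClusterName", "Value": CLUSTER}],
                "Value": min(balances),
                "Unit": "Percent",
            }],
        )
```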

Overbryd commented 7 years ago

@jhovell this metric is gold for everybody running low-end burstable clusters. 🏆

samuelkarp commented 7 years ago

@jhovell @Overbryd It seems like the root cause was low IO credits on your GP2 volumes and that the BurstBalance metric is helpful for seeing the IO credit balance. Please let us know if you run into any further issues.