Open colinbjohnson opened 2 days ago
I encountered and worked around the exact same thing just a few weeks ago. I originally implemented the solution outlined in the AWS article, but its design produced a steady stream of false positives. It does not check that a container instance remains disconnected for X minutes; it only checks that the instance was disconnected at minute 0 and again at minute X. Because connectivity can fluctuate, across a large enough fleet it happens fairly frequently that a container instance disconnects, reconnects, and then disconnects again X minutes later, which triggers an alert.
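To make the false-positive case concrete, here is a small illustrative simulation. It is not the code from the AWS article; `two_point_alert` and `continuous_alert` are hypothetical names, and the per-minute sample list is an assumed simplification of real connectivity data.

```python
from typing import List


def two_point_alert(connected: List[bool], x: int) -> bool:
    """Alert if the agent is disconnected at some minute t and again at
    minute t + x, ignoring everything in between (the flawed check)."""
    return any(
        not connected[t] and not connected[t + x]
        for t in range(len(connected) - x)
    )


def continuous_alert(connected: List[bool], x: int) -> bool:
    """Alert only if the agent is disconnected at every sampled minute
    from t through t + x."""
    return any(
        not any(connected[t : t + x + 1])
        for t in range(len(connected) - x)
    )


# One sample per minute: the agent drops at minute 0, recovers for
# minutes 1-4, then drops again at minute 5.
flapping = [False, True, True, True, True, False]

# The same window with the agent disconnected the whole time.
outage = [False] * 6
```

With `x = 5`, `two_point_alert(flapping, 5)` fires even though the agent was connected for most of the window, while `continuous_alert(flapping, 5)` does not. Both fire for `outage`.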
The improved design we implemented still sends ECS Container Instance State Change events to an SQS queue, but instead of delaying message processing for X minutes and then checking the status once, the message handler checks the container instance status itself. If the instance is still disconnected, we emit a metric and put the message back on the queue to be processed again a minute later. That way we get a steady stream of metrics with about a minute's granularity, allowing for monitors that alert when an instance is continuously disconnected for more than X minutes.
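A minimal sketch of that handler logic, for illustration only and not our actual implementation. The names `handle_message`, `is_connected`, `emit_metric`, and `requeue` are hypothetical. I am assuming the message body is the event JSON with `detail.clusterArn` and `detail.containerInstanceArn`. In a real deployment the three callables would presumably wrap boto3's `describe_container_instances` (reading `agentConnected`), `put_metric_data`, and `send_message` with `DelaySeconds`.

```python
import json
from typing import Callable


def handle_message(
    body: str,
    is_connected: Callable[[str, str], bool],
    emit_metric: Callable[[str, str, int], None],
    requeue: Callable[[str, int], None],
    delay_seconds: int = 60,
) -> bool:
    """Process one queued state-change event.

    Returns True if the message was put back on the queue, False if it
    was dropped because the agent is connected again.
    """
    detail = json.loads(body)["detail"]
    cluster = detail["clusterArn"]
    instance = detail["containerInstanceArn"]

    # Look up the *current* status instead of trusting the value that was
    # in the event when it was first queued.
    if is_connected(cluster, instance):
        return False  # recovered: stop tracking this instance

    # Still disconnected: record one data point, then check again after
    # roughly a minute by re-sending the same message with a delay.
    emit_metric(cluster, instance, 1)
    requeue(body, delay_seconds)
    return True
```

Each pass through the handler emits at most one data point, so an instance that stays disconnected produces roughly one data point per minute, and a monitor can alert on X consecutive data points. The sketch does not cover what to do when the instance no longer exists (for example, after termination); that decision would live in `is_connected`.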
Despite having this workaround in place, I too think that this should be emitted as a CloudWatch metric out of the box.
Tell us about your request
I would like ECS Agent Connected status published to CloudWatch Metrics.
Which service(s) is this request for?
ECS with EC2.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
The typical use case would be to alert on systems where the ECS Agent on a given Container Instance has been disconnected for a period of time and to respond to this event (through either manual or automated means). This is difficult because:
Currently:
Are you currently working around this issue?
Additional context
Attachments
I would happily provide a link to the Lambda solution I've written to demonstrate both: