Closed mattolenik closed 4 years ago
Hi @mattolenik, ECS event stream notifications are sent either when the state of instances in your cluster changes or when the state of tasks in your cluster changes. If you're building a solution to automatically scale your cluster when it's running low on resources, you'd have to reconstruct your cluster state using these events. A sample implementation for the same can be found here.
If you're looking to autoscale across the CPU/Memory dimensions, it's much simpler instead to depend on the CPU/Memory reservation metrics for your cluster. Here's a tutorial for the same. Please let us know if that helps with your current setup/use-case.
Thanks, Anirudh
Anirudh, we'd like to scale only when we receive placement pressure: not preemptively on low resources via CPU or memory, but only in reaction to the event that a task cannot be placed. The reason is that it's not always clear what is and isn't "low resources." If someone wants to place a heavyweight task but the cluster isn't under pressure, preemptively scaling won't catch that. If we instead scale when the cluster can't accept any more items, we are OK with a short delay while the cluster scales up and the tasks get placed.
Our general scaling idea is to scale up by N% on pressure, and constantly scale down by one instance every X minutes. This gives us a saw-tooth pattern in our cluster size and automatically compensates for over-provisioning. However, to do that we need the event fired somewhere we can capture it. Without that event, or a continuing event, we can't realistically scale the cluster.
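The arithmetic behind that saw-tooth policy is simple enough to sketch. The percentage, ASG bounds, and helper names below are illustrative, not anything ECS provides:

```python
import math

def scale_up_target(current, max_size, pct=20):
    # On placement pressure, grow by pct percent (at least one
    # instance), capped at the ASG max size.
    grown = max(current + 1, math.ceil(current * (1 + pct / 100)))
    return min(grown, max_size)

def scale_down_target(current, min_size):
    # Every X minutes, shrink by one instance, never below the ASG min.
    return max(current - 1, min_size)
```

The resulting targets would then be applied with the Auto Scaling `set_desired_capacity` API; repeated scale-downs plus occasional pressure-driven scale-ups produce the saw-tooth.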
The event clearly exists in the event logs in the amazon UI, but we can’t find this event in the event stream. That makes reacting to this event impossible.
We're asking if the event is published anywhere, or if we need to somehow poll some other API to get at that data.
The event clearly exists in the event logs in the amazon UI, but we can’t find this event in the event stream.
You're referring to the event in the service event messages, correct? If yes, you're correct in pointing out that that particular message is not published to the event stream.
I have logged this internally as a feature request.
You can use Lambda and CloudWatch Events (to invoke it every minute) to bump the ASG on a certain event. Example code to catch the message from the scheduler:
```python
import boto3

def check_insufficient_resources(cluster):
    client = boto3.client('ecs')
    paginator = client.get_paginator('list_services')
    response_iterator = paginator.paginate(
        cluster=cluster,
        PaginationConfig={
            'MaxItems': 500,
        }
    )
    insufficient_resources = []
    for page in response_iterator:
        # describe_services accepts at most 10 services per call
        grouped_services = [page['serviceArns'][i:i + 10]
                            for i in range(0, len(page['serviceArns']), 10)]
        for services in grouped_services:
            response = client.describe_services(
                cluster=cluster,
                services=services
            )
            for service in response['services']:
                events = service['events']
                if not events:
                    continue
                sorted_events = sorted(events, key=lambda k: k['createdAt'], reverse=True)
                latest_message = sorted_events[0]['message']
                if 'unable to place a task because no container instance met all of its requirements' in latest_message:
                    insufficient_resources.append(latest_message)
    return insufficient_resources
```
s-maj, that does not work. I already tried that, and the event with the "unable to place a task" never happens. It simply isn't something that can be captured with the event stream. It's not a start or stop event of any kind.
My workaround is to have a scheduled lambda for each cluster that looks for services that have (running tasks + pending tasks < desired tasks) and bump desired capacity in that case.
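That check can be sketched as a small boto3 Lambda. The cluster and ASG names below are placeholders, and the describe call is limited to the first 10 services for brevity:

```python
def services_needing_capacity(services):
    # Services whose tasks cannot all be placed:
    # runningCount + pendingCount < desiredCount.
    return [
        s for s in services
        if s['runningCount'] + s['pendingCount'] < s['desiredCount']
    ]

def handler(event, context):
    import boto3  # imported here so the check above needs no AWS setup
    cluster = 'my-cluster'   # placeholder cluster name
    asg_name = 'my-ecs-asg'  # placeholder ASG name
    ecs = boto3.client('ecs')
    arns = ecs.list_services(cluster=cluster)['serviceArns']
    described = ecs.describe_services(cluster=cluster, services=arns[:10])
    if services_needing_capacity(described['services']):
        asg = boto3.client('autoscaling')
        group = asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=[asg_name])['AutoScalingGroups'][0]
        asg.set_desired_capacity(
            AutoScalingGroupName=asg_name,
            DesiredCapacity=min(group['DesiredCapacity'] + 1,
                                group['MaxSize']))
```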
But I would love for an event to fire if a task can't be placed that can then trigger auto scaling without the intermediate lambda.
+1 I really would love to have this event ("was unable to place a task because the resources could not be found") in cloudwatch events that will allow us to trigger a lambda function to scale-out our ecs cluster.
For a while, something like the approach mentioned by @mattolenik seems OK.
Until this is added, a workaround would be to have a Lambda on a schedule call the API to get the events and save them to CloudWatch.
E.g., in Python with boto3:
```python
import boto3

ecs_client = boto3.client('ecs')
services_list = ecs_client.describe_services(cluster=cluster_arn, services=[service_arn])
for service in services_list['services']:
    print(service['events'])
```
Ideally, it seems like ECS should auto-scale for this case without having to manually set up triggers and lambdas etc. Blue/Green is an important use case for ECS.
As long as you have defined enough headroom between allocated capacity and maxSize, that should be right in ECS's wheelhouse.
+1 for these events to appear in the event stream
I can confirm they do appear in the DescribeServices API output (in the 'events' attribute), so approaches like those @joeykhashab or @s-maj propose, which poll the API rather than rely on CloudWatch events, will work.
It would be much simpler and more robust, though, to be able to trigger this on a CloudWatch event.
+1
The problem with writing Lambdas to poll the ECS API every minute is that it can cause throttling on the ECS API itself, which can then have even more destructive side effects on how other services integrate with ECS. Publishing events about failures to start a task, in particular these, would be really good. We could then act upon them, gather metrics, and generate alerts. It would be super useful.
Would appreciate it if this can be taken as a feature request.
+1 this is kind of a critical event that would be very useful to utilize.
Thanks everyone for the feedback! Please be assured that we on the ECS team are aware of this issue, and that it is under active consideration. +1's and additional details on use cases are always appreciated and will help inform our work moving forward.
We are in the exact same scenario as several others in this thread. We currently have something similar to the tutorial setup, where the ECS cluster will auto-scale if the CPU or memory headroom falls below a certain percent. This doesn't work, however, when the sum of the free CPU/memory across the cluster is still within the threshold, but no single EC2 instance has sufficient headroom to launch a new task. We could have 10 EC2 instances, all with 400 CPU units free, and the alarm won't trigger, but a single task that requires 500 CPU units will fail and auto-scaling never happens.
Same scenario. Manual workaround is to use CW Events as described here: https://stackoverflow.com/questions/42394656/how-to-listen-for-an-insufficient-cpu-memory-event-in-an-aws-ecs-service
This can trigger Lambda, SNS or any other mechanism so you are aware there are workloads not able to start.
I find it very strange that something so obviously very useful still hasn't been made available. Using ECS internal knowledge to scale the cluster is the only way that actually makes sense.
@toredash Unfortunately that SO solution only works if there is another AWS API Call made on the service. New API Calls will get logged and will include the list of events that have happened on the service. But if you only ever do a CreateService, the service will silently fail to be placed.
+1 Very useful, especially when you don't have the liberty to add as many instances as you like; it also allows better CI/CD checks.
Could also be a Cloudwatch metric "NumberOfUnschedulableTasks" or something similar. Makes it easy to integrate with scaling policies.
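A rough sketch of how such a metric could be computed and published by a scheduled Lambda. The namespace, metric name, and matched message string below are illustrative, not anything AWS publishes today:

```python
def count_unschedulable(services):
    # Count services whose most recent event reports a placement
    # failure. `services` is the 'services' list returned by
    # ecs.describe_services.
    needle = 'unable to place a task'
    count = 0
    for svc in services:
        events = sorted(svc.get('events', []),
                        key=lambda e: e['createdAt'], reverse=True)
        if events and needle in events[0]['message']:
            count += 1
    return count

def publish_metric(cluster, value):
    import boto3  # imported here so count_unschedulable stays AWS-free
    cloudwatch = boto3.client('cloudwatch')
    cloudwatch.put_metric_data(
        Namespace='Custom/ECS',  # illustrative custom namespace
        MetricData=[{
            'MetricName': 'NumberOfUnschedulableTasks',
            'Dimensions': [{'Name': 'ClusterName', 'Value': cluster}],
            'Value': value,
            'Unit': 'Count',
        }])
```

A standard CloudWatch alarm on that custom metric could then drive an ASG scaling policy directly.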
This. Please! I don't want to have another Lambda function just to catch this. It would be much better to have it exposed in CloudWatch, where it can then be natively supported by CloudFormation.
The best way that I found to automatically scale an ECS cluster with what AWS currently exposes publicly is to have a Lambda function publish a custom CloudWatch metric with the number of the largest containers that can currently be scheduled in the cluster. EC2 instance autoscaling is triggered off of this custom metric and always keeps enough room in the cluster to start at least one of the largest tasks in the cluster. More details, including the code are at https://techblog.realtor.com/a-better-ecs/ and https://github.com/MoveInc/ecs-cloudformation-templates/blob/master/ECS-Cluster.template.
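The core of that custom metric is a simple fit count; here is a minimal sketch (the instance capacity tuples and task sizes are illustrative):

```python
def schedulable_count(instances, task_cpu, task_mem):
    # How many copies of the largest task definition could still be
    # placed, given each instance's remaining (cpu, memory) capacity.
    total = 0
    for free_cpu, free_mem in instances:
        total += min(free_cpu // task_cpu, free_mem // task_mem)
    return total
```

Scaling out whenever this number drops below 1 keeps room for at least one of the largest tasks, which also covers the 400-free-vs-500-needed CPU scenario described earlier in the thread.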
I would love to see better integration between EC2 and ECS autoscaling so that a lot of this work won't be necessary.
Would this solution work? https://github.com/aws/containers-roadmap/issues/76#issuecomment-485038973
FYI this launched recently: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-ecs-service-events-now-available-as-cloudwatch-events/
Hi, I am trying to start/stop one EC2 instance (say instance B) from Lambda based on the CPU utilization of a different EC2 instance (instance A), e.g.:
1. EC2 A CPU utilization < 20%: stop EC2 B
2. EC2 A CPU utilization > 80%: start EC2 B
I tried a CloudWatch alarm, but it stops/starts the same EC2 instance rather than a different one. I created a CloudWatch rule to trigger a Lambda from which I will start/stop the instance(s), but rules don't provide a CPU-utilization-based event trigger, only a scheduled date/time. Do you have any experience with this, or know of any material/links where similar work has been done? I would really appreciate your help and a prompt response. Thanks
I'm trying to get some event for when an ECS task fails to be placed, specifically when a task cannot be placed due to insufficient resources. I want this event here to trigger a Lambda, which I can use to respond with scale-out actions.
I have tried listening to the ECS event stream, but no event at all was triggered for the task placement failure; the Lambda trigger never fired. I also didn't see anything in CloudWatch Logs for ECS at all.
Is there any way to receive notification of this event? We are able to alert on it in DataDog, how do they get it? Do we need to resort to polling?