aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] How to programmatically get event "unable to place a task because the resources could not be found" #121

Closed mattolenik closed 4 years ago

mattolenik commented 6 years ago

I'm trying to get an event when an ECS task fails to be placed, specifically when it cannot be placed due to insufficient resources. I want this event to trigger a Lambda, which I can use to respond with scale-out actions.

I have tried listening to the ECS Event Stream, but no event at all was triggered for the task placement failure, so the Lambda trigger never fired. I also didn't see anything in the CloudWatch logs for ECS.

Is there any way to receive notification of this event? We are able to alert on it in DataDog; how do they get it? Do we need to resort to polling?

aaithal commented 6 years ago

Hi @mattolenik, ECS event stream notifications are sent either when the state of instances in your cluster changes or when the state of tasks in your cluster changes. If you're building a solution to automatically scale your cluster when it's running low on resources, you'd have to reconstruct your cluster state using these events. A sample implementation for the same can be found here.

If you're looking to autoscale across the CPU/Memory dimensions, it's much simpler instead to depend on the CPU/Memory reservation metrics for your cluster. Here's a tutorial for the same. Please let us know if that helps with your current setup/use-case.
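A minimal sketch of the reservation-based approach, assuming a CloudWatch alarm on the cluster's `MemoryReservation` (or `CPUReservation`) metric in the `AWS/ECS` namespace. The function names, the alarm naming scheme, the 75% threshold, and the period settings are illustrative assumptions, not values from the tutorial.

```python
def reservation_alarm_params(cluster_name, metric_name="MemoryReservation",
                             threshold=75.0, scale_out_policy_arn=None):
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    params = {
        "AlarmName": f"{cluster_name}-{metric_name}-high",  # hypothetical naming scheme
        "Namespace": "AWS/ECS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "Statistic": "Average",
        "Period": 60,            # seconds per datapoint (assumed)
        "EvaluationPeriods": 3,  # consecutive breaching datapoints (assumed)
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }
    if scale_out_policy_arn:
        # The alarm action would typically be an Auto Scaling policy ARN.
        params["AlarmActions"] = [scale_out_policy_arn]
    return params


def create_alarm(params):
    """Create the alarm; requires AWS credentials."""
    import boto3  # imported lazily so the builder above works without AWS access

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameter builder separate from the AWS call makes the alarm definition easy to inspect before anything is created.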

Thanks, Anirudh

devshorts commented 6 years ago

Anirudh, we'd like to scale only when we receive placement pressure. So not preemptively on low resources via CPU or memory, but only in reaction to the event that a task cannot be placed. The reason is that it's not always clear what is and isn't low resources. If someone wants to place a heavyweight task but the cluster isn't under pressure, preemptively scaling won't catch that. If we instead scale when the cluster can't accept any more items, we are OK with a short delay while the cluster scales up and then tasks get placed.

Our general scaling idea is to scale up by N% on pressure, and constantly scale down by 1 box every X minutes. This will give us a saw-tooth pattern of our cluster size, and auto-compensate for over-provisioning. However, to do that we need the event fired somewhere we can capture it. Without that event, or a continuing event, we can't realistically scale the cluster.
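The saw-tooth policy above can be sketched as two small capacity calculations. This is only an illustration: the function names, the rounding rule (always grow by at least one instance), and the min/max bounds are assumptions, and how the result is applied to the Auto Scaling group is left out.

```python
import math


def scale_out_capacity(current, percent, max_size):
    """Capacity after a placement-pressure event: grow by `percent`, at least 1 box."""
    step = max(1, math.ceil(current * percent / 100))
    return min(max_size, current + step)


def scale_in_capacity(current, min_size):
    """Capacity after one periodic scale-in tick: drop a single box."""
    return max(min_size, current - 1)


# Example: a 10-box cluster under pressure with N = 20, then two scale-in ticks.
capacity = scale_out_capacity(10, 20, max_size=50)
capacity = scale_in_capacity(capacity, min_size=2)
capacity = scale_in_capacity(capacity, min_size=2)
```

Run on a schedule, the scale-in tick keeps trimming the cluster until the next pressure event pushes it back up, which produces the saw-tooth shape.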

The event clearly exists in the event logs in the amazon UI, but we can’t find this event in the event stream. That makes reacting to this event impossible.

We’re asking if the event is published anywhere, or if we need to somehow poll some other API to get at that data.

aaithal commented 6 years ago

> The event clearly exists in the event logs in the amazon UI, but we can’t find this event in the event stream.

You're referring to the event in the service event messages, correct? If so, you're right that this particular message is not published to the event stream.

I have logged this internally as a feature request.

s-maj commented 6 years ago

You can use a Lambda, invoked every minute by CloudWatch Events, to bump the ASG on a certain event. Example code to catch the message from the scheduler:

import boto3


def check_insufficient_resources(cluster):
    """Return the latest event message of each service that failed task placement."""
    client = boto3.client('ecs')

    paginator = client.get_paginator('list_services')
    response_iterator = paginator.paginate(
        cluster=cluster,
        PaginationConfig={
            'MaxItems': 500,
        }
    )

    insufficient_resources = []
    for page in response_iterator:
        # describe_services accepts at most 10 services per call
        grouped_services = [page['serviceArns'][i:i + 10] for i in range(0, len(page['serviceArns']), 10)]
        for services in grouped_services:
            response = client.describe_services(
                cluster=cluster,
                services=services
            )
            for service in response['services']:
                events = service['events']
                if not events:
                    # A service with no events yet has nothing to check
                    continue
                # Only the most recent event is inspected
                latest_message = max(events, key=lambda k: k["createdAt"])['message']

                if 'unable to place a task because no container instance met all of its requirements' in latest_message:
                    insufficient_resources.append(latest_message)

    return insufficient_resources

mattolenik commented 6 years ago

s-maj, that does not work. I already tried that, and the event with the "unable to place a task" never happens. It simply isn't something that can be captured with the event stream. It's not a start or stop event of any kind.

willthames commented 6 years ago

My workaround is to have a scheduled lambda for each cluster that looks for services that have (running tasks + pending tasks < desired tasks) and bump desired capacity in that case.
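A rough sketch of that check, assuming the `runningCount`, `pendingCount`, and `desiredCount` fields returned by DescribeServices. The function names are made up for illustration, and the step that bumps the Auto Scaling group's desired capacity is omitted.

```python
def services_short_of_desired(services):
    """Names of services where running + pending tasks fall short of desired."""
    return [
        s["serviceName"]
        for s in services
        if s["runningCount"] + s["pendingCount"] < s["desiredCount"]
    ]


def fetch_services(cluster):
    """Collect DescribeServices entries for every service in the cluster."""
    import boto3  # lazy import: only needed when actually calling AWS

    client = boto3.client("ecs")
    arns = []
    for page in client.get_paginator("list_services").paginate(cluster=cluster):
        arns.extend(page["serviceArns"])
    services = []
    for i in range(0, len(arns), 10):  # DescribeServices takes at most 10 per call
        resp = client.describe_services(cluster=cluster, services=arns[i:i + 10])
        services.extend(resp["services"])
    return services
```

A scheduled Lambda could call `services_short_of_desired(fetch_services(cluster))` and scale out whenever the result is non-empty.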

But I would love for an event to fire if a task can't be placed that can then trigger auto scaling without the intermediate lambda.

nullck commented 6 years ago

+1 I really would love to have this event ("was unable to place a task because the resources could not be found") in CloudWatch Events, which would allow us to trigger a Lambda function to scale out our ECS cluster.

For now, something like what @mattolenik mentioned seems OK.

joeykhashab commented 6 years ago

Until this is added, a workaround would be to have a Lambda call the APIs on a schedule to get the events and save them to CloudWatch.

For example, in Python with boto3:

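import boto3

ecs_client = boto3.client('ecs')
# cluster_arn and service_arn identify the cluster and service to inspect
services_list = ecs_client.describe_services(cluster=cluster_arn, services=[service_arn])
for service in services_list['services']:
    print(service['events'])
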
tkersh commented 6 years ago

Ideally, it seems like ECS should auto-scale for this case without having to manually set up triggers and lambdas etc. Blue/Green is an important use case for ECS.

As long as you have defined enough headroom between allocated capacity and maxSize, that should be right in ECS's wheelhouse.

jonathonsim commented 6 years ago

+1 for these events to appear in the event stream

I can confirm they do appear in the DescribeServices API output (in the 'events' attribute), so approaches like the ones @joeykhashab and @s-maj propose, which poll the API rather than CloudWatch Events, will work.

It would be much simpler and more robust, though, to be able to trigger this on a CloudWatch event.

dsouzajude commented 6 years ago

+1

The problem with writing Lambdas that poll the ECS API every minute is that they could cause throttling on the ECS API itself, which can then have even more destructive side effects on how other services integrate with ECS. Publishing events about failures to start a task, these in particular, would be really good. We could then act on them, collect metrics, and generate alerts. It would be super useful.

Would appreciate it if this could be taken as a feature request.

hampsterx commented 6 years ago

+1 this is kind of a critical event that would be very useful to have.

coultn commented 5 years ago

Thanks everyone for the feedback! Please be assured that we on the ECS team are aware of this issue, and that it is under active consideration. +1's and additional details on use cases are always appreciated and will help inform our work moving forward.

efenderbosch commented 5 years ago

We are in the exact same scenario as several others in this thread. We currently have something similar to the tutorial setup, where the ECS cluster will auto-scale if the CPU or memory headroom falls below a certain percentage. This doesn't work, however, when the sum of the free CPU/memory across the cluster is still within the threshold, but there's no single EC2 instance with sufficient headroom to launch a new task. We could have 10 EC2 instances, all with 400 CPU free, and the alarm won't trigger, but a single task that requires 500 CPU will fail and auto-scaling never happens.
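The 10-instance example can be worked through in a few lines. This is purely illustrative (the function name is made up): a cluster-wide total of free CPU hides the fact that no single instance can fit the task.

```python
def can_place(free_cpu_per_instance, task_cpu):
    """A task fits only if one instance alone has enough free CPU units."""
    return any(free >= task_cpu for free in free_cpu_per_instance)


free_cpu = [400] * 10           # 10 instances, 400 CPU units free on each
total_free = sum(free_cpu)      # 4000 units free cluster-wide
placeable = can_place(free_cpu, task_cpu=500)  # False: no instance has 500 free
```

An alarm keyed on the aggregate sees 4000 free units and stays quiet, while the 500-unit task still cannot be placed.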

toredash commented 5 years ago

Same scenario. Manual workaround is to use CW Events as described here: https://stackoverflow.com/questions/42394656/how-to-listen-for-an-insufficient-cpu-memory-event-in-an-aws-ecs-service

This can trigger Lambda, SNS or any other mechanism so you are aware there are workloads not able to start.

siwyd commented 5 years ago

I find it very strange that something so obviously very useful still hasn't been made available. Using ECS internal knowledge to scale the cluster is the only way that actually makes sense.

mtsr commented 5 years ago

@toredash Unfortunately that SO solution only works if there is another AWS API Call made on the service. New API Calls will get logged and will include the list of events that have happened on the service. But if you only ever do a CreateService, the service will silently fail to be placed.

vimmis commented 5 years ago

+1 Very useful, especially when you don't have the liberty to add as many instances as you like, and it allows better CI/CD checks.

jespersoderlund commented 5 years ago

Could also be a Cloudwatch metric "NumberOfUnschedulableTasks" or something similar. Makes it easy to integrate with scaling policies.

kalpik commented 5 years ago

> Could also be a Cloudwatch metric "NumberOfUnschedulableTasks" or something similar. Makes it easy to integrate with scaling policies.

This. Please! I don't want to have another Lambda function just to catch this. It would be much better to have it exposed in CloudWatch, and then it could be supported natively by CloudFormation.

masneyb commented 5 years ago

The best way that I found to automatically scale an ECS cluster with what AWS currently exposes publicly is to have a Lambda function publish a custom CloudWatch metric with the number of the largest containers that can currently be scheduled in the cluster. EC2 instance autoscaling is triggered off of this custom metric and always keeps enough room in the cluster to start at least one of the largest tasks in the cluster. More details, including the code are at https://techblog.realtor.com/a-better-ecs/ and https://github.com/MoveInc/ecs-cloudformation-templates/blob/master/ECS-Cluster.template.
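The metric described above might be computed roughly as follows. This is a sketch, not the code from the linked posts: it assumes the `remainingResources` entries (`name` and `integerValue`) returned by DescribeContainerInstances, and the function names are hypothetical.

```python
def remaining(container_instance):
    """Extract free CPU and MEMORY from one DescribeContainerInstances entry."""
    return {
        r["name"]: r["integerValue"]
        for r in container_instance["remainingResources"]
        if r["name"] in ("CPU", "MEMORY")
    }


def schedulable_count(container_instances, task_cpu, task_memory):
    """How many copies of the largest task the cluster could still place."""
    total = 0
    for ci in container_instances:
        free = remaining(ci)
        # Each instance is limited by whichever resource runs out first.
        total += min(free["CPU"] // task_cpu, free["MEMORY"] // task_memory)
    return total
```

Publishing this number as a custom CloudWatch metric and alarming when it drops below 1 would trigger a scale-out before the largest task becomes unplaceable.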

I would love to see better integration between EC2 and ECS autoscaling so that a lot of this work won't be necessary.

coultn commented 5 years ago

Would this solution work? https://github.com/aws/containers-roadmap/issues/76#issuecomment-485038973

coultn commented 4 years ago

FYI this launched recently: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-ecs-service-events-now-available-as-cloudwatch-events/
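With service events published to CloudWatch Events, a rule can match placement failures directly. The sketch below assumes the service action event shape (`detail-type` of "ECS Service Action" with an `eventName` of `SERVICE_TASK_PLACEMENT_FAILURE`); verify the field names against the current ECS events documentation. The handler name and return value are illustrative only.

```python
# Event pattern for a CloudWatch Events rule (assumed shape, see note above).
PLACEMENT_FAILURE_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Service Action"],
    "detail": {"eventName": ["SERVICE_TASK_PLACEMENT_FAILURE"]},
}


def is_placement_failure(event):
    """True if the event is an ECS service task placement failure."""
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.ecs"
        and event.get("detail-type") == "ECS Service Action"
        and detail.get("eventName") == "SERVICE_TASK_PLACEMENT_FAILURE"
    )


def lambda_handler(event, context):
    """Decide whether to scale out; the actual scaling call is left out."""
    if not is_placement_failure(event):
        return {"scale_out": False}
    detail = event["detail"]
    return {
        "scale_out": True,
        "cluster": detail.get("clusterArn"),
        "reason": detail.get("reason"),
    }
```

The rule filters events before the Lambda runs, so the check inside the handler is only a safety net for misrouted events.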

Hackathon-G7 commented 4 years ago

Hi, I am trying to start/stop an EC2 instance (say, instance ID B) from a Lambda based on the CPU utilization of a different EC2 instance (instance ID A). For example:

1. EC2 A CPU utilization below 20%: stop EC2 B
2. EC2 A CPU utilization above 80%: start EC2 B

I tried a CloudWatch alarm, but it stops/starts the same EC2 instance rather than a different one. I created a CloudWatch rule to trigger a Lambda from which I will start/stop the EC2 instance(s), but the rule doesn't provide a CPU-utilization-based event trigger, only a scheduled date/time. Do you have any experience with this, or know of any material/link where similar work has been done? I would really appreciate your help and a prompt response. Thanks