aws-solutions / instance-scheduler-on-aws

A cross-account and cross-region solution that allows customers to automatically start and stop EC2 and RDS Instances
https://aws.amazon.com/solutions/implementations/instance-scheduler-on-aws/
Apache License 2.0

ASGHandler - High Memory needs (Runtime.ExitError) #552

Open FugroEgger opened 1 month ago

FugroEgger commented 1 month ago

Describe the bug
Stack region: eu-west-1
Stack name: cs-instance-scheduler
TagName: scheduler_period
UsingAWSOrganizations: Yes
Regions: ap-southeast-1, ap-southeast-2, eu-central-1, eu-west-1, eu-west-3, eu-north-1, me-south-1, me-central-1, us-east-1, us-east-2, us-west-1
InstanceScheduler memory: 256 MB needed (template setting MemorySize)
100+ spoke accounts
ASG scheduling works for the sandbox account using it; no other account uses the tag yet, as this is a new deployment.

Some accounts/regions need more than 700 MB of memory for the ASGHandler execution, otherwise a Runtime.ExitError occurs, even in accounts/regions that do not use the EC2 or RDS services.

The scheduler:run event also contains schedule_names that are not used in this account.

{
    "level": "INFO",
    "location": "decorate:448",
    "message": {
        "action": "scheduler:run",
        "account": "777777777777",
        "region": "us-east-1",
        "service": "asg",
        "current_dt": "2024-06-13T07:41:58.576978+00:00",
        "dispatch_time": "2024-06-13T07:41:58.576978+00:00",
        "schedule_names": [
            "debug2",
            "debug1"
        ]
    },
    "timestamp": "2024-06-13T07:42:05.081+00:00",
    "service": "asg",
    "cold_start": false,
    "function_name": "cs-instance-scheduler-ASGHandler0F6D6751-CZoivUbvaNX1",
    "function_memory_size": "1024",
    "function_arn": "arn:aws:lambda:eu-west-1:316891253549:function:cs-instance-scheduler-ASGHandler0F6D6751-CZoivUbvaNX1",
    "function_request_id": "531e9a6b-f6e1-4c02-9a27-32272ea6688b",
    "correlation_id": "b2e3e8e2-74b1-47cc-9ea0-c96f2c8ff5d2",
    "xray_trace_id": "1-6669a578-dd17be3a3a1c08c29997006a"
}

High Memory usage: REPORT RequestId: 531e9a6b-f6e1-4c02-9a27-32272ea6688b Duration: 702.22 ms Billed Duration: 703 ms Memory Size: 1024 MB Max Memory Used: 706 MB
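The REPORT lines with the per-invocation memory figures can also be pulled across the whole log group with a CloudWatch Logs Insights query instead of checking streams one by one. A sketch via boto3 (the log group name is assumed from the function name shown above):

import time
import boto3

logs = boto3.client("logs", region_name="eu-west-1")

# Sketch: surface max memory used per invocation from the Lambda REPORT lines.
# Log group name assumed from the function name in the logs above.
query_id = logs.start_query(
    logGroupName="/aws/lambda/cs-instance-scheduler-ASGHandler0F6D6751-CZoivUbvaNX1",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString=(
        "filter @type = 'REPORT' "
        "| fields @requestId, @maxMemoryUsed / 1000 / 1000 as maxMemMB "
        "| sort maxMemMB desc "
        "| limit 20"
    ),
)["queryId"]

while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
        break
    time.sleep(1)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})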

Account 777777777777 in us-east-1 has 3 VPCs, no EC2 instances, no RDS databases, no Auto Scaling groups, and uses remote stack 3.0.0. The schedules "debug1" and "debug2" are not used in this account; these schedule names are only used in the sandbox account in the eu-west-1 region and nowhere else in the AWS Organization spoke accounts. A lot of accounts stay below 128 MB of usage.

To Reproduce
It is not happening in our development environment, but we only have 2 supported regions and 2 spoke accounts deployed there.

Expected behavior
Lower memory requirements: the default memory size of 128 MB should suffice, or the CloudFormation template needs a configuration parameter for it.


Screenshots
None.

Additional context
The memory size of instance_scheduler.handler.asg.lambda_handler is fixed at 128 MB in the CloudFormation stack template. For testing, the value was manually changed on the already deployed Lambda function. All accounts are managed by Control Tower; a Control Tower setting prevents default VPC creation, so supported scheduling regions can have 0 VPCs.
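For reference, the manual change used for testing is equivalent to the following boto3 call (a sketch; 1024 MB is simply the value used while testing, and the next CloudFormation stack update will overwrite it again):

import boto3

# Sketch: temporarily raise the memory of the already deployed ASG handler for testing.
# CloudFormation will reset this value on the next stack update/deployment.
lambda_client = boto3.client("lambda", region_name="eu-west-1")
lambda_client.update_function_configuration(
    FunctionName="cs-instance-scheduler-ASGHandler0F6D6751-CZoivUbvaNX1",
    MemorySize=1024,
)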

CrypticCabub commented 1 month ago

Thanks for providing such a detailed report! I really appreciate it.

Do you know if this error is consistent, or does it come and go randomly? I am still investigating possible sources of the memory usage, but the thing that stands out to me first is that "cold_start" was false, so it's possible the memory usage is coming from previous executions. If this error occurs consistently for the same account/region, then warm starts likely aren't the culprit; if it occurs randomly, they would seem like a much more likely candidate.
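For context on why warm starts matter here: anything stored at module scope in a Lambda function survives into later invocations of the same execution environment. A generic illustration (not the solution's actual handler code):

# Generic illustration of warm-start memory retention in AWS Lambda, not the
# solution's actual handler code: a module-level cache keeps growing across
# warm invocations until the execution environment is recycled.
_results_cache = []

def lambda_handler(event, context):
    _results_cache.append(event)  # state from earlier warm invocations accumulates here
    return {"cached_events": len(_results_cache)}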

FugroEgger commented 1 month ago

Consistent, but not for all accounts; some with "cold_start": false are successful. DynamoDB table: we have 196 entries in remote_account_ids (all v3.0.0), and only 1 account (3 EC2 + 1 RDS, no ASG) in the state table.

I have switched back to 128 MB for the ASGHandler for cost reasons. The solution is running every 60 minutes, but the errors appear constantly over the whole hour. The ASGHandler seems to run more often, roughly every 5 minutes (when I restarted the solution scheduling after 2 hours, in the middle of an hour, the ASGHandler logged entries 5 minutes later).

Between 6:11 and 6:13 I have 150 streams in the ASGHandler0F6D6751 log group (errors + successful).

A successful run looks like this (also with cold_start false):

{
    "level": "INFO",
    "location": "decorate:448",
    "message": {
        "action": "scheduler:run",
        "account": "333333333",
        "region": "eu-north-1",
        "service": "asg",
        "current_dt": "2024-06-14T06:33:16.524699+00:00",
        "dispatch_time": "2024-06-14T06:33:16.524699+00:00",
        "schedule_names": []
    },
    "timestamp": "2024-06-14T06:33:56.410+00:00",
    "service": "asg",
    "cold_start": false,
    "function_name": "cs-instance-scheduler-ASGHandler0F6D6751-CZoivUbvaNX1",
    "function_memory_size": "128",
    "function_arn": "arn:aws:lambda:eu-west-1:316891253549:function:cs-instance-scheduler-ASGHandler0F6D6751-CZoivUbvaNX1",
    "function_request_id": "39da27cc-737d-492e-8b7a-dfd658b79576",
    "correlation_id": "0484abd9-be2c-4a58-9be6-68882a05d374",
    "xray_trace_id": "1-666baaf6-e682e503e201fca04f038732"
}

REPORT RequestId: 39da27cc-737d-492e-8b7a-dfd658b79576 Duration: 5367.90 ms Billed Duration: 5368 ms Memory Size: 128 MB Max Memory Used: 121 MB

Calls with no schedule_names also fail:

{
    "level": "INFO",
    "location": "decorate:448",
    "message": {
        "action": "scheduler:run",
        "account": "55555555",
        "region": "eu-central-1",
        "service": "asg",
        "current_dt": "2024-06-14T05:52:51.828610+00:00",
        "dispatch_time": "2024-06-14T05:52:51.828610+00:00",
        "schedule_names": []
    },
    "timestamp": "2024-06-14T05:53:37.650+00:00",
    "service": "asg",
    "cold_start": false,
    "function_name": "cs-instance-scheduler-ASGHandler0F6D6751-CZoivUbvaNX1",
    "function_memory_size": "128",
    "function_arn": "arn:aws:lambda:eu-west-1:316891253549:function:cs-instance-scheduler-ASGHandler0F6D6751-CZoivUbvaNX1",
    "function_request_id": "0f6a8429-d91b-4294-97da-738aff89a17b",
    "correlation_id": "5f5b4a4e-717a-4db4-ba40-a7a8e472a4c9",
    "xray_trace_id": "1-666baaf6-e682e503e201fca04f038732"
}

REPORT RequestId: 0f6a8429-d91b-4294-97da-738aff89a17b Duration: 66356.88 ms Billed Duration: 66357 ms Memory Size: 128 MB Max Memory Used: 129 MB

Do you need additional info? Debug logging (Trace) is now turned on.

CrypticCabub commented 1 month ago

Yes, any trace information you have from failed runs would be very much appreciated, and if you have a failure that includes "cold_start": true, that would also be great for confirmation.

If you would rather not post full traces to this thread you can send them to my work email: calepea@amazon.com

For reference, there are 2 things that should trigger the ASG handler to take action:

If the errors are consistently occurring in specific accounts, then I wonder if we are receiving far more data than expected from one of the describe calls. I'll take another look at the control flow to confirm how long the results of those calls stay in memory.
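For illustration only (not the solution's actual code), the pattern I'll be checking for is whether those describe results are collected all at once instead of being consumed page by page; with boto3 the page-by-page version looks roughly like this:

import boto3

# Rough sketch (not the solution's actual code): iterate Auto Scaling groups
# page by page so only one page of describe results is held in memory at a time.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")
paginator = autoscaling.get_paginator("describe_auto_scaling_groups")

for page in paginator.paginate(PaginationConfig={"PageSize": 100}):
    for group in page["AutoScalingGroups"]:
        print(group["AutoScalingGroupName"], len(group.get("Tags", [])))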

FugroEgger commented 1 month ago

Where can I find the trace log entries? In the ASGHandler0F6D6751 log stream I can only see messages containing "level": "INFO"; should there be a "level": "DEBUG"? (CloudFormation Trace: Yes)

Can you also check that a deactivated region in an account is not a problem? Some of the errors (cold_start: false) are in an account in the me-south-1 region, which is not activated in that account. Middle East regions are not enabled by default in the spoke accounts; only some spoke accounts use them. It is enabled in the hub account and configured in the solution's region settings.

CrypticCabub commented 1 month ago

On the AWS Lambda console, go to your ASG handler function and click Monitor -> View X-Ray traces.
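If you prefer to pull them programmatically instead of through the console, a rough sketch with the X-Ray API via boto3 (this simply scans the last hour of traces in the region and flags the ones with errors or faults; narrowing to the ASG handler specifically is easiest in the console):

from datetime import datetime, timedelta, timezone

import boto3

xray = boto3.client("xray", region_name="eu-west-1")

# Rough sketch: list recent trace summaries and flag the ones with errors/faults.
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

response = xray.get_trace_summaries(StartTime=start, EndTime=end)
for summary in response.get("TraceSummaries", []):
    if summary.get("HasError") or summary.get("HasFault"):
        print(summary["Id"], summary.get("ResponseTime"))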

The expected behavior for a deactivated region in certain accounts would be an STS assume role error when the lambda first initializes. This is technically an error and will be reported as such, but the error will not cascade into other scheduling targets. We do not currently have a mechanism to suppress these errors in the logs or dashboard metrics, however.
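As a quick way to confirm which spoke accounts actually have the opt-in regions (me-south-1, me-central-1) enabled, the Account API exposes the opt-in status. A sketch, assuming the caller has account:GetRegionOptStatus permission in the spoke account:

import boto3

# Sketch: check whether an opt-in region is enabled for the current account.
account = boto3.client("account")
for region in ("me-south-1", "me-central-1"):
    status = account.get_region_opt_status(RegionName=region)
    print(status["RegionName"], status["RegionOptStatus"])  # e.g. ENABLED / DISABLED / ENABLED_BY_DEFAULT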

CrypticCabub commented 1 month ago

We could also set up a time to get on a call with a screen share to help troubleshoot exactly what's going on. I am on EDT (UTC-4) and can be fairly flexible.

CrypticCabub commented 3 weeks ago

Lambda memory allocation has been made configurable in v3.0.1 along with some minor updates to the control flow for the ASG scheduling function, but I have been unable to reproduce any high memory usage scenarios during load testing.

If anybody continues to see high memory usage from the ASG handler, please let us know and kindly provide us with as much information as you have available to assist in identifying any underlying root causes.