add waiter to executor to make sure Alias is active

alexcasalboni / aws-lambda-power-tuning

AWS Lambda Power Tuning is an open-source tool that can help you visualize and fine-tune the memory/power configuration of Lambda functions. It runs in your own AWS account - powered by AWS Step Functions - and it supports three optimization strategies: cost, speed, and balanced.

Apache License 2.0

5.41k stars, 373 forks

add waiter to executor to make sure Alias is active #190

Closed: atennak1 closed this 1 year ago

atennak1 commented 1 year ago

While benchmarking Lambda's new SnapStart feature, I noticed that aliases take a while to become active. Without this change, the executor step fails with an error along the lines of "invalid function state pending". In my testing, it takes upwards of 2 minutes for a power factor alias to become active when SnapStart is enabled. I'm guessing this is because Lambda needs to take a VM snapshot of every new execution environment (one per power factor).

I tested this with unit tests and empirically against a SnapStart-enabled Lambda function.
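
For readers unfamiliar with the pattern, here is a minimal sketch of a polling waiter of the kind described above. It is not the code from this PR: `waitUntilActive`, `getState`, `delayMs`, and the default values are hypothetical names chosen for illustration, and `getState` is assumed to return the function state (`Pending`, `Active`, `Failed`, ...) reported by Lambda for the alias or version being checked.

```javascript
// Hypothetical helper (not from the PR): poll until the alias/version is Active.
// `getState` is an injected async function, assumed to return the Lambda
// function state as a string, e.g. "Pending", "Active", or "Failed".
async function waitUntilActive(getState, { delayMs = 5000, maxAttempts = 24 } = {}) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const state = await getState();
        if (state === 'Active') {
            return attempt; // number of polls it took to see an Active state
        }
        if (state === 'Failed') {
            throw new Error('Function entered the Failed state');
        }
        // Sleep between polls, but not after the final attempt.
        if (attempt < maxAttempts) {
            await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
    }
    throw new Error(`Function not active after ${maxAttempts} attempts`);
}
```

Injecting `getState` keeps the sketch independent of any particular SDK version; in the executor it would presumably wrap whatever call returns the function configuration for the alias.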

alexcasalboni commented 1 year ago

Thanks for submitting this @atennak1 🙏

I'm having a look at the code and sharing a couple of thoughts.

alexcasalboni commented 1 year ago

This looks great 🎉

Just one line to be tested: https://coveralls.io/builds/57227309/source?filename=lambda%2Futils.js#L166

I'll have another look and do some testing by Monday :)

alexcasalboni commented 1 year ago

I've run a few tests and it can easily take a couple of minutes for a new version to be ready, so I would increase the waiter total timeout. Say, from 5*24=120 seconds (2 minutes) to 10*24=240 seconds (4 minutes).

I'm checking with the Lambda team and applying this change myself asap, no action needed.

alexcasalboni commented 1 year ago

Quick update:

There are two components to the wait time: time for Lambda to create the snapshot and duration of the customer’s initialization code. The former is generally under five minutes, but occasionally can exceed that. The initialization code duration is customer-dependent and can run for up to 15 minutes.

Based on this feedback, I'm afraid the best approach would be to move the waiter into the state machine, between the Initializer and the Executor (instead of within the Executor), to avoid paying for idle time and potentially timing out the Executor.

But that sounds too complex - and not worth the cost of the additional state transitions for the majority of customers who aren't using SnapStart. So I think we'll be fine with increasing maxAttempts to cover the maximum invocation time for the Executor function (10*90=900 seconds).
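
The timeout arithmetic in this thread can be checked with a small sketch. It assumes the first factor in each product is the delay between polls in seconds and the second is `maxAttempts`; the thread does not spell this out, and the function and constant names below are hypothetical. The 900-second figure is Lambda's maximum function timeout (15 minutes).

```javascript
// Lambda's maximum function timeout: 15 minutes.
const EXECUTOR_MAX_TIMEOUT_SECONDS = 900;

// Total time a waiter may spend polling, assuming a fixed delay (in seconds)
// between attempts. Both field names are assumptions made for this sketch.
function waiterBudgetSeconds({ delay, maxAttempts }) {
    return delay * maxAttempts;
}

// True when the waiter can keep polling for the Executor's whole lifetime.
function coversExecutorTimeout(waiterConfig) {
    return waiterBudgetSeconds(waiterConfig) >= EXECUTOR_MAX_TIMEOUT_SECONDS;
}

// The three configurations discussed in the thread.
const configs = [
    { delay: 5, maxAttempts: 24 },  // original PR
    { delay: 10, maxAttempts: 24 }, // first proposed increase
    { delay: 10, maxAttempts: 90 }, // final proposal
];

for (const config of configs) {
    console.log(config, waiterBudgetSeconds(config), coversExecutorTimeout(config));
}
```

Under that assumption the budgets are 120, 240, and 900 seconds, and only the last configuration spans the full Executor timeout.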