alexcasalboni / aws-lambda-power-tuning

AWS Lambda Power Tuning is an open-source tool that can help you visualize and fine-tune the memory/power configuration of Lambda functions. It runs in your own AWS account - powered by AWS Step Functions - and it supports three optimization strategies: cost, speed, and balanced.

Apache License 2.0


Power-tuning involving only cold starts #176

Closed alexcasalboni closed 2 months ago

alexcasalboni commented 2 years ago

The tool could provide an option to power-tune a given function considering only cold start invocations.

The current logic is based on aliases in order to maximize parallelism and optimize overall speed of the power-tuning process. Unfortunately, it also makes it hard to achieve cold-start-only tuning.

We could implement an alternative logic such as:

When parameter forceColdStarts (or onlyColdStarts) is provided:
[Initializer] Do nothing (no alias or version needs to be created)
[Executor] Invoke $LATEST sequentially after forcing a cold start (by updating power config and an env variable)

This should work fine for all values of num and any power value.
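A minimal sketch of what the Executor side of this idea could look like, written in Python rather than the tool's own code. It assumes `client` behaves like `boto3.client("lambda")`; the helper names and the `POWER_TUNING_COLD_START` variable are hypothetical and only illustrate the "update power config and an env variable, then invoke $LATEST" sequence described above.

```python
import json
import time


def invoke_cold(client, function_name: str, memory_mb: int, payload: dict) -> dict:
    """Update $LATEST so the next invocation should start cold, then invoke it once.

    `client` is assumed to expose the same methods as boto3.client("lambda").
    """
    # Read the current variables first: update_function_configuration
    # replaces the whole Environment map instead of merging into it.
    config = client.get_function_configuration(FunctionName=function_name)
    variables = dict(config.get("Environment", {}).get("Variables", {}))

    # Hypothetical marker variable; its value changes on every call.
    variables["POWER_TUNING_COLD_START"] = str(time.time_ns())

    client.update_function_configuration(
        FunctionName=function_name,
        MemorySize=memory_mb,
        Environment={"Variables": variables},
    )
    # Wait until the configuration update has been applied.
    client.get_waiter("function_updated").wait(FunctionName=function_name)

    return client.invoke(
        FunctionName=function_name,
        Qualifier="$LATEST",
        Payload=json.dumps(payload).encode("utf-8"),
    )


def run_sequentially(client, function_name, power_values, num, payload):
    """Run `num` cold invocations per power value, strictly one at a time."""
    results = {}
    for memory_mb in power_values:
        results[memory_mb] = [
            invoke_cold(client, function_name, memory_mb, payload)
            for _ in range(num)
        ]
    return results
```

Nothing here creates versions or aliases, which matches the "Initializer does nothing" part of the proposal. The cost is one configuration update per invocation.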

The only drawback is that nothing can be parallelized, which isn't a big issue as long as each invocation is short enough. For example, if the average cold invocation takes 5 seconds, with num=20 and 5 power values, the overall power-tuning process will take about 8 minutes. It's very easy to reach 40+ minutes with 10s invocations and num>50.
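The estimate above is a simple product of three numbers. As a rough sketch (the function name is made up for illustration, and the time spent updating the configuration between invocations is ignored):

```python
def sequential_tuning_seconds(avg_invocation_s: float, num: int, power_values: int) -> float:
    """Total invocation time when every cold invocation runs one after another."""
    return avg_invocation_s * num * power_values


# 5 s per cold invocation, num=20, 5 power values
print(round(sequential_tuning_seconds(5, 20, 5) / 60, 1))   # 8.3 (minutes)

# 10 s per cold invocation, num=50, 5 power values
print(round(sequential_tuning_seconds(10, 50, 5) / 60, 1))  # 41.7 (minutes)
```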

alexcasalboni commented 2 years ago

This is somewhat related to the (closed) issue #123.

@Parro what do you think of this approach?

Twitter thread with Paul Johnston for reference: https://twitter.com/alex_casalboni/status/1556585120332759040

Parro commented 2 years ago

I was thinking of a different approach: in the Initializer step, we could create a set of different versions of the lambda with the same code, so that every invocation of a distinct version spawns a new process with its own cold start. This way we could still use parallelization, and even add a new step to the state machine so that the report includes the statistics for both Duration and Init Duration.

What do you think about it?

alexcasalboni commented 2 years ago

@Parro yes, that's what Paul proposed too.

Let's double-check the Lambda quotas :)

Is there any limit regarding the # of aliases per function? Or any API rate-limiting when creating new versions and aliases? I never encountered any limitations since we only create one version/alias per power value.

Let's assume there are no such limitations.

With x power values and num invocations, we'll need to create x * num versions and aliases, so that we can invoke them in parallel. I often run the tool with 5 power values and num=50 (or even 100), so that means 250+ versions and aliases created during initialization.

I'd agree this mechanism is better for the overall execution time, even if the initialization phase will take much longer. As far as I can remember, the creation of new versions/aliases cannot be parallelized. Initializing 4-5 versions currently takes 7-8 seconds. With num=50 it will take more than 6 minutes.
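A back-of-the-envelope version of these numbers, assuming creation time scales linearly and taking the midpoint of the figures above (7.5 seconds for 4.5 versions) as the per-version cost. The function names are illustrative only.

```python
def versions_needed(power_values: int, num: int) -> int:
    """One version/alias per invocation and per power value."""
    return power_values * num


def init_seconds(versions: int, seconds_per_version: float) -> float:
    """Initialization time if versions are created one after another."""
    return versions * seconds_per_version


# Midpoint of "4-5 versions in 7-8 seconds" (an assumption for this sketch)
seconds_per_version = 7.5 / 4.5

count = versions_needed(power_values=5, num=50)
print(count)                                                  # 250
print(round(init_seconds(count, seconds_per_version) / 60, 1))  # 6.9 (minutes)
```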

Parro commented 2 years ago

> Is there any limit regarding the # of aliases per function?

The only limit I am aware of is the code storage quota of 75 GB. In an account with few lambdas it should not be a problem; in an account with dozens of them we could hit the limit... and of course it depends on the size of the lambda under test. We could state clearly in the documentation that the test will use lambdaSize * powerValues * num of storage.
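To make the storage formula concrete, here is a small sketch. The 50 MB package size is a made-up example, and the calculation assumes every published version counts its full package size against the quota.

```python
CODE_STORAGE_QUOTA_GB = 75  # quota mentioned in this thread


def storage_needed_gb(package_mb: float, power_values: int, num: int) -> float:
    """lambdaSize * powerValues * num, converted from MB to GB."""
    return package_mb * power_values * num / 1024


# Hypothetical 50 MB deployment package, 5 power values, num=50
needed = storage_needed_gb(50, 5, 50)
print(f"{needed:.1f} GB of {CODE_STORAGE_QUOTA_GB} GB")  # 12.2 GB of 75 GB
```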

> As far as I can remember, the creation of new versions/aliases cannot be parallelized.

It's a lambda limitation? Even if we use a map step in the state machine it would fail?

Anyway, even if the initialization time is long, we could state this in the docs to warn the user.

alexcasalboni commented 2 years ago

> It's a lambda limitation? Even if we use a map step in the state machine it would fail?

Yes, because you're always working on $LATEST when creating new versions and aliases.

I've just implemented a first iteration of this (both initializer and cleaner logic). I'm going to run some tests and share the WIP code in a new PR later today.

alexcasalboni commented 2 years ago

@Parro it works :) Check out the PR #177

ryancormack commented 10 months ago

Hey @alexcasalboni @Parro I was wondering if there's any movement on this? I noticed a few open PRs that seem to be working, but not much recent activity. Is there anything that I could help with if there are some rough edges that need a hand?

If there's 1 PR that might be the direction this moves in (if indeed it is planned to move forward with this feature), then I could just clone that version and deploy that short term.

Thanks

alexcasalboni commented 10 months ago

@ryancormack thanks for checking :) yes, we're definitely moving forward to find the ideal solution for this!

Currently, there are two open PRs using different approaches:

https://github.com/alexcasalboni/aws-lambda-power-tuning/pull/177 (a bit old at this point) is creating num new versions/aliases for each memory configuration, which does work but it's a bit extreme in the amount of overhead it creates and # of API calls - you easily end up creating and destroying hundreds of versions/aliases, and it also creates an upper bound of approximately 500 versions/aliases we can create in 15min (reducing the number of configurations * invocations you can test)
https://github.com/alexcasalboni/aws-lambda-power-tuning/pull/206 is moving the version/alias creation into a state machine loop, therefore removing the above constraint, at the expense of making the state machine more complex and expensive to run (for all use cases, not only cold starts)
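To illustrate the constraint of the first approach: if roughly 500 versions/aliases is the ceiling, the largest usable num shrinks as power values are added. A quick sketch (the 500 figure is the approximate one quoted above, and the function name is illustrative):

```python
def max_num(power_values: int, version_cap: int = 500) -> int:
    """Largest num such that power_values * num stays within the cap."""
    return version_cap // power_values


print(max_num(5))   # 100
print(max_num(10))  # 50
```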

That said, I'm quite sure the second PR is closer to the direction we'll choose and I'd recommend you clone that version for the time being. I'm currently working with a few colleagues at AWS to speed up the maintenance of this tool, so I'd expect we'll settle on a final solution in the next 30-60 days 🚀

ryancormack commented 10 months ago

Thanks Alex, it's working mostly well. I've created that issue above - I know it was slightly mentioned way up this issue, but I don't know if it's more nuanced and neither PR currently accounts for it.

The tool was super helpful in actually spawning a huge number of cold starts for me, and I could use CloudWatch Log Queries to get the 'end user latency' times that I was hoping to get.

alexcasalboni commented 7 months ago

Quick update on this: we're continuing our work on #206 - it turns out that approach is also useful to solve a SnapStart-related problem.

Apologies for the delay, we should be able to finalize the current implementation in a matter of weeks.

TonySherman commented 4 months ago

Any progress on this feature? I have some use cases for profiling cold starts only.

alexcasalboni commented 2 months ago

hi all 👋 apologies for the very long wait :)

#206 is finally merged 🚀

It's not on SAR yet, but you can deploy the latest via CLI.

Closing this issue, but let us know if you encounter any problems.