aws / aws-sdk-js-v3

Modularized AWS SDK for JavaScript.
Apache License 2.0

Coldstart Performance/Bundle Size Regression in 3.577.0 #6144

Closed · perpil closed this issue 4 days ago

perpil commented 3 weeks ago


Describe the bug

I'm seeing a larger-than-normal regression in coldstart performance and tree-shaken bundle size after 3.575. The table below shows coldstart latency and esbuild-built bundle size by SDK version. It's a small sample size (3 runs), but I'm seeing +24 ms in coldstart time and +146 KB in bundle size.


| aws sdk | node version | avg(@initDuration) | bundleSize | # runs |
|---------|--------------|--------------------|------------|--------|
| 3.575.0 | v20.13.1 | 246.79 | 316588 | 3 |
| 3.577.0 | v20.13.1 | 270.0433 | 463023 | 3 |
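
For context, one way numbers like these can be collected (a sketch, not necessarily the exact harness used in this issue) is a CloudWatch Logs Insights query over the Lambda REPORT lines, since @initDuration only appears on cold starts. The log group name below is a placeholder:

```ts
import {
  CloudWatchLogsClient,
  StartQueryCommand,
  GetQueryResultsCommand,
} from "@aws-sdk/client-cloudwatch-logs";

const logs = new CloudWatchLogsClient({});
const logGroupName = "/aws/lambda/my-repro-fn"; // hypothetical function name

// Average init duration across cold starts in the last hour.
const { queryId } = await logs.send(
  new StartQueryCommand({
    logGroupName,
    startTime: Math.floor(Date.now() / 1000) - 3600,
    endTime: Math.floor(Date.now() / 1000),
    queryString: `filter @type = "REPORT" and ispresent(@initDuration) | stats avg(@initDuration), count(*)`,
  })
);

// Simplified polling: wait briefly, then fetch results.
await new Promise((resolve) => setTimeout(resolve, 2000));
const results = await logs.send(new GetQueryResultsCommand({ queryId }));
console.log(JSON.stringify(results.results, null, 2));
```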

SDK version number

@aws-sdk/client-dynamodb@3.577.0

Which JavaScript Runtime is this issue in?

Node.js

Details of the browser/Node.js/ReactNative version

v20.13.1

Reproduction Steps

Read the README in this Repro

Observed Behavior

Increased coldstart times and bundle size.

Expected Behavior

Minimal impact to coldstart and bundle size on tree-shaken code.

Possible Solution

Looking at the bundle analyzer, the biggest size impact comes from the @smithy/* packages, but it also looks like fast-xml-parser appeared too; I haven't dug much beyond that.

(Bundle analyzer screenshots: 3.575 repro and 3.577 repro.)

Additional Information/Context

In the app I upgraded, I saw a bigger impact on coldstart performance (>90 ms) between 3.563 and 3.577. However, I was using more clients: client-sts, client-dynamodb, client-iam, and lib-dynamodb. To keep things simple, I did the repro with just client-dynamodb.

mymattcarroll commented 3 weeks ago

The size increase in the following packages has pushed our Lambda deployment package size over the hard limit of 262144000 bytes:

perpil commented 3 weeks ago

@mymattcarroll Although there was a bump in size, I was seeing it on the order of kilobytes, not megabytes. I limited my repro to one package, though. Were you close to the 250 MB limit before? Do you have size numbers for 3.575 versus 3.577 or later? If you can drop some numbers in here, that may help determine whether it is the same cause or a different one.

kuhe commented 3 weeks ago

You should mark fast-xml-parser as external if you don't need an XML-based client.

Another 10 KB, which we (the SDK team) should fix, comes from the inclusion of tslib in 3.576.0; it can be removed.

After these fixes, the difference should only be 134.6 KB -> 140.3 KB.

test code:

```sh
npm init -y
npm i @aws-sdk/client-dynamodb@3.577.0 esbuild --save-exact
echo "import { DynamoDBClient, ListTablesCommand } from \"@aws-sdk/client-dynamodb\";\
      \
      export { DynamoDBClient, ListTablesCommand };" > index.ts
npx esbuild --bundle index.ts --outfile=dist/bundle.js --minify --analyze --format=esm --main-fields=module,main --platform=node \
    --external:@aws-sdk/client-sts \
    --external:@aws-sdk/client-sso* \
    --external:@aws-sdk/credential-prov* \
    --external:@aws-sdk/token-providers \
    --external:fast-xml-parser
npm i @aws-sdk/client-dynamodb@3.575.0 --save-exact
npx esbuild --bundle index.ts --outfile=dist/bundle.js --minify --analyze --format=esm --main-fields=module,main --platform=node \
    --external:@aws-sdk/client-sts \
    --external:@aws-sdk/client-sso* \
    --external:@aws-sdk/credential-prov* \
    --external:@aws-sdk/token-providers \
    --external:fast-xml-parser
```

kuhe commented 3 weeks ago

When we publish the fix in #6149 you shouldn't have to do anything differently. Taking the dist-es (ESM, tree-shakeable) code for bundling should avoid importing both tslib and fast-xml-parser.

mymattcarroll commented 3 weeks ago

Turns out our problem was package-lock.json containing duplicate versions of the AWS SDK packages. Regenerating our package-lock.json fixed our issue.

Credit to @divporter for teaching me about the npm dedupe command.

perpil commented 3 weeks ago

I just tried 3.588. I'm seeing better coldstart times, but still 50 KB of bloat. One thing I noticed was that @smithy/core was bundled as CJS instead of ESM. Here's the breakdown of where the size increases are coming from, if it helps:

Delta 3.575 -> 3.588: @smithy +56.1 KB
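
For reference, one way to get a per-package breakdown like this (a sketch, not necessarily the tooling used here) is esbuild's metafile plus analyzeMetafile, run over the same entry point as the repro:

```ts
import * as esbuild from "esbuild";

// Bundle the repro entry point and capture a metafile describing what went in.
const result = await esbuild.build({
  entryPoints: ["index.ts"],
  bundle: true,
  minify: true,
  format: "esm",
  platform: "node",
  mainFields: ["module", "main"],
  outfile: "dist/bundle.js",
  metafile: true,
  external: [
    "@aws-sdk/client-sts",
    "@aws-sdk/client-sso*",
    "@aws-sdk/credential-prov*",
    "@aws-sdk/token-providers",
  ],
});

// Human-readable size report grouped by input file.
console.log(await esbuild.analyzeMetafile(result.metafile));

// Or sum bundled bytes per top-level package (e.g. @smithy/*) yourself.
const perPackage = new Map<string, number>();
for (const output of Object.values(result.metafile.outputs)) {
  for (const [input, { bytesInOutput }] of Object.entries(output.inputs)) {
    const match = input.match(/node_modules\/((?:@[^/]+\/)?[^/]+)/);
    const pkg = match ? match[1] : "(app code)";
    perPackage.set(pkg, (perPackage.get(pkg) ?? 0) + bytesInOutput);
  }
}
console.log([...perPackage.entries()].sort((a, b) => b[1] - a[1]));
```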

kuhe commented 3 weeks ago

@siddsriv is making the same change in @smithy/core https://github.com/smithy-lang/smithy-typescript/pull/1297

perpil commented 2 weeks ago

Numbers for 3.590 are much more in line with 3.575. I think we can close this now. Thanks for all your help here!


| aws sdk | node version | avg(@initDuration) | bundleSize | # runs |
|---------|--------------|--------------------|------------|--------|
| 3.575.0 | v20.13.1 | 246.79 | 316588 | 3 |
| 3.577.0 | v20.13.1 | 270.0433 | 463023 | 3 |
| 3.590.0 | v20.13.1 | 241.88 | 318138 | 1 |

perpil commented 2 weeks ago

Actually, hold off on closing this. Although I saw better performance and size in the repro, I still saw poor performance when I updated my prod code to 3.590.0. I had to manually pin @smithy/core to 2.2.0 to get the proper bundle size, and I still wasn't seeing the same performance. Let me dig in a bit.

perpil commented 2 weeks ago

3.592 picks up the @smithy/core fix above and resolves the bundle size issue I was having in my prod code (not the repro).

Here's some data from my production API. The best performance for me was with 3.568; I'm still seeing about a 50 ms coldstart delta and I'm still digging into it.


| @awsSdk | node | runtime | avg(initDuration) | avg(secrets loading) | max(bundleSize) |
|---------|------|---------|-------------------|----------------------|-----------------|
| 3.590.0 | v20.13.1 | 2024-05-13T12:21:54.000Z | 294.255 | 136.25 | 1122060 |
| 3.592.0 | v20.13.1 | 2024-05-13T12:21:54.000Z | 299.41 | 134.875 | 840024 |
| 3.568.0 | v20.13.1 | 2024-05-13T12:21:54.000Z | 243.1 | 136 | 837877 |

perpil commented 2 weeks ago

I've updated the repro to include more clients: client-sts, client-dynamodb, lib-dynamodb, and client-iam. The performance difference between 3.575 and 3.592 is negligible, but you can see the 50 ms difference between 3.568 and 3.592 clearly here.

| aws sdk | node version | avg(@initDuration) | bundleSize | # runs |
|---------|--------------|--------------------|------------|--------|
| 3.568.0 | v20.13.1 | 213.5387 | 496644 | 8 |
| 3.575.0 | v20.13.1 | 269.3038 | 504269 | 8 |
| 3.592.0 | v20.13.1 | 265.4675 | 504954 | 8 |

dreamorosi commented 2 weeks ago

50 ms on these numbers is 25% slower; I'm not sure I agree with the negligible part.

It would be great to understand what the trade-off was for adding this, and whether there's anything at all we can do to reverse it besides pinning to 3.575.

perpil commented 2 weeks ago

Hey @dreamorosi, the negligible difference is between 3.575 and 3.592. However, I then compared 3.568 to 3.592 and saw the 50 ms difference when I replicated what I saw in my production code. I agree that 50 ms is a big regression, and you should be able to replicate my results with the code I've currently checked into the repo. I looked at a few things to see if I could identify the source and found some tiny things (shared-ini-loader and one other package I'm blanking on), but nothing that explains the 50 ms. The source of the latency no longer seems to be related to code size, because I tried minification and didn't see much difference.

perpil commented 2 weeks ago

Alright, here are some more benchmarks with my repro. It looks like the performance bombed in 3.575. I'd recommend taking a close look at https://github.com/aws/aws-sdk-js-v3/pull/6088

| aws sdk | node version | avg(@initDuration) | bundleSize | # runs |
|---------|--------------|--------------------|------------|--------|
| 3.568.0 | v20.13.1 | 206.165 | 211662 | 2 |
| 3.569.0 | v20.13.1 | 211.84 | 211837 | 2 |
| 3.572.0 | v20.13.1 | 214.2233 | 211837 | 3 |
| 3.574.0 | v20.13.1 | 206.9233 | 211837 | 3 |
| 3.575.0 | v20.13.1 | 269.46 | 214626 | 2 |
| 3.592.0 | v20.13.1 | 270.725 | 214940 | 2 |

kuhe commented 2 weeks ago

Could you create a repro outside of Lambda that isolates the issue?

I have run the CDK example and the initDuration for v3.592.0 is in the 210-220 range. If there is an issue with the SDK it should be demonstrable in Node.js outside of Lambda.

Add a performance.now() mark at the top of the handler file and another at the top of the handler function; the difference should correspond to the cold start initialization duration. I did this, and no difference in init duration was shown. I also looked at the diff between the two bundled files and there wasn't anything that would take 50 ms.

perpil commented 2 weeks ago

I'll see what I can do. Just to confirm, you are comparing 3.574 to 3.592 and initializing multiple clients (like the repro) and you don't see much difference?

kuhe commented 2 weeks ago

Yes, I used the provided repository and ran both those versions many times using the provided execution command examples in the readme.

I also tested the generated bundles themselves with additional timing reports, edited by hand, but those consisted merely of

```js
const A = performance.now(); // top of file
...
const B = performance.now(); // top of handler function
...
console.log("init time", B - A); // before end of handler function
```

The contributing factors to initialization time in AWS Lambda are not entirely related to the user code, and fluctuate over time due to fleet provisioning. I don't know the details, so I'd like to remove that variance.
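
For example, one way to demonstrate this in plain Node.js (a sketch, not the thread's actual harness; the bundle path is assumed from the esbuild commands above) is to time evaluation of the built bundle directly:

```ts
// Tiny ES module script: measures how long module evaluation of the bundle takes,
// which is the part of Lambda init that the SDK code contributes to.
const start = performance.now();
await import("./dist/bundle.js"); // assumed output of the esbuild step above
const elapsed = performance.now() - start;
console.log(`module evaluation took ${elapsed.toFixed(2)} ms`);
```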

jaapvanblaaderen commented 2 weeks ago

Not sure if I'm running into the same issue but I saw something similar when upgrading to a newer SDK version and found this issue. The size of my Lambda function increased from 3.8mb to 6.1mb. I also use the built in CDK NodejsFunction which bundles with esbuild (CJS, not minifying the code).

When analyzing the bundle, it seems that @smithy/... packages are now part of every AWS SDK package and responsible for the majority of the size.

AWS SDK 3.556.0 (bundle analysis screenshot)

AWS SDK 3.592.0 (bundle analysis screenshot)
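
As one possible mitigation for the CDK NodejsFunction case (a sketch building on kuhe's earlier advice to externalize unused packages, not something prescribed in this thread; the construct id and entry path are illustrative), the bundling options can minify the output and mark packages as external:

```ts
import { Duration } from "aws-cdk-lib";
import { Runtime } from "aws-cdk-lib/aws-lambda";
import { NodejsFunction, OutputFormat } from "aws-cdk-lib/aws-lambda-nodejs";
import { Construct } from "constructs";

// Illustrative wiring; adjust to your stack. ESM output may need extra config
// depending on your dependencies.
export function addHandler(scope: Construct): NodejsFunction {
  return new NodejsFunction(scope, "MyHandler", {
    entry: "src/handler.ts",
    runtime: Runtime.NODEJS_20_X,
    timeout: Duration.seconds(10),
    bundling: {
      minify: true,                   // shrink the deployed bundle
      format: OutputFormat.ESM,       // tree-shakeable output
      mainFields: ["module", "main"], // prefer the dist-es entry points
      externalModules: [
        "@aws-sdk/client-sts",        // externalize only packages that are unused
        "fast-xml-parser",            // or provided by the runtime
      ],
    },
  });
}
```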

perpil commented 1 week ago

@jaapvanblaaderen It could potentially be related. I was seeing something similar with esm modules until they made a change to the packaging.

@kuhe Regarding the performance.now() benchmarking: I haven't been able to isolate this from Lambda yet, but I do have some new performance benchmarks where I've wrapped the imports outside of the handler with performance.now(). The @initTime column in the timings below shows that, but I haven't yet updated the repro code since I'm OOTO today. What is interesting is that @initDuration and @initTime are consistently better for 3.574, and the percentage difference between the two SDKs is consistent for both performance.now and @initDuration. I did take a look at the diff between the actual bundled code, and there wasn't much difference between 3.574 and 3.575. I'll see if I can isolate exactly what code is causing this performance regression.

| @awsSdk | @node | avg(initDuration) | max(@bundleSize) | runs | avg(@initTime) | min(@initTime) | max(@initTime) | pct(@initTime,90) |
|---------|-------|-------------------|------------------|------|----------------|----------------|----------------|-------------------|
| 3.574.0 | v20.13.1 | 213.1845 | 211917 | 11 | 7.0253 | 6.646 | 8.3802 | 7.2597 |
| 3.575.0 | v20.13.1 | 264.9182 | 214706 | 11 | 8.1337 | 7.5145 | 10.1601 | 8.8453 |

perpil commented 1 week ago

I've narrowed it down to the exact line that got added in 3.575.

If I patch that line from:

```js
export const keepAliveSupport = {
  supported: Boolean(typeof Request !== "undefined" && "keepalive" in new Request("https://[::1]")),
};
```

To:

```js
export const keepAliveSupport = {
  supported: true,
};
```

Then the 50 ms latency hit disappears in 3.575. Wild guess: maybe DNS is getting involved and there is a 50 ms timeout?

richarddd commented 1 week ago

> I've narrowed it down to the exact line that got added in 3.575.
>
> If I patch that line from:
>
> ```js
> export const keepAliveSupport = {
>   supported: Boolean(typeof Request !== "undefined" && "keepalive" in new Request("https://[::1]")),
> };
> ```
>
> To:
>
> ```js
> export const keepAliveSupport = {
>   supported: true,
> };
> ```
>
> Then the 50 ms latency hit disappears in 3.575. Wild guess: maybe DNS is getting involved and there is a 50 ms timeout?

Hi @perpil. How did you measure/narrow it down to that specific line? This should only be relevant in Node 16, since Node 18+ has the Request object in global scope, so keepAlive will be true. This change was also added 9 months ago. Test here:

```js
export const handler = async (event) => {
    const response = {
        statusCode: 200,
        body: JSON.stringify({
            "keepAlive": Boolean(typeof Request !== "undefined" && "keepalive" in new Request("https://[::1]"))
        }),
    };
    return response;
};
```

You also need to test initDuration + duration in order to evaluate performance. Code outside the handler contributes to initDuration, and code inside contributes to duration.

kuhe commented 1 week ago

Thanks @perpil, that's a great help finding the most likely location of the problem. I will investigate how to fix it.

perpil commented 1 week ago

@richarddd the Smithy dependencies were upgraded in 3.575, so this new code got pulled in. When you bundle using the repro further up in the thread, you'll see these lines in the generated index.mjs output, which weren't there in 3.574:

```js
// node_modules/@smithy/fetch-http-handler/dist-es/fetch-http-handler.js
var keepAliveSupport = {
  supported: Boolean(typeof Request !== "undefined" && "keepalive" in new Request("https://[::1]"))
};
```

You'll also note that there are no references to that variable in the generated output, so it isn't necessary in the Node environment. When I remove that code, or hardcode it to true, I no longer see the 50 ms hit to @initDuration (it never showed up in @duration because it is loaded outside of the handler). If you want to do an accurate repro with your code, you should move the code outside the handler, compare it to not using Request at all, and look at the resulting @initDuration. As I understand it, fetch-http-handler is only supposed to be used in browser environments, and node-http-handler is meant to be used when running in Lambda.
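
For readers who want to be explicit about which handler a Node.js/Lambda client uses, a sketch of passing the Node HTTP handler directly is below. This is already the default for Node.js clients, so it is about being explicit rather than a fix from this thread, and the timeout value is illustrative:

```ts
import { DynamoDBClient, ListTablesCommand } from "@aws-sdk/client-dynamodb";
import { NodeHttpHandler } from "@smithy/node-http-handler";

// Explicitly configure the Node HTTP handler (the default in Node.js),
// rather than relying on anything fetch-based.
const client = new DynamoDBClient({
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 1000, // illustrative value
  }),
});

export const handler = async () => {
  const tables = await client.send(new ListTablesCommand({}));
  return { statusCode: 200, body: JSON.stringify(tables.TableNames ?? []) };
};
```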

@kuhe Thanks for looking into it. I think there are a couple of ways of solving it, so reach out if you need help. One thing you mentioned was to try using performance.now() to wrap the code outside the handler for the benchmark. The issue with this approach is that, when it is bundled, some of the code in index.mjs is actually executed before any of the imports, so you miss some of the latency. In the resulting bundle you get something like the following, so @initDuration is what you should be using for an accurate benchmark:

```js
var keepAliveSupport = ...
var start = performance.now();
...other code
var end = performance.now()
```

kuhe commented 1 week ago

If you update your install lockfile to pick up @smithy/fetch-http-handler@3.0.3, it should no longer perform the Request init in module scope.

The reason fetch-http-handler appears now is that we recently added preliminary support for fetch in Node.js, i.e. using FetchHttpHandler in Node.js. The stream collector mixin now brings in some cross-over functionality from the fetch-http-handler package.
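
Conceptually, the fix amounts to deferring the feature probe until it is actually needed instead of running it at module scope. A rough sketch of that pattern (not the actual @smithy/fetch-http-handler source):

```ts
// Module scope: no Request is constructed at import time, so bundles that never
// call checkKeepAliveSupport() pay nothing extra at cold start.
export const keepAliveSupport: { supported?: boolean } = {
  supported: undefined,
};

// Called lazily, e.g. before the first fetch-based request is sent.
export function checkKeepAliveSupport(): boolean {
  if (keepAliveSupport.supported === undefined) {
    keepAliveSupport.supported = Boolean(
      typeof Request !== "undefined" && "keepalive" in new Request("https://[::1]")
    );
  }
  return keepAliveSupport.supported;
}
```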

richarddd commented 1 week ago

@perpil Sorry for the misunderstanding; I can now reproduce the issue. Great find! The reason for this regression is that when the Request class is used, it's actually lazily loaded and pulls in a lot of dependencies (undici etc.) that add 40-50 ms to cold starts. @kuhe a repro can be done by simply doing:

```js
const a = Request;

export const handler = async (event) => {
    const response = {
        statusCode: 200,
        body: JSON.stringify("slow"),
    };
    return response;
};
```

Comment out the first line and we're back to normal. Request further pulls in more lazily evaluated classes such as URL/Stream etc. when instantiated.

Source: https://github.com/nodejs/node/blob/f2381bc7cb120e43bb97ffeccd19dfdc946c8b2c/lib/internal/bootstrap/web/exposed-window-or-worker.js#L86 (basically a property that does require)
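
As a standalone illustration of that lazy-load cost (a sketch; the numbers will vary by machine and Node version and are not results from this thread), one can time the first touch of the global Request outside of Lambda:

```ts
// First access to the global Request triggers Node's lazy loading of its fetch
// implementation (undici); subsequent constructions are much cheaper.
const t0 = performance.now();
const supported =
  typeof Request !== "undefined" && "keepalive" in new Request("https://[::1]");
const t1 = performance.now();

const t2 = performance.now();
const again = "keepalive" in new Request("https://[::1]");
const t3 = performance.now();

console.log({ supported, firstTouchMs: t1 - t0, secondTouchMs: t3 - t2, again });
```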

perpil commented 1 week ago

@kuhe I can confirm SDK 3.598 with @smithy/fetch-http-handler 3.0.3 no longer has the coldstart issue. The Request code is tree-shaken from my bundle now that it isn't in module scope. Thanks so much! 🚀