aws / aws-sdk-js-v3

Modularized AWS SDK for JavaScript.
Apache License 2.0

Abysmal performance of the SDK #5017

Closed: alfaproject closed this issue 4 months ago

alfaproject commented 1 year ago


Describe the bug

After being pestered for months that we need to update to the latest SDK version, we finally did the work.

Things we noticed right away:

  1. building our project became more than 5 times slower
  2. linting started running out of memory and also became 2 to 3 times slower
  3. unit testing with Jest was a struggle to even run on our existing CI containers; we only got tests to pass by doubling the containers' memory
  4. even with all of the above, we decided to merge to master to measure the application's real-world performance, since we bundle the SDK and the thousands of files TypeScript has to load in the scenarios above should no longer matter at runtime: it's just one JS file with whatever the bundle needs for Node to run. Well, that didn't go well either: a lot of our lambdas started timing out outright. We had a 50ms connection timeout for DynamoDB with the old SDK and had to increase it to 1s with the new SDK, and even then we get the odd timeout

tl;dr I really don't understand why you are pushing people to move from v2 to v3. It's simply not usable. We have 12 billion Lambda invocations per month at the moment, and from our initial tests, if we went to production with this new SDK, we'd have to start paying at least 4 times more. Not to mention the horrible developer experience, since TypeScript struggles massively to do anything productive.

Sadly, I can't share any code from our company but trust me on this. We know what we are doing, it's the exact same code base, just a different SDK, we double checked every single thing that we could. I will use this report for our account manager as well.

SDK version number

@aws-sdk/client-apigatewaymanagementapi@3.377.0, @aws-sdk/client-cloudwatch@3.377.0, @aws-sdk/client-dynamodb@3.377.0, @aws-sdk/client-glue@3.377.0, @aws-sdk/client-kinesis@3.377.0, @aws-sdk/client-lambda@3.377.0, @aws-sdk/client-personalize-events@3.377.0, @aws-sdk/client-personalize-runtime@3.377.0, @aws-sdk/client-quicksight@3.377.0, @aws-sdk/client-rds-data@3.377.0, @aws-sdk/client-s3@3.377.0, @aws-sdk/client-ses@3.377.0, @aws-sdk/client-sns@3.377.0, @aws-sdk/client-sqs@3.377.0, @aws-sdk/client-sts@3.377.0, @aws-sdk/credential-providers@3.377.0, @aws-sdk/lib-dynamodb@3.377.0, @aws-sdk/lib-storage@3.377.0, @aws-sdk/s3-request-presigner@3.377.0, @aws-sdk/types@3.370.0, @aws-sdk/util-dynamodb@3.377.0, @aws-sdk/util-utf8@3.374.0

Which JavaScript Runtime is this issue in?

Node.js

Details of the browser/Node.js/ReactNative version

v16.18.1

Reproduction Steps

Migrate any production-ready software platform from v2 to v3. We use a monorepo, so all of our 400 lambdas live in the same repository, split across different 'micro-services' (CloudFormation stacks).

Observed Behavior

Developer experience is horrible. Everything takes 5 times as long: linting, testing, building, anything where TypeScript has to import from the thousands of files in the @aws-sdk folders.

We could live with the above if the production side, after bundling everything into a small package for Node to run, showed performance improvements. But no, it's just as bad. Somehow even the bundled version uses more memory and CPU in our lambdas, which threw away all of our fine-tuning (memory and timeout configuration for each lambda had to be increased across the board).

Expected Behavior

Same developer experience: any medium or large application should have a linting and testing experience similar to v2.

More importantly, the same or better runtime performance; otherwise, what's the point? Having to increase our lambdas' memory and timeout configuration just to match the user-facing experience we had before is honestly not going to happen. We are not going to increase our Lambda costs overnight just because AWS wants us to update to a framework that simply isn't ready, sorry.

Possible Solution

For developer experience, you could do what most other big SDKs do and concatenate the TS declaration files into a single index.d.ts per package.

But for runtime performance, I actually have no idea what you did that made it so bad compared with v2. You will have to investigate.

Additional Information/Context

I can share anything else that isn't private information from our company and I'm happy to have a video call with any systems architect from AWS via our account manager or otherwise.

kuhe commented 1 year ago

> linting started running out of memory, and again, more than 2 to 3 times slower

Are you linting the SDK's distributed code? The amount of runtime code in the user application would not have changed, so the time your linter spends passing over your source code should not change.

> a lot of our lambdas started to straight timing out

The v3 SDK does not inherently increase the connection time to DynamoDB from 50ms to 1000ms, since both use the node:https module.

How many clients do you initialize per handler execution, and how many requests are made per client before it is garbage-collected? You may want to set the node:https agent's maxSockets value higher and/or reduce the number of client initializations.
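The "reduce client initializations" advice usually means constructing the client once at module scope, so warm Lambda invocations reuse it along with its keep-alive connections. A minimal sketch of the pattern (the stub below stands in for a real DynamoDBClient so the example is self-contained; the counter exists only to illustrate reuse):

```typescript
// Sketch of "initialize once, reuse across invocations". In real code the
// module-scope value would be `new DynamoDBClient({ ... })` from
// @aws-sdk/client-dynamodb; here a stub is used so the sketch is runnable.
type StubClient = { send: (cmd: unknown) => Promise<unknown> };

let constructed = 0; // counts constructions, to show reuse

function newClient(): StubClient {
  constructed++;
  return { send: async (cmd) => cmd };
}

// Module scope: evaluated once per Lambda container, not once per
// invocation, so warm invocations share the client and its sockets.
const client = newClient();

export async function handler(event: unknown): Promise<unknown> {
  // Reuse `client` here rather than constructing a new one per invocation.
  return client.send(event);
}

export function constructionCount(): number {
  return constructed;
}
```

Constructing a fresh client inside the handler instead would pay TLS and connection setup on every invocation, which is one common source of the timeouts described above.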

If any streams are opened by requests, ensure they are closed by reading the data or discarding it. For example, if you make a request to S3::getObject, the response body stream must be read or closed.

> TS just struggles massively to do anything productive

If you are running Jest against TS source code, ensure it uses Babel's transpile-only (no typecheck) transform rather than ts-jest. Also try enabling skipLibCheck in your tsconfig.
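A minimal sketch of that setup (file names and the exact transform are illustrative; @swc/jest is a common alternative to babel-jest):

```typescript
// jest.config.ts (sketch): a transpile-only transform so Jest does not
// type-check on every run (babel-jest requires @babel/preset-typescript).
// Pair this with "skipLibCheck": true in tsconfig.json so tsc skips
// checking the thousands of .d.ts files under node_modules/@aws-sdk.
const config = {
  transform: {
    "^.+\\.tsx?$": "babel-jest", // transpile only; no type checking
  },
};

export default config;
```

Type checking then runs once as a separate `tsc --noEmit` step rather than on every test file.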


Please create a sample repository that demonstrates the issue(s). In my testing, TS compilation time, bundling size, and runtime performance are better in v3. Here is my latest test: https://github.com/kuhe/aws-sdk-js-sample.

alfaproject commented 1 year ago

I get what you're trying to say with that sample, and don't get me wrong, but that's far from a production-like setup unless you use Lambda for fun. We already follow best practices for everything you mentioned, and we use SWC for Jest because even esbuild wasn't cutting it, so imagine.

Perhaps it's better to look at a simple example from real code that we are allowed to share. We keep a VERY THIN layer over the AWS SDK, mostly because we wanted a smooth transition and an easy way to revert if needed, which it was.

Simple DynamoDB Client factory using AWS SDK v2:

```typescript
import { awsRegion } from '@tma/node-env';
import DynamoDb, { DocumentClient } from 'aws-sdk/clients/dynamodb';

import { keepAliveAgent } from '../keep-alive-agent';

let client: DocumentClient | undefined;

export function dynamoDbClient() {
  if (!client) {
    const dynamodbClient = new DynamoDb({
      httpOptions: {
        agent: keepAliveAgent,
        connectTimeout: 50,
        timeout: 5000,
      },
      region: awsRegion(),
      retryDelayOptions: {
        base: 10,
      },
    });

    client = new DocumentClient({ service: dynamodbClient });
  }

  return client;
}
```

That's battle tested, we have billions of DynamoDB connections per month using that factory and VERY RARELY connections time out.

The exact same thing using AWS SDK v3:

```typescript
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient } from '@aws-sdk/lib-dynamodb';
import { NodeHttpHandler } from '@smithy/node-http-handler';

let client: DynamoDBDocumentClient | undefined;

export function dynamoDbClient() {
  if (!client) {
    const dynamodbClient = new DynamoDBClient({
      requestHandler: new NodeHttpHandler({
        connectionTimeout: 1000,
        requestTimeout: 5000,
      }),
    });

    client = DynamoDBDocumentClient.from(dynamodbClient, {
      marshallOptions: {
        convertClassInstanceToMap: true,
        removeUndefinedValues: true,
      },
    });
  }

  return client;
}
```

VERY RARELY a connection doesn't time out. Mind you, this is just one example, but it's as good as any; maybe there's an obvious mistake, but we pretty much followed the migration guide and tried to be careful.

kuhe commented 1 year ago

What version of the JS SDK v2 were you using?

For your JSv3 code, have you experimented with setting the maxSockets and maxAttempts configuration like so?

```typescript
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient } from '@aws-sdk/lib-dynamodb';
import { NodeHttpHandler } from '@smithy/node-http-handler';
import https from 'https';

let client: DynamoDBDocumentClient | undefined;

export function dynamoDbClient() {
  if (!client) {
    const dynamodbClient = new DynamoDBClient({
      maxAttempts: 6, // default is 3
      requestHandler: new NodeHttpHandler({
        connectionTimeout: 1000,
        requestTimeout: 5000,
        httpsAgent: new https.Agent({
          maxSockets: 400, // default is 50; the useful maximum depends on your environment
        }),
      }),
    });

    client = DynamoDBDocumentClient.from(dynamodbClient, {
      marshallOptions: {
        convertClassInstanceToMap: true,
        removeUndefinedValues: true,
      },
    });
  }

  return client;
}
```

Adjust the numbers based on your application's usage pattern.

RanVaknin commented 1 year ago

Hi @alfaproject ,

I tested your code with 10,000 concurrent writes from Lambda, and also with fewer writes of very large, deeply nested items, and was not able to reproduce the connection timeouts you are describing. (Instead I hit throughput limit or Lambda size exceptions, which were resolved by increasing each limit.)

In order to further investigate this and help you on your path to v3, we would like to set up a call with you to discuss the problem in detail.

Setting up a meeting with the SDK team can be done by opening a support ticket through the AWS console. To speed things up, provide a link to this issue; the Technical Account Manager should then be able to reach out to us quickly to schedule some time.

For the meeting to be fruitful, we will need some sort of reproducible code. I know you mentioned you can't share specifics of your actual code, but the example you provided did not reproduce the reported issue, and without a reproduction the SDK team cannot identify the problem and take action on fixing it. With more robust repro code, we could make better use of the meeting and come prepared with observations and suggestions.

Thanks, Ran~

github-actions[bot] commented 1 year ago

This issue has not received a response in 1 week. If you still think there is a problem, please leave a comment to avoid the issue from automatically closing.

alfaproject commented 1 year ago

Yes, there is still a problem, but it took us more than 6 months to make the code changes needed for the migration, so our budget for it is depleted and our roadmap is quite full until the end of the year. I'd still like to find a way, in my own out-of-work free time, to put together a minimal repro, or to get more budget for this upgrade from work. Either way, it's still a problem.

renanwilliam commented 10 months ago

I finished migrating a large service from v2 to v3 yesterday, and we are experiencing several memory overflows and timeouts in Lambda functions that we never had before. It's hard to say precisely what caused them, but it's related to SDK v3. It seems to me that many default client initialization values are no longer what they were in v2.

RanVaknin commented 9 months ago

Hi @renanwilliam,

Can you please open a separate issue, share the actual error message you are seeing and your entire lambda code so we can help you effectively?

Thanks, Ran~

renanwilliam commented 9 months ago

I have investigated our performance issue and found that it is unrelated to SDK v3. Summarizing the investigation:

andreujuanc commented 6 months ago

Let's play a game: guess when we migrated from node16/aws-sdk to node18/@aws-sdk. [image attached]

RanVaknin commented 4 months ago

Hi @renanwilliam ,

I'm happy to hear that this was not due to the SDK.

@andreujuanc, without further details about your specific performance implications it's hard to take action. Please create a separate issue highlighting the differences, the environment from which you are running it, your code, and any other information that would enable us to help you.

Since no reproduction code was given, and the other concerns seem to have been addressed, I'm going to close this issue. If this persists for anyone else, please open a separate issue and provide as much detail as possible so we can better assist you.

All the best, Ran~

andreujuanc commented 4 months ago

> Since no reproduction code was given, and other concerns have seemed to have been addressed Im going to close this issue. If this persists for anyone else, please open a separate issue and provide as much detail as possible so we can better assist you.

We literally just swapped libraries from v2 to v3 and updated from node 16 to 18.

renanwilliam commented 4 months ago

Have you been able to check my comment? It's related to Node 18 memory handling and the --max-old-space-size parameter.

github-actions[bot] commented 4 months ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.