aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

cli: Socket timed out without establishing a connection when --asset-parallelism=true #19930


apoorvmote commented 2 years ago

Describe the bug

I have anywhere between 20 and 50 Node.js Lambda functions in a single stack, and I update their dependencies and deploy with the CDK.

Lately, however, I have not been able to deploy updates. I get the following error when I deploy.

current credentials could not be used to assume 'arn:aws:iam::******:role/cdk-hnb659fds-lookup-role-******-us-east-1', but are for the right account. Proceeding anyway.
(To get rid of this warning, please upgrade to bootstrap version >= 8)
current credentials could not be used to assume 'arn:aws:iam::******:role/cdk-hnb659fds-file-publishing-role-******-us-east-1', but are for the right account. Proceeding anyway.
current credentials could not be used to assume 'arn:aws:iam::******:role/cdk-hnb659fds-file-publishing-role-******-us-east-1', but are for the right account. Proceeding anyway.
[9%] fail: Socket timed out without establishing a connection
[18%] fail: Socket timed out without establishing a connection

I keep retrying; sometimes the deployment goes through, but most of the time it doesn't. Stacks with a smaller number of Lambda functions sometimes deploy, but stacks with a large number of Lambda functions fail 100% of the time.

Expected Behavior

I expected it to deploy regardless of the number of Lambda functions in the stack. It used to deploy without any problem.

Current Behavior

current credentials could not be used to assume 'arn:aws:iam::******:role/cdk-hnb659fds-lookup-role-******-us-east-1', but are for the right account. Proceeding anyway.
(To get rid of this warning, please upgrade to bootstrap version >= 8)

I don't know how to upgrade the bootstrap version. I ran cdk bootstrap multiple times and it reports no changes.
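For reference (an editor's sketch, not part of the original report): re-running cdk bootstrap with a newer CLI is what upgrades the bootstrap stack template, and the currently deployed template version can be read from the stack's BootstrapVersion output. ACCOUNT_ID is a placeholder, and CDKToolkit is the default bootstrap stack name:

```shell
# Re-running bootstrap with a newer CDK CLI upgrades the bootstrap template
cdk bootstrap aws://ACCOUNT_ID/us-east-1

# Check which bootstrap template version is currently deployed
aws cloudformation describe-stacks --stack-name CDKToolkit \
  --query "Stacks[0].Outputs[?OutputKey=='BootstrapVersion'].OutputValue" \
  --output text
```

Both commands require valid AWS credentials for the target account.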

Reproduction Steps

import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { Architecture, Runtime } from 'aws-cdk-lib/aws-lambda';

const testSignUpFn = new NodejsFunction(this, 'testSignUpNodeJS', {
  runtime: Runtime.NODEJS_14_X,
  entry: `${__dirname}/../lambda-fns/sign-up/index.ts`,
  handler: 'signUp',
  architecture: Architecture.ARM_64,
  memorySize: 1024
})

It was working before but suddenly stopped working.

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.20.0 (build 738ef49)

Framework Version

No response

Node.js Version

v16.14.2

OS

Ubuntu 20.04 on WSL 2

Language

Typescript

Language Version

~3.9.7

Other information

No response

corymhall commented 2 years ago

@apoorvmote do you know what changed between when it used to work and now? Did you recently update the version of the CDK?

apoorvmote commented 2 years ago

Of course I regularly update the CDK version as new releases come out. But in my opinion the problem is not with the CDK: AWS appears to be rejecting requests from my specific IP address whenever I do a large update of 30-50 functions. All the other stacks have very small updates, and they always go through.

I also have another CDK project that I run in a Docker development environment, and from it I can deploy over 50 functions without any problem on the same computer. I can run that project in Docker because all of its functions are Node.js functions built with esbuild. The failing project has functions written in Go, and I use Docker to build them; since I can't run Docker inside Docker, I run this project in WSL and deploy normally, and it fails. But if I build it on WSL with Docker and then deploy from the Docker development environment (the Go functions being already built), it deploys.

corymhall commented 2 years ago

@apoorvmote can you try this with a version of the CDK <2.17.0/1.149.0? We added parallel asset publishing starting in those versions and I'm curious if that could be the issue.

github-actions[bot] commented 2 years ago

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

apoorvmote commented 2 years ago

I upgraded to 2.22.0 and the problem suddenly disappeared. If it appears again, I will open another issue.

github-actions[bot] commented 2 years ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

agusavior commented 2 years ago

It happened to me on CDK version 2.31.1 (build 42432c6). I was trying to deploy a CloudFront distribution with an S3 bucket. I don't know how to fix it.

viktorchukhantsev commented 2 years ago

I'm facing this issue regularly now on 2.39.1. When I enable a VPN and deploy again, the error disappears, so it looks like this is somehow related to establishing the connection.

hassaanakram commented 2 years ago

> I'm facing this issue regularly now on 2.39.1. When I enable a VPN and deploy again, the error disappears, so it looks like this is somehow related to establishing the connection.

I'm facing the same issue. I'm able to deploy several stacks that have fewer resources, but my Lambdas aren't going through. A proxy seems to do the trick.

System information: OS: macOS Monterey v12.4, CDK version: 2.39.1

anthony-mills commented 1 year ago

I get the same problem with CDK 2.52.0; it seems related to the IP or connection somehow.

I get socket timeouts while trying to deploy:

[09:54:23] Assuming role failed: Socket timed out without establishing a connection
[09:54:23] Could not assume role in target account using current credentials Socket timed out without establishing a connection . Please make sure that this role exists in the account. If it doesn't exist, (re)-bootstrap the environment with the right '--trust', using the latest version of the CDK CLI.

Switching to another internet connection (phone hotspot or similar) makes the problem go away.

System information: OS: Linux Mint 21 Cinnamon, NodeJS Version: v18.12.1, CDK version: 2.52.0

jscrobinson commented 1 year ago

This does appear to be related to the asset parallelism feature. Executing a deployment with --asset-parallelism=false resulted in a successful deployment.

When running without --asset-parallelism=false, the stack failed with the following error:

Call failed: listObjectsV2({"Bucket":"cdk-hnb659fds-assets-ACCOUNT_ID-eu-west-2","Prefix":"0936406e22fea26017ecca536fcbdc550936406e22fea26017ecca536fcbdc55.zip","MaxKeys":1}) => Socket timed out without establishing a connection (code=TimeoutError)

There are only four assets in the bucket and none of them are over 50KB.

System information: OS: Ubuntu 20.04, NodeJS Version: v16.3.0, CDK version: 2.51.1
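To summarize the workaround for readers skimming the thread (a command sketch using the flags discussed above):

```shell
# Disable parallel asset publishing (the workaround found above)
cdk deploy --asset-parallelism=false

# For multi-stack apps, the flag can be combined with --all
cdk deploy --all --asset-parallelism=false
```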

anthony-mills commented 1 year ago

Thanks @jscrobinson, I can confirm your finding! If I try to deploy normally it fails, but if I try again with the --asset-parallelism=false flag, the deployment succeeds.

Just so happy to have a workaround at the moment that doesn't involve finding a new internet connection. :smiley:

yuri1969 commented 1 year ago

I've encountered the same issue using 2.53.0 (build 7690f43) deploying a fleet of tens of Lambdas with --concurrency=50.

The workaround using --asset-parallelism=false seems to fix that.

hamilton-earthscope commented 1 year ago

We are running into this same issue when using the s3.BucketDeployment construct. Using --asset-parallelism=false fixes it. Thanks for the tip!

oliversalzburg commented 1 year ago

We also have to use the --asset-parallelism=false workaround to be able to deploy at all. With 2.83, a new parallelism feature was introduced to improve performance. Now our deployments are entirely broken, regardless of --asset-parallelism.

In general, a real solution for the underlying issue would be appreciated.

In case it helps, we only see the problematic behavior when deploying from GitHub Actions. If we run the same deploy locally, it completes dramatically faster and without issues. So far, all our research into environment differences has been fruitless.

oliversalzburg commented 1 year ago

We conducted further research into this. It seems that what the CDK calls "parallelism" is just waiting for multiple promises on the same single thread; no work happens in parallel at all. Combine that with the extremely poor single-core performance of the GitHub Actions runner fleet, and you end up with a fully saturated core for the entire runtime of your pipeline, regardless of how many cores you give it.

When I asked AWS reps about this, they told me that using the public runner fleet is a bad choice to begin with. You probably want to invest in some fat self-hosted runner with a single 5GHz core.

I'm pressing our client to move away from CDK ASAP, but we will likely solve this problem with money in the mid-term. This is not a good product.

oliversalzburg commented 1 year ago

Turns out our issue was caused by setting NODE_OPTIONS=--enable-source-maps in our deployment pipeline.

The CDK is compiled into a single 28 MB .js file, accompanied by a 58 MB source map. This causes excessive load, especially due to the high parallelism the CDK uses. I patched out all the unqueued IO processes and replaced all the hardcoded parallelization values with require("os").cpus().length. This resolved our timeouts, and we were able to deploy again.

Soon after, we realized that deployment performance was dramatically improved by upgrading to Node@20, due to a change introduced in Node@19.6. Previously, we ran Node@18 LTS, which was also the highest version supported by the CDK at the time. The change in Node@19.6 introduced caching for parsed source maps, which resolves this whole problem entirely (for us).

I stand by my point that the way CDK handles IO is ridiculous. I also think bundling a NodeJS module into a single 28 MB file, with a 58 MB source map is ridiculous.

As Node@18 is also the latest runtime supported by AWS Lambda, be cautious when using --enable-source-maps at runtime: similar performance issues can be observed there, especially during exception handling.

P.S.: The reason it worked for us locally was that nobody set --enable-source-maps locally, or people were already on Node@20.

tanpenggood commented 1 year ago

😭 😭 😭

I encountered the same problem while deploying the project aws-samples/amazon-codewhisperer-workshop.

I tried using cdk deploy --all --asset-parallelism=false, and the same error was thrown.

Log

> cdk deploy --all

✨  Synthesis time: 7.3s

APIStack:  start: Building ba88964563976f2e7ba608a7bff3e66649cfc355fc656f357ee1cfd4981bc6aa:current_account-ap-southeast-2
APIStack:  success: Built ba88964563976f2e7ba608a7bff3e66649cfc355fc656f357ee1cfd4981bc6aa:current_account-ap-southeast-2
APIStack:  start: Building 3a167ad57f1fe716bf6aaecc1338dfc52e374149f35acd5ad6acba509938ae8d:current_account-ap-southeast-2
APIStack:  success: Built 3a167ad57f1fe716bf6aaecc1338dfc52e374149f35acd5ad6acba509938ae8d:current_account-ap-southeast-2
APIStack:  start: Publishing ba88964563976f2e7ba608a7bff3e66649cfc355fc656f357ee1cfd4981bc6aa:current_account-ap-southeast-2
IntegrationStack:  start: Building 0a7920ffc66926b7d6a37a65e729ce9c41a24a09d17a0be9db60a7c8e789a691:current_account-ap-southeast-2
IntegrationStack:  success: Built 0a7920ffc66926b7d6a37a65e729ce9c41a24a09d17a0be9db60a7c8e789a691:current_account-ap-southeast-2
IntegrationStack:  start: Building bfa23dd275e652257d6dd3b8d94380e2ff57ee161fcc742970e9a67a2268c685:current_account-ap-southeast-2
IntegrationStack:  success: Built bfa23dd275e652257d6dd3b8d94380e2ff57ee161fcc742970e9a67a2268c685:current_account-ap-southeast-2
RekognitionStack:  start: Building 7eff58b160d8d2dfb14b5ecabd9f6625f572f84607bb7941df702ea8198546cb:current_account-ap-southeast-2
RekognitionStack:  success: Built 7eff58b160d8d2dfb14b5ecabd9f6625f572f84607bb7941df702ea8198546cb:current_account-ap-southeast-2
RekognitionStack:  start: Building 2c64aeb833819272233efeec105e713968f201118f64a2a58fd01ffef5bdeca5:current_account-ap-southeast-2
RekognitionStack:  success: Built 2c64aeb833819272233efeec105e713968f201118f64a2a58fd01ffef5bdeca5:current_account-ap-southeast-2
APIStack:  fail: Socket timed out without establishing a connection

 ❌ Deployment failed: Error: Failed to publish asset ba88964563976f2e7ba608a7bff3e66649cfc355fc656f357ee1cfd4981bc6aa:current_account-ap-southeast-2
    at Deployments.publishSingleAsset (/Users/sam/.nvm/versions/node/v20.6.1/lib/node_modules/aws-cdk/lib/index.js:446:11458)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.publishAsset (/Users/sam/.nvm/versions/node/v20.6.1/lib/node_modules/aws-cdk/lib/index.js:446:151474)
    at async /Users/sam/.nvm/versions/node/v20.6.1/lib/node_modules/aws-cdk/lib/index.js:446:136916

Failed to publish asset ba88964563976f2e7ba608a7bff3e66649cfc355fc656f357ee1cfd4981bc6aa:current_account-ap-southeast-2

Env

> node -v
v20.6.1

> cdk --version
2.96.2 (build 3edd240)

> npm -v
9.8.1

> sw_vers
ProductName:    Mac OS X
ProductVersion: 10.15.7
BuildVersion:   19H2026

Solution

I switched the region from ap-southeast-2 to us-west-2 and successfully deployed the application.

😃 😃 😃

juanesmendez commented 1 month ago

I solved this issue by changing the DNS server settings for my Wi-Fi network from automatic to manual. I am now using public DNS servers (Google's 8.8.8.8 and 8.8.4.4, and Cloudflare's 1.1.1.1 and 1.0.0.1) and my issue was solved. Apparently it was an issue with my ISP's default DNS server.

rix0rrr commented 1 month ago

Good find on the source maps @oliversalzburg. I wonder if disabling source maps for the CLI will help everyone.

I'm still mystified about how having a source map could lead to Socket timed out without establishing a connection... but apparently it does?

@sumupitchayan maybe you can dedicate a quick Google search to this error message and see what could be causing it. Also, make sure that our default concurrency settings aren't too insane. And remove the source maps from the CLI?

oliversalzburg commented 4 weeks ago

@rix0rrr In older versions of NodeJS, source maps were not cached; they were re-evaluated every time a call passed through the minified module. Because this happened for every single asset build on a tiny GitHub CI runner on the public fleet, the machine was fully saturated with source map processing and could no longer handle socket communication. At least, that's what I remember about it. The issue went away with Node@20, I believe: they added a cache for the parsed source maps, and it was a whole new world.

I still believe that publishing minified/bundled NodeJS modules is counter-productive on many levels. This could have been avoided entirely.

rix0rrr commented 3 weeks ago

> I still believe that publishing minified/bundled NodeJS modules is counter-productive on many levels. This could have been avoided entirely.

We are not doing that lightly either. It is a lesson learned from real experiences with problematic dependencies in the past, and from concerns about supply-chain attacks in the ecosystem in general. In lieu of a properly supported shrinkwrapping mechanism that works across npm, Yarn, pnpm, and other potential JavaScript package managers, we've decided that bundling is the most reliable way to lock our dependency set to a known-good one. (And if we're bundling, we might as well minify...)

I understand your concerns, but from our PoV it's the lesser of two evils.