(cli): cdk deploy "fail: socket hang up" error on Big Sur w/ AnyConnect

whiskeylover commented 3 years ago

Deployment using cdk deploy fails with fail: socket hang up. This started happening after upgrading to Big Sur (macOS). I opened a support ticket with AWS, and they directed me here.

I have upgraded npm and node and aws utils to the latest versions as asked by the AWS support.

Reproduction Steps

cdk deploy

Output: -

[0%] start: Publishing 53bc1c8ff3460cde96f053e9ef268efbccac51185d8c46bf5acb064ff9afdf1c:current
[0%] check: Check s3://[bucket-name]/assets/53bc1c8ff3460cde96f053e9ef268efbccac51185d8c46bf5acb064ff9afdf1c.jar
[0%] upload: Upload s3://[bucket-name]/assets/53bc1c8ff3460cde96f053e9ef268efbccac51185d8c46bf5acb064ff9afdf1c.jar
[25%] fail: socket hang up
[25%] start: Publishing 67b7823b74bc135986aa72f889d6a8da058d0c4a20cbc2dfc6f78995fdd2fc24:current
[25%] check: Check s3://[bucket-name]/assets/67b7823b74bc135986aa72f889d6a8da058d0c4a20cbc2dfc6f78995fdd2fc24.zip
[25%] found: Found s3://[bucket-name]/assets/67b7823b74bc135986aa72f889d6a8da058d0c4a20cbc2dfc6f78995fdd2fc24.zip
[50%] success: Published 67b7823b74bc135986aa72f889d6a8da058d0c4a20cbc2dfc6f78995fdd2fc24:current
[50%] start: Publishing 0540b0b3a6863c6d68b73f1f5368b2832a4ffa5dfa77d4308941ee46eff41d21:current
[50%] check: Check s3://[bucket-name]/assets/0540b0b3a6863c6d68b73f1f5368b2832a4ffa5dfa77d4308941ee46eff41d21.jar
[50%] upload: Upload s3://[bucket-name]/assets/0540b0b3a6863c6d68b73f1f5368b2832a4ffa5dfa77d4308941ee46eff41d21.jar
[75%] fail: socket hang up
[75%] start: Publishing 692a0f095ccf744c65ed666353d5a527a0a8a36fa75759113c1da6ccad12f359:current
[75%] check: Check s3://[bucket-name]/assets/692a0f095ccf744c65ed666353d5a527a0a8a36fa75759113c1da6ccad12f359.zip
[75%] found: Found s3://[bucket-name]/assets/692a0f095ccf744c65ed666353d5a527a0a8a36fa75759113c1da6ccad12f359.zip
[100%] success: Published 692a0f095ccf744c65ed666353d5a527a0a8a36fa75759113c1da6ccad12f359:current

 ❌  [Stackname] failed: Error: Failed to publish one or more assets. See the error messages above for more information.
    at Object.publishAssets (/usr/local/lib/node_modules/aws-cdk/lib/util/asset-publishing.ts:25:11)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at Object.deployStack (/usr/local/lib/node_modules/aws-cdk/lib/api/deploy-stack.ts:232:3)
    at CdkToolkit.deploy (/usr/local/lib/node_modules/aws-cdk/lib/cdk-toolkit.ts:180:24)
    at initCommandLine (/usr/local/lib/node_modules/aws-cdk/bin/cdk.ts:210:9)
Failed to publish one or more assets. See the error messages above for more information.
Error: Failed to publish one or more assets. See the error messages above for more information.
    at Object.publishAssets (/usr/local/lib/node_modules/aws-cdk/lib/util/asset-publishing.ts:25:11)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at Object.deployStack (/usr/local/lib/node_modules/aws-cdk/lib/api/deploy-stack.ts:232:3)
    at CdkToolkit.deploy (/usr/local/lib/node_modules/aws-cdk/lib/cdk-toolkit.ts:180:24)
    at initCommandLine (/usr/local/lib/node_modules/aws-cdk/bin/cdk.ts:210:9)

What did you expect to happen?

Successful deployment of artifacts.

What actually happened?

The deployment fails due to upload timeout from cdk to S3

Environment

CDK CLI Version : 1.108.1 (build ae24d8a)
Framework Version: aws-cli/2.2.12 Python/3.9.5 Darwin/20.5.0 source/x86_64 prompt/off
Node.js Version: v16.3.0
OS : macOS Big Sur v. 11.4
Language (Version): Java: openjdk 11.0.10 2021-01-19. Python: Python 3.9.5

Other

This is :bug: Bug Report

whiskeylover commented 3 years ago

I also use Cisco AnyConnect Secure Mobility Client. We had it upgraded to 4.9.06037 a couple weeks ago before the OS upgrade to Big Sur. cdk deploy worked fine with the Cisco client update. It only broke after MacOS update.

ryparker commented 3 years ago

Hey @whiskeylover :wave:

Thanks for opening this issue. I'm currently using the same MacOS and Node version as you with Cisco AnyConnect at v4.9.05042 and am not able to reproduce your error. I'm not saying the issue is with Cisco, but it's something to note.

I've got a couple of questions:

Which language is your CDK code written in?
Have you built the CDK bootstrap stack in your AWS environment? i.e. npx cdk bootstrap
Do you have the CDK installed as a global NPM package? i.e. npm install -g aws-cdk

Some things to try:

Try deploying with npx cdk deploy
Update to the latest CDK version, 1.109.0 (build c647e38): npm install -g aws-cdk
If you use brew, update it by running brew update && brew upgrade

Try deploying one of the template CDK projects

mkdir simple-cdk-app
cd simple-cdk-app
cdk init --language typescript
npm install
npx cdk deploy

Try running with verbose logging: npx cdk deploy --verbose

Any source code could also be useful. Looking forward to working through this with you 😃

whiskeylover commented 3 years ago

Hi Ryan,

Thanks for responding. The Cisco version is something pushed by IT and not something I can choose.

Here are the answers to your questions.

CDK code is written in TS
I just built the bootstrap stack and retried the deployment. It failed.
Yes, CDK is installed globally.

Also

I tried deploying the code with npx cdk deploy and I get a slightly different error.
- Instead of fail: socket hang up or fail: write EPIPE when deploying artifacts to S3 (see my original post), it now says fail: Inaccessible host: '[staging-bucket-name].s3.us-west-2.amazonaws.com'. This service may not be available in the 'us-west-2' region..
- I tried deploying it using the -v switch for verbose output and noticed that the deploy was failing when trying to upload the built artifacts (jar and zip files) to S3. It uploads the zip files successfully, but fails when uploading jar files, for some reason. If I manually upload the jar files to the same location and try again, it goes past the error points, and then fails at deploying cloud formation templates.
- I've found that other people are also having cdk deploy to S3 issues after upgrading to Big Sur. Here's a link.
I updated brew and installed the latest CDK version, and still get the same error.
The sample CDK project built and deployed fine. But I noticed it's not uploading any built artifacts to S3. Our project builds a java project, and uploads the JAR to a lambda. That's where it fails.

Thanks again for responding. Let me know what else I can provide you with.

Ashish

salhadef commented 3 years ago

@whiskeylover I have the same problem. For me, I'm using CDK's aws_lambda.Code.from_asset() to push some Lambda code. It just started to fail if a .so file is included in the zip. Removing the .so makes CDK deploy work fine.

Side note: I did notice that manually updating a Lambda via the console, using a zip file that contains a .so, works fine. I have not tried switching to Code.from_bucket(), pushing the zip to S3 first, and then running CDK; I'm hoping AWS solves the problem before I have to refactor, but it might be a short-term hack if you need a workaround ASAP.

whiskeylover commented 3 years ago

@salhadef I'm glad to know it's not just me :)

rix0rrr commented 3 years ago

We need a differential diagnosis on this. What is the smallest change that will turn a successful scenario into a breaking scenario? It will be hard for us to reproduce; I'm not on Big Sur and I personally haven't seen this.

It does not seem to be the presence of the .so file, since @whiskeylover didn't mention that they had that as well (if so, my guess would have been a virus scanner/firewall type software interfering with the upload).

Is it the presence of a Cisco client? I've seen connections on Cisco VPNs stall due to MTU issues (computer sending packets, and the VPN silently dropping them since the packets are too big). Try setting the MTU to 1280 and seeing if that makes a difference?
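To make the MTU suggestion concrete, here is a small arithmetic sketch of how much TCP payload fits in one packet at a given MTU. It assumes plain IPv4 or IPv6 headers with no options and does not model any extra overhead the VPN tunnel adds, which is unknown here. `maxTcpPayload` is a name made up for illustration.

```typescript
// Header sizes in bytes for packets that carry no IP or TCP options.
const IP_HEADER_BYTES = { ipv4: 20, ipv6: 40 } as const;
const TCP_HEADER_BYTES = 20;

type IpVersion = keyof typeof IP_HEADER_BYTES;

// Largest TCP payload that fits in a single packet of the given MTU.
// VPN encapsulation overhead is NOT included (assumption: unknown here).
function maxTcpPayload(mtu: number, ip: IpVersion = "ipv4"): number {
  const payload = mtu - IP_HEADER_BYTES[ip] - TCP_HEADER_BYTES;
  if (!Number.isInteger(mtu) || payload <= 0) {
    throw new RangeError(`MTU ${mtu} leaves no room for TCP payload`);
  }
  return payload;
}

console.log(maxTcpPayload(1500)); // 1460, with the common Ethernet MTU
console.log(maxTcpPayload(1280)); // 1240, with the MTU suggested above
```

If a tunnel silently drops anything above some size, segments built for a 1500-byte MTU can stall while the smaller ones produced at 1280 still get through, which is why lowering the MTU is a cheap test.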

whiskeylover commented 3 years ago

Hi @rix0rrr, thanks for replying.

I set the MTU to 1280, and still get the same error.

Also, as I mentioned in my previous post, my uploads fail for .jar, but .zip uploads go through. But if I manually upload the files to the S3 location, the deployment proceeds past this point, and gets stuck again at a later point where it's trying to upload the cloud formation template.

Here's a screenshot.

jsauter commented 3 years ago

I am running Cisco AnyConnect v4.9.05042 and am getting the error.

Also note, a colleague @john-shaskin did a parking lot deploy on our network without the VPN and received the same issue.

In comparison to #15278, I receive the disconnect error after asset uploading, when the changeset is being created.

BenChaimberg commented 3 years ago

@whiskeylover @salhadef How large are the bundled assets that you are trying to upload? Might be helpful for reproduction

whiskeylover commented 3 years ago

@BenChaimberg my jar files are around 59 MB.

BenChaimberg commented 3 years ago

I was unable to successfully reproduce this error by creating a similarly-sized asset (~67 MB)

macOS: 11.4 (20F71) CLI: 1.109.0 Framework: 1.109.0 Cisco AnyConnect: 4.9.05042

Creating a large asset:

mkdir assets
hexdump -n 20000000 </dev/random >./assets/randhex

Stack definition:

import { Construct, Stack, StackProps } from '@aws-cdk/core';
import * as s3assets from '@aws-cdk/aws-s3-assets';

export class TestAppStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    new s3assets.Asset(this, 'Large', {
      path: 'assets/randhex',
    });
  }
}

Deploy:

VeryLargeAsset: deploying...
[0%] start: Publishing 95d1804d2f15c28f85b4e56f39f8af258acf2678399408fe9f0e605807dafc16:current
[100%] success: Published 95d1804d2f15c28f85b4e56f39f8af258acf2678399408fe9f0e605807dafc16:current
VeryLargeAsset: creating CloudFormation changeset...

 ✅  VeryLargeAsset

Stack ARN:
arn:aws:cloudformation:us-west-2:111122223333:stack/VeryLargeAsset/3680f040-d528-11eb-aa46-0633937de7fd

jsauter commented 3 years ago

Ok, I modified my huge template creator app that was failing on my work macbooks on 11.4 with the ECONNRESET errors, pushed it to github (https://github.com/jsauter/bigcdktest) and cloned it to my personal macbook on Big Sur 11.4. You guessed it, works fine, even with a massive template.

Probably has something to do with AMP/Anyconnect messing with the connection... thoughts?

Please note, I am still not getting errors with the s3 portion, just the changeset creation after the asset upload.

rix0rrr commented 3 years ago

Please note, I am still not getting errors with the s3 portion, just the changeset creation after the asset upload.

That is even weirder...

rix0rrr commented 3 years ago

Could someone who's experiencing this take out tcpdump or Wireshark and see if there's something particular they can figure out about the packets? Maybe there's one that gets dropped with a particular flag, or a particular size, or a particular byte sequence in them...

Not a network engineer myself so it's hard for me to give more guidance than that, but it might be a worthwhile avenue of exploration...

rix0rrr commented 3 years ago

Some other things to try out:

After synthing with assets that trigger the issue:

Does running cdk-assets on the cdk.out directory directly reproduce the issue as well?
Does running aws s3 cp reproduce the issue?

What's the difference between those cases at the wire level?

rix0rrr commented 3 years ago

Also, as I mentioned in my previous post, my uploads fail for .jar, but .zip uploads go through

Are they the same file, just with a different extension? Or do they have different contents? Sizes?

whiskeylover commented 3 years ago

Does running aws s3 cp reproduce the issue?

No. Manual copy succeeds. If I retry cdk deploy after a manual copy, the installation proceeds beyond the error points, but then fails later at the template creation phase.

Are they the same file, just with a different extension? Or do they have different contents? Sizes?

They're different. The .zip files are very small (a few MBs). The .jar files are usually 50+ MBs.

If you look at my first post, you'll see that there are 4 assets to be uploaded (1st is a jar file, 2nd is a zip, 3rd is a jar and the 4th is a zip). They're all different files. The two jars fail and the two zips succeed.
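The pattern described here (the jars fail, the zips are already present) can be read off the progress output mechanically. Below is a minimal sketch, assuming the log keeps the `[N%] event: detail` shape shown in the original post. `summarizePublishLog` and `AssetOutcome` are hypothetical names for illustration and are not part of the CDK; the sample log uses shortened placeholder hashes.

```typescript
// Outcome of one asset, reconstructed from `cdk deploy` progress lines.
interface AssetOutcome {
  key: string;        // S3 object key, e.g. "assets/<hash>.jar"
  extension: string;  // file extension without the dot
  status: "uploaded" | "found" | "failed" | "unknown";
  error?: string;     // text after "fail:", when present
}

function summarizePublishLog(log: string): AssetOutcome[] {
  const outcomes: AssetOutcome[] = [];
  let current: AssetOutcome | undefined;
  let uploading = false;

  for (const rawLine of log.split("\n")) {
    const match = /^\[\d+%\]\s+(\w+):\s+(.*)$/.exec(rawLine.trim());
    if (!match) continue;
    const event = match[1] ?? "";
    const detail = match[2] ?? "";

    if (event === "check") {
      // A "check" line names the object, so it opens a new record.
      const key = detail.replace(/^Check s3:\/\/[^/]+\//, "");
      const dot = key.lastIndexOf(".");
      current = { key, extension: dot >= 0 ? key.slice(dot + 1) : "", status: "unknown" };
      outcomes.push(current);
      uploading = false;
    } else if (current && event === "upload") {
      uploading = true;
    } else if (current && event === "found") {
      current.status = "found";
    } else if (current && event === "fail") {
      current.status = "failed";
      current.error = detail;
    } else if (current && event === "success" && uploading) {
      current.status = "uploaded";
    }
  }
  return outcomes;
}

// Abbreviated sample in the shape of the original post (hashes shortened).
const sampleLog = [
  "[0%] start: Publishing hash1:current",
  "[0%] check: Check s3://[bucket-name]/assets/hash1.jar",
  "[0%] upload: Upload s3://[bucket-name]/assets/hash1.jar",
  "[25%] fail: socket hang up",
  "[25%] start: Publishing hash2:current",
  "[25%] check: Check s3://[bucket-name]/assets/hash2.zip",
  "[25%] found: Found s3://[bucket-name]/assets/hash2.zip",
  "[50%] success: Published hash2:current",
].join("\n");

for (const outcome of summarizePublishLog(sampleLog)) {
  console.log(`${outcome.key}: ${outcome.status}${outcome.error ? ` (${outcome.error})` : ""}`);
}
```

Note that a "found" asset was never uploaded in that run, so the zips succeeding says nothing about whether a zip upload would have gone through.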

rix0rrr commented 3 years ago

It's probably not about the .jar but about the size, then. As in, any asset large enough would show the same behavior.

It would be great if we could make a statement like:

Assets bigger than 73 MB trigger this error, smaller assets don't.

Or whatever the exact number is.
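One way to arrive at such a number is to bisect between a size known to work and a size known to fail. The sketch below assumes failures are monotonic in size (everything above some threshold fails), which this thread has not established. `bisectFailingSize` and the probe are hypothetical; a real probe would generate a file of the requested size, run a deploy, and report whether it succeeded.

```typescript
// A probe reports whether deploying an asset of `sizeBytes` succeeded.
type SizeProbe = (sizeBytes: number) => boolean;

interface Threshold {
  largestPassing: number;   // biggest size observed to succeed
  smallestFailing: number;  // smallest size observed to fail
}

// Binary search for the boundary, assuming sizes fail monotonically.
function bisectFailingSize(
  probe: SizeProbe,
  knownGood: number,
  knownBad: number,
  resolution: number = 1e6,
): Threshold {
  if (!(knownGood < knownBad)) throw new RangeError("knownGood must be smaller than knownBad");
  if (!(resolution >= 1)) throw new RangeError("resolution must be at least 1 byte");
  // Verify both endpoints first so a wrong starting assumption fails fast.
  if (!probe(knownGood)) throw new Error(`size ${knownGood} was expected to pass but failed`);
  if (probe(knownBad)) throw new Error(`size ${knownBad} was expected to fail but passed`);

  let good = knownGood;
  let bad = knownBad;
  while (bad - good > resolution) {
    const mid = Math.floor((good + bad) / 2);
    if (probe(mid)) {
      good = mid;
    } else {
      bad = mid;
    }
  }
  return { largestPassing: good, smallestFailing: bad };
}

// Simulated probe with a made-up 40 MB cutoff, purely to exercise the search.
const simulatedProbe: SizeProbe = (size) => size <= 40e6;
console.log(bisectFailingSize(simulatedProbe, 3e6, 59e6));
```

Starting points of roughly 3 MB (a zip that worked) and 59 MB (a jar that failed) come from the sizes reported above; with a 1 MB resolution the search needs about six deploy attempts after checking the two endpoints.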

jsauter commented 3 years ago

Could someone who's experiencing this take out tcpdump or Wireshark and see if there's something particular they can figure out about the packets? Maybe there's one that gets dropped with a particular flag, or a particular size, or a particular byte sequence in them...

Not a network engineer myself so it's hard for me to give more guidance than that, but it might be a worthwhile avenue of exploration...

Oh man, I know I am in trouble when I crack open wireshark. I will see what I can figure out.

jsauter commented 3 years ago

So, filtering on cloudformation and then the IP of the amazon server, on the failed changeset creations we get a RST request back from AWS. On the succeeding cases, we do not get the reset.

Failure

Success

jsauter commented 3 years ago

Yes, confirmed it is a cisco amp/anyconnect issue. I was able to unload the cisco processes and my test project deployed fine. I will run this by my desktop team to see if there are any updates we can try.

whiskeylover commented 3 years ago

Yes, confirmed it is a cisco amp/anyconnect issue. I was able to unload the cisco processes and my test project deployed fine. I will run this by my desktop team to see if there are any updates we can try.

I get the same errors when trying to deploy with the VPN disconnected.

jsauter commented 3 years ago

Yes, confirmed it is a cisco amp/anyconnect issue. I was able to unload the cisco processes and my test project deployed fine. I will run this by my desktop team to see if there are any updates we can try.

I get the same errors when trying to deploy with the VPN disconnected.

I don't think it is the Cisco VPN specifically.

I ran the following:

sudo launchctl unload /Library/LaunchDaemons/com.cisco.amp.daemon.plist
sudo launchctl unload /Library/LaunchDaemons/com.cisco.amp.updater.plist
sudo launchctl unload /Library/LaunchDaemons/com.cisco.anyconnect.vpnagentd.plist

and then killed these 3 processes in Activity Monitor:

After that, I ran my large template app and it succeeded in creating the change set.

Note, I am deploying this test template to my personal account since I nuked the VPN with the unload. You would not be able to deploy to your organization's account if you require the VPN to be there.

I checked my Cisco AMP version and I am at 'latest' according to what we have published at the office. I am not sure at this point how to actually fix the issue.

rix0rrr commented 3 years ago

@jsauter thanks for posting that trace. It looks like it's only half of the connection--only the parts that the server sends back.

Do you still have the other part? I'm guessing there's something the client sends that the server doesn't like...

rix0rrr commented 3 years ago

Also... The issue seems to be during connection termination???

rix0rrr commented 3 years ago

I asked our internal MacOS engineering team and this is what they said:

Big Sur is when companies started switching to the new network system extension and away from kernel extensions. The SysExt is a new feature and thus not as matured as kernel extensions. I've also noticed some odd network behavior from Cisco AnyConnect recently and at least one report of AnyConnect breaking something network related even when it wasn't connected.

If you have some internal customers that can reproduce the issues I'd like to open an AppleCare case and have Apple take a look at the logs.

Another thing that may help is updating AnyConnect from 4.9 to 4.10. I noticed a significant decrease in network issues after doing that.

If anyone in this thread who is experiencing this problem is an Amazon employee, reach out to me privately and we will work with the Mac SME to figure out how to send a log to Apple.

whiskeylover commented 3 years ago

One thing I noticed was that upgrading AnyConnect to 4.9.06037 from an earlier version of 4.8.xx didn't pose any issues for me. It was only after upgrading my OS to Big Sur that I started getting the cdk errors.

rix0rrr commented 3 years ago

@whiskeylover our mac expert said that in Big Sur AnyConnect starts using a different mechanism to hook into the kernel.

So you would only see it when both factors are true: recent AnyConnect PLUS Big Sur, and you will see it starting with whatever update comes last.

jsauter commented 3 years ago

@rix0rrr

Here are both sides of the conversation with the failure, exported as a wireshark file.

AWSConversationWithIssue.pcapng.zip

rix0rrr commented 3 years ago

For the people that are affected by this, can you confirm whether or not the code tested by @BenChaimberg reproduces the issue for you? https://github.com/aws/aws-cdk/issues/15231#issuecomment-867927383

We are unable to reproduce using that (multiple people) but maybe it also doesn't reproduce for you all with that?

jsauter commented 3 years ago

For the people that are affected by this, can you confirm whether or not the code tested by @BenChaimberg reproduces the issue for you? #15231 (comment)

We are unable to reproduce using that (multiple people) but maybe it also doesn't reproduce for you all with that?

This is what I received:

cdk deploy --profile saml
BigfiletestStack: deploying...
[0%] start: Publishing 322ae385170560790077d66b6be8253af9b495b8f4ef58e3625ed252ff17be74:current
[100%] fail: write EPIPE

 ❌  BigfiletestStack failed: Error: Failed to publish one or more assets. See the error messages above for more information.
    at Object.publishAssets (/Users/jsauter/.nvm/versions/node/v12.18.3/lib/node_modules/aws-cdk/lib/util/asset-publishing.ts:25:11)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at Object.deployStack (/Users/jsauter/.nvm/versions/node/v12.18.3/lib/node_modules/aws-cdk/lib/api/deploy-stack.ts:232:3)
    at CdkToolkit.deploy (/Users/jsauter/.nvm/versions/node/v12.18.3/lib/node_modules/aws-cdk/lib/cdk-toolkit.ts:180:24)
    at initCommandLine (/Users/jsauter/.nvm/versions/node/v12.18.3/lib/node_modules/aws-cdk/bin/cdk.ts:212:9)
Failed to publish one or more assets. See the error messages above for more information.

jsauter commented 3 years ago

@rix0rrr

Here are two cases using @BenChaimberg's test. One is a larger 67mb file, the other is a small 67kb file.

deployFailureLargerFile.txt deployFailureSmallerFile.txt

rix0rrr commented 3 years ago

Did a small analysis of the end of the connection here. The root cause is that the server apparently spontaneously decides to close the connection while the client is still happily sending, and this manifests in the way we're seeing (with a connection reset):

Question is now--WHY does the server decide to spontaneously close the connection? At this point, the server has sent more (5418 bytes) than the client (2411 bytes).

rix0rrr commented 3 years ago

There are 3 connection resets in the captured trace. All of them are at exactly the same place in the stream. This must be independent of the files sent, but still happening during the HTTP request phase of the transmission (before the actual upload)

jsauter commented 3 years ago

Let me try to run the same scenario, with my Catalina machine, and I will record it with wireshark. Since it is working, perhaps there will be some evidence of Cisco munging things up.

rix0rrr commented 3 years ago

Also all of them have the exact same "spurious retransmission" identified by WireShark... pretty sus!

rix0rrr commented 3 years ago

You might also try using https://mitmproxy.org/ to capture the traffic in readable form and see if that turns anything up.

Quick-fire instructions:

$ pip3 install mitmproxy
$ mitmdump -p 8080 --ssl-insecure -v -ddd

# different tab
$ export https_proxy=http://localhost:8080/
$ export NODE_TLS_REJECT_UNAUTHORIZED=0
$ cdk deploy ...

rix0rrr commented 3 years ago

After 1200 bytes sent by the client, the server decides they don't like the client. Wonder what takes 1200 bytes to send in a TLS connection... maybe mitmproxy can tell us.

jsauter commented 3 years ago

Alright, this is pretty wild... When I have the mitmdump proxy running, I don't receive the ECONNRESET error and it deploys correctly. Not sure if it is the mitmdump process or the exports you had me put in. 🤣

FYI I had to remove the -ddd as it was not being accepted.

rix0rrr commented 3 years ago

FYI I had to remove the -ddd as it was not being accepted.

Ah sorry, it might have been -vvv

rix0rrr commented 3 years ago

Alright, this is pretty wild... When I have the mitmdump proxy running, I don't receive the ECONNRESET error and it deploys correctly. Not sure if it is the mitmdump process or the exports you had me put in. 🤣

Most likely it's because it's a different process doing the actual connection over the VPN to AWS; it's now the mitmproxy doing the actual connection, instead of node.

That's terrible though, without being able to inspect the actual data stream this is going to be impossible to debug.

rix0rrr commented 3 years ago

Could this be related? https://github.com/nodejs/node/issues/36826

Are you on M1 or Intel architecture?

jsauter commented 3 years ago

Could this be related? nodejs/node#36826

Are you on M1 or Intel architecture?

We have been all on Intel thus far.

jsauter commented 3 years ago

It does look similar enough that I wonder if it could be related.

jsauter commented 3 years ago

Here is the same test wireshark conversation, working in Catalina, with the cisco stuff running.

workingCatalina.pcapng.zip

rix0rrr commented 3 years ago

The sequence numbers in that trace are slightly off from the other one. For whatever reason, the sequence numbers of packets in the general vicinity of where the Big Sur trace would cut off are [419, 545, 1089] instead of [519, 1199]. No idea if that's even relevant, though.

@jsauter, I'm very sorry, but I don't know how to help anymore.

jsauter commented 3 years ago

@jsauter, I'm very sorry, but I don't know how to help anymore.

Did you all open a ticket with Apple and/or Cisco?

SeriousAnt commented 3 years ago

You might also try using https://mitmproxy.org/ to capture the traffic in readable form and see if that turns anything up.

Quick-fire instructions:
$ pip3 install mitmproxy
$ mitmdump -p 8080 --ssl-insecure -v -ddd

# different tab
$ export https_proxy=http://localhost:8080/
$ export NODE_TLS_REJECT_UNAUTHORIZED=0
$ cdk deploy ...

This is really interesting @rix0rrr. I was able to publish and deploy all assets using the proxy as you have described. Without the proxy I get socket hangup or Inaccessible host ... errors

OSX: 11.4 (20F71) Node: v14.15.3 CDK: 1.85

Will use this as a workaround till I figure it out 🙈

whiskeylover commented 3 years ago

@SeriousAnt, same. The proxy works for me, and I'll use it as a workaround for now.
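For anyone scripting this workaround instead of exporting the variables by hand, here is a minimal sketch that builds the environment used in the quick-fire instructions. It assumes a proxy such as mitmdump is already listening on the given local port; `withLocalProxy` is a name made up for illustration and is not a CDK API.

```typescript
type Env = Record<string, string | undefined>;

// Returns a copy of `base` with the two variables from the instructions
// above added. The input object is left untouched.
function withLocalProxy(base: Env, port: number = 8080): Env {
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError(`invalid port: ${port}`);
  }
  return {
    ...base,
    // Sends the CLI's HTTPS traffic through the local proxy.
    https_proxy: `http://localhost:${port}/`,
    // Makes Node accept the proxy's certificate by disabling TLS verification.
    NODE_TLS_REJECT_UNAUTHORIZED: "0",
  };
}

console.log(withLocalProxy({ AWS_PROFILE: "example" }));
```

The result can be passed as the `env` option when starting the deploy from a Node script, for example `spawnSync("cdk", ["deploy"], { env: withLocalProxy(process.env), stdio: "inherit" })` from `child_process`. Scoping the variables to that one child process keeps TLS verification switched off only for the deploy rather than for the whole shell session.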

rix0rrr commented 3 years ago

Did you all open a ticket with Apple and/or Cisco?

No, we need an internal Amazonian to get in touch with our client engineering folk to set up a useful trace, and so far no one has stepped up.
