Closed whiskeylover closed 2 years ago
I also use Cisco AnyConnect Secure Mobility Client. We had it upgraded to 4.9.06037 a couple weeks ago before the OS upgrade to Big Sur. cdk deploy worked fine with the Cisco client update. It only broke after MacOS update.
Hey @whiskeylover :wave:
Thanks for opening this issue. I'm currently using the same MacOS and Node version as you with Cisco AnyConnect at v4.9.05042 and am not able to reproduce your error. Not saying the issue is with Cisco but just something to note.
I've got a couple of questions:
npx cdk bootstrap
npm install -g aws-cdk
Some things to try:
npx cdk deploy
npm install -g aws-cdk
brew upgrade & brew update
Try deploying one of the template CDK projects
mkdir simple-cdk-app
cd simple-cdk-app
cdk init --language typescript
npm install
npx cdk deploy
npx cdk deploy --verbose
Any source code could also be useful. Looking forward to working through this with you 😃
Hi Ryan,
Thanks for responding. The Cisco version is something pushed by IT and not something I can choose.
Here are the answers to your questions.
Also
npx cdk deploy
and I get a slightly different error.
fail: socket hang up
or fail: write EPIPE
when deploying artifacts to S3 (see my original post), it now says fail: Inaccessible host: '[staging-bucket-name].s3.us-west-2.amazonaws.com'. This service may not be available in the 'us-west-2' region.
. -v
switch for verbose output and noticed that the deploy was failing when trying to upload the built artifacts (jar and zip files) to S3. It uploads the zip files successfully, but fails when uploading jar files, for some reason. If I manually upload the jar files to the same location and try again, it goes past the error points, and then fails at deploying CloudFormation templates.
Thanks again for responding. Let me know what else I can provide you with.
Ashish
@whiskeylover I have the same problem. For me, I'm using CDK's aws_lambda.Code.from_asset() to push some Lambda code. It just started to fail if a .so file is included in the zip. Removing the .so makes CDK deploy work fine.
Side note, I did notice that manually updating a lambda via the console using a zip file that contains a .so works fine. I have not tried switching to Code.from_bucket(), pushing the zip to S3 first and then running CDK; I'm hoping AWS solves the problem before I have to refactor, but it might be a short-term hack if you need a workaround ASAP.
@salhadef I'm glad to know it's not just me :)
We need a differential diagnosis on this. What is the smallest change that will turn a successful scenario into a breaking scenario? It will be hard for us to reproduce; I'm not on Big Sur and I personally haven't seen this.
It does not seem to be the presence of the .so file, since @whiskeylover didn't mention that they had that as well (if so, my guess would have been a virus scanner/firewall type software interfering with the upload).
Is it the presence of a Cisco client? I've seen connections on Cisco VPNs stall due to MTU issues (computer sending packets, and the VPN silently dropping them since the packets are too big). Try setting the MTU to 1280 and seeing if that makes a difference?
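For reference, a rough sketch of the arithmetic behind the MTU suggestion. It assumes minimum-size headers (20-byte IPv4 or 40-byte IPv6, 20-byte TCP, no options) and ignores whatever encapsulation overhead the VPN adds, which is not known from this thread. The function name `maxTcpPayload` is made up for illustration. 1280 is the minimum MTU every IPv6 link must support, which makes it a conservative value to test with.

```typescript
// Minimum header sizes; real packets can carry options that make these larger.
const TCP_HEADER_BYTES = 20;
const IP_HEADER_BYTES = { 4: 20, 6: 40 } as const;

/** Largest TCP payload that fits in one packet of the given MTU (hypothetical helper). */
export function maxTcpPayload(mtu: number, ipVersion: 4 | 6 = 4): number {
  return mtu - IP_HEADER_BYTES[ipVersion] - TCP_HEADER_BYTES;
}

console.log(maxTcpPayload(1500)); // 1460, typical Ethernet MTU
console.log(maxTcpPayload(1280)); // 1240, the value suggested above
console.log(maxTcpPayload(1280, 6)); // 1220 over IPv6
```

If a tunnel silently drops anything above some size between those two payload limits, lowering the MTU would be expected to change the behavior; if it changes nothing, packet size is probably not the trigger.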
Hi @rix0rrr, thanks for replying.
I set the MTU to 1280, and still get the same error.
Also, as I mentioned in my previous post, my uploads fail for .jar, but .zip uploads go through. But if I manually upload the files to the S3 location, the deployment proceeds past this point, and gets stuck again at a later point where it's trying to upload the CloudFormation template.
Here's a screenshot.
I am running Cisco AnyConnect v4.9.05042 and am getting the error.
Also note, a colleague @john-shaskin did a parking lot deploy on our network without the VPN and received the same issue.
In comparison to #15278, I receive the disconnect error after asset uploading, when the changeset is being created.
@whiskeylover @salhadef How large are the bundled assets that you are trying to upload? Might be helpful for reproduction
@BenChaimberg my jar files are around 59 MB.
I was unable to successfully reproduce this error by creating a similarly-sized asset (~67 MB)
macOS: 11.4 (20F71) CLI: 1.109.0 Framework: 1.109.0 Cisco AnyConnect: 4.9.05042
Creating a large asset:
mkdir assets
hexdump -n 20000000 </dev/random >./assets/randhex
Stack definition:
import { Construct, Stack, StackProps } from '@aws-cdk/core';
import * as s3assets from '@aws-cdk/aws-s3-assets';
export class TestAppStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    new s3assets.Asset(this, 'Large', {
      path: 'assets/randhex',
    });
  }
}
Deploy:
VeryLargeAsset: deploying...
[0%] start: Publishing 95d1804d2f15c28f85b4e56f39f8af258acf2678399408fe9f0e605807dafc16:current
[100%] success: Published 95d1804d2f15c28f85b4e56f39f8af258acf2678399408fe9f0e605807dafc16:current
VeryLargeAsset: creating CloudFormation changeset...
✅ VeryLargeAsset
Stack ARN:
arn:aws:cloudformation:us-west-2:111122223333:stack/VeryLargeAsset/3680f040-d528-11eb-aa46-0633937de7fd
Ok, I modified my huge template creator app that was failing on my work macbooks on 11.4 with the ECONNRESET errors, pushed it to github (https://github.com/jsauter/bigcdktest) and cloned it to my personal macbook on Big Sur 11.4. You guessed it, works fine, even with a massive template.
Probably has something to do with AMP/Anyconnect messing with the connection... thoughts?
Please note, I am still not getting errors with the s3 portion, just the changeset creation after the asset upload.
That is even weirder...
Could someone who's experiencing this take out tcpdump or wireshark and see if there's something particular they can figure out about the packets? Maybe there's one that gets dropped with a particular flag, or a particular size, or a particular byte sequence in them...
Not a network engineer myself so it's hard for me to give more guidance than that, but it might be a worthwhile avenue of exploration...
Some other things to try out:
After synthing with assets that trigger the issue:
Does running cdk-assets on the cdk.out directory directly reproduce the issue as well?
Does running aws s3 cp reproduce the issue?
What's the difference between those cases at the wire level?
Also, as I mentioned in my previous post, my uploads fail for .jar, but .zip uploads go through
Are they the same file, just with a different extension? Or do they have different contents? Sizes?
Does running aws s3 cp reproduce the issue?
No. Manual copy succeeds. If I retry cdk deploy after a manual copy, the installation proceeds beyond the error points, but then fails later at the template creation phase.
Are they the same file, just with a different extension? Or do they have different contents? Sizes?
They're different. The .zip files are very small (a few MBs). The .jar files are usually 50+ MBs.
If you look at my first post, you'll see that there are 4 assets to be uploaded (1st is a jar file, 2nd is a zip, 3rd is a jar and the 4th is a zip). They're all different files. The two jars fail and the two zips succeed.
It's probably not about the .jar but about the size, then. As in, any asset large enough would show the same behavior.
It would be great if we could make a statement like:
Assets larger than 73 MB trigger this error, smaller assets don't.
Or whatever the exact number is.
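One way to pin that number down is to bisect on asset size. The sketch below is only an illustration. It assumes failure is monotonic in size (every asset above the threshold fails), and all three function names are made up here. `deployFails` stands in for "synth and deploy a stack with an asset of this size and see whether it breaks", which in practice would be a manual step or a wrapper around `cdk deploy`.

```typescript
import { randomBytes } from "node:crypto";
import { closeSync, mkdirSync, openSync, statSync, writeSync } from "node:fs";
import { join } from "node:path";

const MB = 1_000_000;

/** Write `sizeBytes` of random data to `filePath`, 1 MB at a time; returns the size on disk. */
export function writeRandomAsset(filePath: string, sizeBytes: number): number {
  const fd = openSync(filePath, "w");
  try {
    for (let left = sizeBytes; left > 0; left -= MB) {
      writeSync(fd, randomBytes(Math.min(MB, left)));
    }
  } finally {
    closeSync(fd);
  }
  return statSync(filePath).size;
}

/** Create one random probe file per requested size (in MB) under `dir`. */
export function writeProbeAssets(dir: string, sizesMb: number[]): string[] {
  mkdirSync(dir, { recursive: true });
  return sizesMb.map((mb) => {
    const filePath = join(dir, `rand-${mb}mb.bin`);
    writeRandomAsset(filePath, mb * MB);
    return filePath;
  });
}

/**
 * Binary search for the smallest failing size, given one size known to pass
 * and one known to fail. `deployFails` is a hypothetical stand-in for a real
 * deploy attempt; `resolution` is how close the two bounds must get.
 */
export function smallestFailingSize(
  knownPass: number,
  knownFail: number,
  deployFails: (size: number) => boolean,
  resolution = 1,
): number {
  let lo = knownPass;
  let hi = knownFail;
  while (hi - lo > resolution) {
    const mid = Math.floor((lo + hi) / 2);
    if (deployFails(mid)) {
      hi = mid;
    } else {
      lo = mid;
    }
  }
  return hi;
}
```

Starting from the sizes reported in this thread (zips of a few MB pass, jars around 59 MB fail), about six deploy attempts would narrow the threshold to within 1 MB. If the results turn out not to be monotonic, that in itself says size is not the real trigger.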
Oh man, I know I am in trouble when I crack open wireshark. I will see what I can figure out.
So, filtering on cloudformation and then the IP of the amazon server, on the failed changeset creations we get a RST request back from AWS. On the succeeding cases, we do not get the reset.
Failure
Success
Yes, confirmed it is a cisco amp/anyconnect issue. I was able to unload the cisco processes and my test project deployed fine. I will run this by my desktop team to see if there are any updates we can try.
I get the same errors when trying to deploy with the VPN disconnected.
I don't think it is the Cisco VPN specifically.
I ran the following:
sudo launchctl unload /Library/LaunchDaemons/com.cisco.amp.daemon.plist
sudo launchctl unload /Library/LaunchDaemons/com.cisco.amp.updater.plist
sudo launchctl unload /Library/LaunchDaemons/com.cisco.anyconnect.vpnagentd.plist
and then killed these 3 processes in Activity Monitor:
After that, I ran my large template app and it succeeded in creating the change set.
Note, I am deploying this test template to my personal account since I nuked the VPN with the unload. You would not be able to deploy to your organization's account if you require the VPN to be there.
I checked my Cisco AMP version and I am at 'latest' according to what we have published at the office. I am not sure at this point how to actually fix the issue.
@jsauter thanks for posting that trace. It looks like it's only half of the connection--only the parts that the server sends back.
Do you still have the other part? I'm guessing there's something the client sends that the server doesn't like...
Also... The issue seems to be during connection termination???
I asked our internal MacOS engineering team and this is what they said:
Big Sur is when companies started switching to the new network system extension and away from kernel extensions. The SysExt is a new feature and thus not as matured as kernel extensions. I've also noticed some odd network behavior from Cisco AnyConnect recently and at least one report of AnyConnect breaking something network related even when it wasn't connected.
If you have some internal customers that can reproduce the issues I'd like to open an AppleCare case and have Apple take a look at the logs.
Another thing that may help is updating AnyConnect from 4.9 to 4.10. I noticed a significant decrease in network issues after doing that.
If anyone in this thread who is experiencing this problem is an Amazon employee, reach out to me privately and we will work with the Mac SME to figure out how to send a log to Apple.
One thing I noticed was that upgrading AnyConnect to 4.9.06037 from an earlier version of 4.8.xx didn't pose any issues for me. It was only after upgrading my OS to Big Sur that I started getting the cdk errors.
@whiskeylover our mac expert said that in Big Sur AnyConnect starts using a different mechanism to hook into the kernel.
So you would only see it when both factors are true: recent AnyConnect PLUS Big Sur, and you will see it starting with whatever update comes last.
@rix0rrr
Here are both sides of the conversation with the failure, exported as a wireshark file.
For the people that are affected by this, can you confirm whether or not the code tested by @BenChaimberg reproduces the issue for you? https://github.com/aws/aws-cdk/issues/15231#issuecomment-867927383
We are unable to reproduce using that (multiple people) but maybe it also doesn't reproduce for you all with that?
This is what I received:
cdk deploy --profile saml
BigfiletestStack: deploying...
[0%] start: Publishing 322ae385170560790077d66b6be8253af9b495b8f4ef58e3625ed252ff17be74:current
[100%] fail: write EPIPE
❌ BigfiletestStack failed: Error: Failed to publish one or more assets. See the error messages above for more information.
    at Object.publishAssets (/Users/jsauter/.nvm/versions/node/v12.18.3/lib/node_modules/aws-cdk/lib/util/asset-publishing.ts:25:11)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at Object.deployStack (/Users/jsauter/.nvm/versions/node/v12.18.3/lib/node_modules/aws-cdk/lib/api/deploy-stack.ts:232:3)
    at CdkToolkit.deploy (/Users/jsauter/.nvm/versions/node/v12.18.3/lib/node_modules/aws-cdk/lib/cdk-toolkit.ts:180:24)
    at initCommandLine (/Users/jsauter/.nvm/versions/node/v12.18.3/lib/node_modules/aws-cdk/bin/cdk.ts:212:9)
Failed to publish one or more assets. See the error messages above for more information.
@rix0rrr
Here are two cases using @BenChaimberg's test. One is a larger 67 MB file, the other is a small 67 KB file.
Did a small analysis of the end of the connection here. The root cause is that the server apparently spontaneously decides to close the connection while the client is still happily sending, and this manifests in the way we're seeing (with a connection reset):
Question is now--WHY does the server decide to spontaneously close the connection? At this point, the server has sent more (5418 bytes) than the client (2411 bytes).
There are 3 connection resets in the captured trace. All of them are at exactly the same place in the stream. This must be independent of the files sent, but still happening during the HTTP request phase of the transmission (before the actual upload)
Let me try to run the same scenario, with my Catalina machine, and I will record it with wireshark. Since it is working, perhaps there will be some evidence of Cisco munging things up.
Also all of them have the exact same "spurious retransmission" identified by WireShark... pretty sus!
You might also try using https://mitmproxy.org/ to capture the traffic in readable form and see if that turns anything up.
Quick-fire instructions:
$ pip3 install mitmproxy
$ mitmdump -p 8080 --ssl-insecure -v -ddd
# different tab
$ export https_proxy=http://localhost:8080/
$ export NODE_TLS_REJECT_UNAUTHORIZED=0
$ cdk deploy ...
After 1200 bytes sent by the client, the server decides they don't like the client. Wonder what takes 1200 bytes to send in a TLS connection... maybe mitmproxy can tell us.
Alright, this is pretty wild... When I have the mitmdump proxy running, I don't receive the ECONNRESET error and it deploys correctly. Not sure if it is the mitmdump process or the exports you had me put in. 🤣
FYI I had to remove the -ddd as it was not being accepted.
Ah sorry, it might have been -vvv
Most likely it's because it's a different process doing the actual connection over the VPN to AWS; it's now the mitmproxy doing the actual connection, instead of node.
That's terrible though, without being able to inspect the actual data stream this is going to be impossible to debug.
Could this be related? https://github.com/nodejs/node/issues/36826
Are you on M1 or Intel architecture?
We have been all on Intel thus far.
It does look similar enough that I wonder if it could be related.
Here is the same test wireshark conversation, working in Catalina, with the cisco stuff running.
The sequence numbers in that trace are slightly off from the other one. For whatever reason, the sequence numbers of packets in the general vicinity of where the Big Sur trace would cut off are [419, 545, 1089] instead of [519, 1199]. No idea if that's even relevant, though.
@jsauter, I'm very sorry, but I don't know how to help anymore.
Did you all open a ticket with Apple and/or Cisco?
This is really interesting @rix0rrr. I was able to publish and deploy all assets using the proxy as you have described. Without the proxy I get socket hangup or Inaccessible host ... errors
OSX: 11.4 (20F71) Node: v14.15.3 CDK: 1.85
Will use this as a workaround till I figure it out 🙈
@SeriousAnt, same. The proxy works for me, and I'll use it as a workaround for now.
Did you all open a ticket with Apple and/or Cisco?
No, we need an internal Amazonian to get in touch with our client engineering folk to set up a useful trace, and so far no one has stepped up.
Deployment using cdk deploy fails with fail: socket hang up. This has started happening since upgrading to Big Sur (MacOS). I opened a support ticket with AWS, and they directed me to here.
I have upgraded npm and node and aws utils to the latest versions as asked by the AWS support.
Reproduction Steps
cdk deploy
Output: -
What did you expect to happen?
Successful deployment of artifacts.
What actually happened?
The deployment fails due to upload timeout from cdk to S3
Environment
Other
This is :bug: Bug Report