aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

SCP "connection corrupted" on large file transfer #274

Open edwdev opened 4 years ago

edwdev commented 4 years ago

When trying to SCP from a local machine to an EC2 instance using Session Manager (ProxyCommand is set up in my SSH config), I seem to have issues sending larger files (60 MB, for example). When the upload reaches 8624 KB, I get the following error.

Bad packet length 2916819926.
ssh_dispatch_run_fatal: Connection to UNKNOWN port 65535: Connection corrupted
lost connection

My local connection is stable, so I'm just wondering if I'm hitting any hard limits?
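
For reference, the ProxyCommand setup follows the usual pattern from the AWS Session Manager documentation, roughly (a sketch; host wildcards and shell invocation may differ on your machine):

    # ~/.ssh/config — route SSH to instance IDs through Session Manager
    Host i-* mi-*
        ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"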

asterikx commented 4 years ago

I'm experiencing the same error when SSHing to an EC2 instance to access my database. The error occurs after some period of time (~30 minutes).

Bad packet length 2089525630.
ssh_dispatch_run_fatal: Connection to UNKNOWN port 65535: Connection corrupted

@edwdev were you able to solve the issue?

rscottwatson commented 4 years ago

I would like to know more about this as I am facing a similar issue as well. I am not transferring a file, just using Session Manager to access my RDS database, and I frequently get this. I am connecting from my Mac, in case there is something with Macs that is causing this issue.

asterikx commented 4 years ago

@rscottwatson I'm using macOS as well and am still experiencing the issue.

rscottwatson commented 4 years ago

@asterikx So I added the following to my .ssh/config file and I have not seen that error yet.

    Host i-* mi-*
        ServerAliveInterval 300
        ServerAliveCountMax 2

No guarantees that this will help, but so far it has allowed my connection to stay open longer than it did before.

mikelee3 commented 4 years ago

This is happening to me as well, usually during or after a large file transfer. It is intermittent.

nunofernandes commented 3 years ago

I also have this issue. Sometimes it works, while other times (most of them) it doesn't. I already tried @rscottwatson's suggestion but got the same results.

PenelopeFudd commented 3 years ago

I'm getting this issue too. I already had ServerAliveInterval & ServerAliveCountMax set. We need to see what data ssh is receiving that is corrupt; I imagine it's a debug or an info message from SSM.
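
One way to keep the client-side debug output around for inspection after the failure (a sketch; the host alias and log path are just examples):

    # Write ssh's own debug log to a file so there is something to look at after the crash
    ssh -vvv -E /tmp/ssh-ssm-debug.log i-0123456789abcdef0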

matt-brewster commented 3 years ago

I'm getting this error regularly too. I'm only using SSH (not SCP), and my connections randomly drop after some time with Bad packet length 2704720976.

m-norii commented 3 years ago

I also have this issue, using SSH (not SCP).

Bad packet length 840196936.
ssh_dispatch_run_fatal: Connection to UNKNOWN port 65535: Connection corrupted

ServerAliveInterval & ServerAliveCountMax also set:

    ServerAliveInterval 60
    ServerAliveCountMax 5

taherbs commented 3 years ago

I'm experiencing the same behavior indirectly when using git, trying to clone a very large repository:

$ git fetch origin develop --depth=1
remote: Enumerating objects: 8778, done.
remote: Counting objects: 100% (8778/8778), done.
remote: Compressing objects: 100% (1933/1933), done.
ssh_dispatch_run_fatal: Connection to IP port 22: message authentication code incorrect
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: index-pack failed

taherbs commented 3 years ago

In my case, the issue was related to an unstable VPN connection.

archisgore commented 3 years ago

I get the same with ssh, not necessarily scp.

  1. I have one terminal that's connected that stays connected.
  2. Other terminals don't connect with the error above:
    archisgore@Archiss-MacBook keys % ssh -i <key redacted>.pem ec2-user@i-<instance redacted>  
    Bad packet length 631868568.
    ssh_dispatch_run_fatal: Connection to UNKNOWN port 65535: Connection corrupted

I tried all the above: ClientAliveInterval, ClientAliveCountMax, ServerAliveInterval, ServerAliveCountMax. I restarted amazon-ssm-agent, and I restarted sshd.

What's more, I also opened port 22 using a security group and tried connecting over direct ssh, and get this:

archisgore@Archiss-MacBook keys % ssh -v -i "<key redacted>.pem" ec2-user@ec2-<ip redacted>.us-west-2.compute.amazonaws.com
OpenSSH_8.1p1, LibreSSL 2.7.3
debug1: Reading configuration data /Users/archisgore/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 47: Applying options for *
debug1: Connecting to ec2-52-11-122-217.us-west-2.compute.amazonaws.com port 22.
debug1: Connection established.
debug1: identity file <key redacted>.pem type -1
debug1: identity file <key redacted>.pem-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_8.1
debug1: Remote protocol version 2.0, remote software version OpenSSH_8.0
debug1: match: OpenSSH_8.0 pat OpenSSH* compat 0x04000000
debug1: Authenticating to ec2-<ip redacted>.us-west-2.compute.amazonaws.com:22 as 'ec2-user'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:IfH362yMWxH3EWdQZOTgsyuq0+jjdJvg0Ag+nFQPjvs
debug1: Host 'ec2-<ip redacted>.us-west-2.compute.amazonaws.com' is known and matches the ECDSA host key.
debug1: Found key in /Users/archisgore/.ssh/known_hosts:109
debug1: rekey out after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey in after 134217728 blocks
debug1: Will attempt key: PolyverseDevelopmentKey.pem  explicit
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,ssh-rsa,rsa-sha2-256,rsa-sha2-512,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Next authentication method: publickey
debug1: Trying private key: PolyverseDevelopmentKey.pem
Connection closed by 52.11.122.217 port 22
archisgore@Archiss-MacBook keys % ssh -v -i "<key redacted>.pem" ec2-user@ec2-<ip redacted>.us-west-2.compute.amazonaws.com
OpenSSH_8.1p1, LibreSSL 2.7.3
debug1: Reading configuration data /Users/archisgore/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 47: Applying options for *
debug1: Connecting to ec2-<ip redacted>.us-west-2.compute.amazonaws.com port 22.
debug1: Connection established.
debug1: identity file <key redacted>.pem type -1
debug1: identity file <key redacted>.pem-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_8.1
debug1: Remote protocol version 2.0, remote software version OpenSSH_8.0
debug1: match: OpenSSH_8.0 pat OpenSSH* compat 0x04000000
debug1: Authenticating to ec2-<ip redacted>.us-west-2.compute.amazonaws.com:22 as 'ec2-user'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:IfH362yMWxH3EWdQZOTgsyuq0+jjdJvg0Ag+nFQPjvs
debug1: Host 'ec2-<ip redacted>.us-west-2.compute.amazonaws.com' is known and matches the ECDSA host key.
debug1: Found key in /Users/archisgore/.ssh/known_hosts:109
debug1: rekey out after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey in after 134217728 blocks
debug1: Will attempt key: <key redacted>.pem  explicit
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,ssh-rsa,rsa-sha2-256,rsa-sha2-512,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Next authentication method: publickey
debug1: Trying private key: <key redacted>.pem
Connection closed by <ip redacted> port 22
sruthi-maddineni commented 3 years ago

Thanks for reaching out to us! Please provide the SSM Agent and Session Manager Plugin logs for the failed session to help investigate the issue. You can refer to the documentation below to retrieve the SSM Agent logs: https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-agent-logs.html

Documentation for enabling plugin logs. https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html#install-plugin-configure-logs
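
For anyone gathering these, the agent logs are at the default paths below on a Linux instance (a sketch; paths differ on Windows):

    # On the instance: tail the SSM Agent logs while reproducing the failure
    sudo tail -f /var/log/amazon/ssm/amazon-ssm-agent.log /var/log/amazon/ssm/errors.log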

archisgore commented 3 years ago

Oh hey sorry I take that back. My issue was unrelated. My SSH config was broken, which I verified by opening a direct SSH port. Once I fixed that, SSM worked too. Apologies for the report above.

brainstorm commented 3 years ago

Sorry if I'm going slightly off-topic here, but would AWS consider supporting SCP-like functionality natively in the SSM client/command line? Case in point:

https://twitter.com/braincode/status/1427841930596032513

Our users would migrate fully to SSM if such a file transfer facility were built in, and as "cloud admins" we'd be happier because we wouldn't have to open SSH ports anymore; everything would be handled via the official AWS CLI.

Aagam15 commented 2 years ago

Is there any solution to this issue? I still see it for long running connections.

sandangel commented 1 year ago

Hi, may I ask if there is an update on this issue?

Alexis-D-ff commented 1 year ago

I experienced this on macOS:

Bad packet length XXXXXXX.
ssh_dispatch_run_fatal: Connection to UNKNOWN port 65535: Connection corrupted

To solve it, I installed the latest version of OpenSSH with Homebrew: brew install openssh. Verify your current version with ssh -V; it should be updated after the brew installation.
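
Roughly (a sketch; the Homebrew prefix depends on whether you're on Apple Silicon or Intel):

    brew install openssh
    which ssh      # should point at the Homebrew prefix, e.g. /opt/homebrew/bin/ssh
    ssh -V         # confirm the newly installed version is the one in use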

bryceml commented 1 year ago

I'm still getting this on macOS with the latest version of OpenSSH available from brew:

ssh -V
OpenSSH_9.3p1, OpenSSL 3.1.1 30 May 2023
Bad packet length 3246692971.
ssh_dispatch_run_fatal: Connection to UNKNOWN port 65535: Connection corrupted

When it happened, I wasn't transferring anything large; the connection was just sitting there idle.

xanather commented 10 months ago

This issue is caused by the SSM Maximum Session Timeout in the Session Manager preferences, found at: https://us-east-1.console.aws.amazon.com/systems-manager/session-manager/preferences?region=us-east-1 (change your region).

The SSM service will terminate the stream after this timeout, causing this 'packet length' error in the SSH logs regardless of what is in progress, presumably due to the multi-layering at the application layer on top of TCP.

As far as I can see, this is working as expected. Make sure to also enable SSH keep-alives so that the SSM Idle Session Timeout setting doesn't come into effect either, but nothing can prevent the SSM Maximum Session Timeout setting from killing the session later on.
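
If you want to check what your account is currently configured with, the preferences live in an SSM document (a sketch, assuming the default document name created when preferences are saved in the console):

    # Dump the Session Manager preferences, including maxSessionDuration and idleSessionTimeout
    aws ssm get-document --name SSM-SessionManagerRunShell --query 'Content' --output text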

This can probably be closed unless someone is experiencing this problem before the maximum timeout is reached (by default it's 20 minutes).

Cheers

Tipmethewink commented 10 months ago

Given @xanather's comment, I've just installed the latest version, and a quick test (scp of a large file and dd if=/dev/zero | ssh host dd of=/tmp/zero) elicits no hangups, thanks.
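
A bounded variant of that test, for anyone who wants to repeat it (a sketch; the host alias and transfer size are arbitrary):

    # Push ~500 MB through the SSM tunnel and check the transfer completes cleanly
    dd if=/dev/zero bs=1048576 count=500 | ssh i-0123456789abcdef0 'dd of=/tmp/zero'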

pseudometric commented 8 months ago

> The SSM service will terminate the stream after this timeout, causing this 'packet length' error in the SSH logs regardless of what is in progress, presumably due to the multi-layering at the application layer on top of TCP.

While the original issue reported here may have been due to the SSM timeout, I don't think this explanation makes sense. The bad packet length error happens when the client receives garbage in the protocol stream, specifically where it expects the plaintext packet-length field of the SSH binary packet protocol. The garbage gets interpreted as an integer, usually one that's absurdly large (see this FAQ).

If the SSM session times out, the aws ssm ... command should simply exit, perhaps writing some error messages to stderr. It should in no case emit data to stdout that it didn't receive; similarly, the SSM service should just close its TLS connection, not inject garbage into that connection as user data which the server (sshd) never sent.

I think there has to be some other bug at work here to trigger that error, which should never happen under normal conditions.
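
As a rough illustration of why the reported lengths look so absurd (a sketch; the exact bytes ssh ends up decoding depend on the cipher in use): any four bytes of stray ASCII text, interpreted as the 32-bit packet-length field, come out as a number of the same order of magnitude as the values quoted above.

    printf '2024' | od -An -tx1     # 32 30 32 34
    printf '%d\n' 0x32303234        # 842019380 (in bash/zsh)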

msolters commented 8 months ago

To @pseudometric's point above, the presence of the bad packet length is almost certainly due to this issue: https://github.com/aws/amazon-ssm-agent/issues/358

It seems any kind of ~amazon-ssm-agent~ session-manager-plugin debug statement is currently going to stdout, instead of stderr; this causes the debug statement content to be injected into the SSH channel, which is not expected and yields the corruption error.

The real question is probably -- what debug statement is firing? And would it shed light on the real problem? If the statement is not a fatal error message, it should not cause the connection to be corrupted.
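
One way to see this directly (a sketch; the instance ID is a placeholder) is to run the proxy command by hand, capture its stdout, let it sit for a bit, then interrupt it; any human-readable log lines in the capture are bytes that would have landed in the SSH stream:

    aws ssm start-session --target i-0123456789abcdef0 \
        --document-name AWS-StartSSHSession --parameters 'portNumber=22' > /tmp/ssm-stream.bin
    strings /tmp/ssm-stream.bin | head     # look for stray log text mixed into the stream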

asharpe commented 5 months ago

Thanks to

* @aws for making this possible in the first place

* @edwdev for reporting this issue as you saw it - arguably the most important contribution!

* @pseudometric for calling out some important detail around the issue that AWS are not interested in acknowledging

* whoever else I might have missed that has raised this with their AWS rep (as I have) and received platitudes and no action in response

The remainder is addressed squarely to AWS...

There is no reliable reproduction of this issue thus far, so "I updated and it's fixed" is not validation of a fix. This kind of issue MUST (https://datatracker.ietf.org/doc/html/rfc2119) be understood and solved at the source.

This issue has plagued AWS customers for quite some time now, and most have found a workaround. Unreliable behaviour of computers is not a thing most professionals want to deal with, especially if we're paying. If this method is not fit for purpose, please retire it, or at minimum state as much in your documentation.

AWS staff have surely had this issue when attempting to use this transport method, so it's clear you either know about it and avoid it, or you just retry until you succeed. Neither is an appropriate response to paying customers.

Due to the intermittency of this issue, it is a non-trivial problem to solve; however, the only people with appropriate access to do so are AWS employees. This is NOT a problem with SSH; this is clearly a problem with your ssm-agent.

AWS, you alone have the means to find and fix this!

Thanks in advance, A

xanather commented 5 months ago

> Thanks to
>
> * @aws for making this possible in the first place
>
> * @edwdev for reporting this issue as you saw it - arguably the most important contribution!
>
> * @pseudometric for calling out some important detail around the issue that AWS are not interested in acknowledging
>
> * whoever else I might have missed that has raised this with their AWS rep (as I have) and received platitudes and no action in response
>
> The remainder is addressed squarely to AWS...
>
> There is no reliable reproduction of this issue thus far, so "I updated and it's fixed" is not validation of a fix. This kind of issue MUST (https://datatracker.ietf.org/doc/html/rfc2119) be understood and solved at the source.
>
> This issue has plagued AWS customers for quite some time now, and most have found a workaround. Unreliable behaviour of computers is not a thing most professionals want to deal with, especially if we're paying. If this method is not fit for purpose, please retire it, or at minimum state as much in your documentation.
>
> AWS staff have surely had this issue when attempting to use this transport method, so it's clear you either know about it and avoid it, or you just retry until you succeed. Neither is an appropriate response to paying customers.
>
> Due to the intermittency of this issue, it is a non-trivial problem to solve; however, the only people with appropriate access to do so are AWS employees. This is NOT a problem with SSH; this is clearly a problem with your ssm-agent.
>
> AWS, you alone have the means to find and fix this!
>
> Thanks in advance, A

I'm not defending AWS for their lack of support here. But take a look at my comment further above. You can make it consistent by enabling SSH probing and setting the SSM timeouts. It is caused by the SSM session timing OUT. The error message is a different issue. This has worked reliably since I added my comment 5 months ago.

asharpe commented 5 months ago

Fair call, I will attempt to replicate (as AWS should), but I hold that an intermittent issue is not solved by magical updates without explanation.

Edit:

@xanather a couple of questions to help me know what I'm looking for, if you will...

I'm attempting to replicate by transferring a large file (ubuntu-budgie-22.04.1-desktop-amd64.iso) multiple times, and my expectation is that it will fail at some unknown point, after some unknown number of transfers, not related to any obvious timeout (unless you can guide me better?). To be clear, some number of transfers will succeed, and at some point it would be silly not to concede that it "just works", but I'm not sure where that mark should be!

For some context: I found a workaround some time ago, and no timeouts have been changed at any point; the value there is currently 20 minutes. I've had many a transfer succeed, and also some random failures with the symptoms presented in the OP, sometimes well short of "the timeout", and some transfers that survive much longer. I'm hopeful that the issue @msolters has linked (#358) may help here, but there is a component in here that I don't believe is open source, and it's been more than long enough for AWS to take this seriously for paying customers.
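
For anyone who wants to repeat the replication attempt, the loop is roughly this (a sketch; the host alias and destination path are placeholders):

    # Repeat the transfer until it fails, counting successful runs
    n=0
    while scp ubuntu-budgie-22.04.1-desktop-amd64.iso i-0123456789abcdef0:/tmp/; do
      n=$((n + 1)); echo "transfer $n OK"
    done
    echo "failed after $n successful transfers"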

asharpe commented 5 months ago

So I attempted to reproduce the original issue using scp in a loop and have not yet been able to. This morning, however, an idle SSH connection suffered from

$ Bad packet length 247184078.
ssh_dispatch_run_fatal: Connection to UNKNOWN port 65535: Connection corrupted

So while this doesn't strictly fit the issue reported in the OP, it matches the symptoms. FWIW, my SSH configuration for this connection includes

   ServerAliveInterval 30

So I would expect the SSH session to stay alive and not hit a timeout on the AWS side. I will try and replicate this again, but due to the intermittent nature I'm not expecting anything conclusive.

asharpe commented 4 months ago

aws/session-manager-plugin#94 appears to address the issue with SSH reporting a bad packet length, at least in some cases. The test applied was to intentionally shut down an EC2 instance while running a simple loop printing the date via SSH. More rigorous testing would be required to confirm this is a solution, but it looks promising.

Thanks to @msolters for linking up #358

msolters commented 4 months ago

As stated earlier, the bad packet length error occurs because there is garbage in the SSH protocol stream. The question then becomes: in all the cases where we see that message, how did garbage come to be in the SSH protocol stream?

We certainly know that one cause is because amazon-ssm-agent is writing debug logs into stdout, which is where the protocol stream data is communicated to ssh(1). I can confirm this problem presents in all manner of situations. I have seen it happen in idle connections as mentioned just above. I have seen it happen in brand new connections. Any situation that would cause amazon-ssm-agent to emit a debug log will trigger this use case.

Possibly, SSM sessions timing out trigger debug log statements in amazon-ssm-agent, and may or may not be causally related to garbage in the protocol stream by that mechanism. I have not verified this hypothesis personally.

What I can confirm is that if you recompile amazon-ssm-agent with all debug log statements sent to stderr instead of stdout (for example with this MR), then this failure mode completely stops. I maintain and release an amazon-ssm-agent RPM with such changes built in to provide stable SSH-over-SSM for our internal environments.

The quiet mode MR looks to be a more robust implementation of that same fundamental fix -- get the garbage out of the protocol stream by moving it out of stdout. In this case, we just provide some additional runtime control over that behaviour.

I would love for this fix or some variant to be upstreamed! We should not have to maintain forks of amazon-ssm-agent, packaging and releasing it ourselves, to provide stable SSH-over-SSM.

NuwanUdara commented 1 month ago

We are encountering a random issue when using SSM proxy to SSH into EC2 instances in our CI/CD pipelines. The command used is:

ssh -i "$pem_file" ubuntu@$instance_id \
    -o ProxyCommand="aws ssm start-session --target $instance_id --document-name AWS-StartSSHSession --parameters 'portNumber=22' --region $region" \
    -o StrictHostKeyChecking=no \
    "$command"

The error message we occasionally receive is: "bad packet length."

Key details:

Has anyone else faced this issue? It's quite frustrating to deal with in a pipeline context. Any insights or solutions would be appreciated!
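
One pragmatic mitigation in a pipeline, given the intermittent nature of the failure, is to retry the step (a sketch; retry count and delay are arbitrary):

    for attempt in 1 2 3; do
      ssh -i "$pem_file" ubuntu@"$instance_id" \
          -o ProxyCommand="aws ssm start-session --target $instance_id --document-name AWS-StartSSHSession --parameters 'portNumber=22' --region $region" \
          -o StrictHostKeyChecking=no \
          "$command" && break
      echo "ssh attempt $attempt failed; retrying..." >&2
      sleep 5
    done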