ParallelSSH / parallel-ssh

Asynchronous parallel SSH client library.
https://parallel-ssh.org
GNU Lesser General Public License v2.1

scp_send silently fails to send all data for some targets (see bug #226) #337

Closed wynnw closed 2 years ago

wynnw commented 2 years ago

Describe the bug

Same thing as the closed bug #226: scp operations are not transferring the whole file at times. That bug was closed as can't-reproduce. I realize this is really hard to deal with, as it appears to be a timing issue with gevent greenlets and non-blocking I/O. I can't really debug it, since attaching a debugger changes the timing and then the problem never manifests.

To Reproduce

See the code sample under Additional information below.

Expected behavior

Destination files match source files.

Actual behaviour

At least one target server's destination file is missing a certain number of bytes at the end. For a file of 102912000 bytes I typically have one or more servers that only receive 102891520 bytes, and that smaller number is very consistent.


Additional information

I've run into this a bunch today as I've begun using this library. It's hard to reproduce every time, but I've got a path where sending 100 MB files to two hosts is missing the last few bytes on just one of them about 50% of the time. Still investigating, but this is a real issue.

If I add a sleep(2) at the end of the data loop in _scp_send I never have the problem, so that's interesting.
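
To illustrate the shape of what I think is happening (a hypothetical sketch, not the actual _scp_send source):

    # Hypothetical sketch of the failure mode, not the actual _scp_send
    # source: if the channel is torn down as soon as the local writes
    # return, buffered data can be lost before it reaches the peer.
    from gevent import sleep

    def send_file(channel, fileobj, chunk_size=2048):
        for chunk in iter(lambda: fileobj.read(chunk_size), b""):
            while chunk:
                written = channel.write(chunk)  # hypothetical non-blocking write
                if written > 0:
                    chunk = chunk[written:]
                else:
                    sleep(0)  # yield to other greenlets on would-block
        sleep(2)  # the workaround: give the transport time to drain
        channel.close()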

This is against openssh-server 7.4p1-21 running on Amazon Linux 2, using parallel-ssh==2.8.0, ssh-python==0.9.0 and ssh2-python==0.27.0.

The virtualenv is using Python 3.10.1. The code I'm running is just:

    def put(self, local, remote=None, preserve_mode=True):  # pylint: disable=arguments-differ
        # use a parallel-ssh group to do this
        from pssh.clients import ParallelSSHClient
        from gevent import joinall
        from logging import DEBUG
        from pssh.utils import logger, enable_logger
        import os  # module-level in the real code; added so the snippet is self-contained

        enable_logger(logger, DEBUG)

        if not remote:
            remote = "/tmp/"
        if not local:
            raise ValueError("You must specify a filepath")
        if local[0] != "/":
            local = os.path.join(os.getcwd(), local)
        if not os.path.exists(local):
            raise ValueError(f"The specified filepath: {local} does not exist")

        hosts = []
        for cnxn in self:
            hosts.append(cnxn.host)
        pclient = ParallelSSHClient(hosts)
        ops = pclient.scp_send(local, remote)
        joinall(ops, raise_error=False)

This is using a file that is 99 MB and sending to two servers; the code is integrated with some fabric2 code.

In further testing, I'm able to reproduce the problem with a 10561-byte text file and 4 servers, and quite often one or two of the servers gets a 0-byte file. If I add the gevent sleep call at the end of the _scp_send data/eagain_write loop, like sleep(2), I never see the problem. It seems like a channel can get closed before all the data is flushed out?
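
For reference, here's roughly how I'm checking for the truncated transfers (a sketch; the hosts and paths are made up):

    # Sketch: after scp_send, compare remote byte counts against the local
    # file to catch truncated transfers. Hosts and paths are illustrative.
    import os
    from gevent import joinall
    from pssh.clients import ParallelSSHClient

    local = "/tmp/payload.txt"
    hosts = ["host1", "host2", "host3", "host4"]
    client = ParallelSSHClient(hosts)
    joinall(client.scp_send(local, local), raise_error=True)

    expected = os.path.getsize(local)
    for host_out in client.run_command(f"wc -c < {local}"):
        size = int(list(host_out.stdout)[0])
        if size != expected:
            print(f"{host_out.host}: truncated ({size} != {expected} bytes)")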

This is an awesome library, you've done great work. I can't really use it though until I get this problem figured out as it happens at least 50% of the time.

wynnw commented 2 years ago

As a side note, I haven't been able to reproduce these errors with asyncssh 2.9.0, which uses asyncio instead of gevent.

JonasToth commented 2 years ago

I have seen a similar issue when uploading a tar file to many devices. For some of them the files were consistently corrupted, and all the bad files hashed to the same value.

pkittenis commented 2 years ago

Hi there,

Thanks for the interest and report. Since there have been multiple reports of this issue, I do believe there is one. Having gone through the code again, I can see where there is potential for a race condition, which sounds like the behaviour here. The difference between the written and total bytes is exactly 2k as well, which is the internal buffer size of libssh2. That can't be an accident.

I am curious whether you see the same behaviour with copy_file, the SFTP copy. I'd expect not. And you cannot have used SCP with asyncssh, since it does not support it; it uses SFTP.
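
For reference, the SFTP path I mean is copy_file in place of scp_send, along these lines (hosts and paths illustrative):

    # Sketch: same transfer via SFTP using parallel-ssh's copy_file.
    # Hosts and paths are illustrative.
    from gevent import joinall
    from pssh.clients import ParallelSSHClient

    client = ParallelSSHClient(["host1", "host2"])
    greenlets = client.copy_file("/tmp/payload.bin", "/tmp/payload.bin")
    joinall(greenlets, raise_error=True)  # raises if any copy fails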

Another question: are you able to reproduce this against a local SSH server, or only remote? I still need to be able to reproduce it in a test to confirm a fix works; that would help in doing so. The existing tests do not use such a large file size; I will try again with 100MB.

wynnw commented 2 years ago

I can't use SFTP with my systems; I'm not sure why and I haven't deep-dived on that. So I am using SCP (I was before with paramiko), and now I'm using SCP with asyncssh, as it does support it: https://asyncssh.readthedocs.io/en/latest/index.html#scp-client
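
For comparison, this is roughly the asyncssh SCP path I'm using (host and paths illustrative):

    # Sketch: SCP via asyncssh's asyncio-based client.
    # Host and paths are illustrative.
    import asyncio
    import asyncssh

    async def main():
        async with asyncssh.connect("host1") as conn:
            # asyncssh.scp speaks the SCP protocol directly
            await asyncssh.scp("/tmp/payload.bin", (conn, "/tmp/payload.bin"))

    asyncio.run(main())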

wynnw commented 2 years ago

I've been focused on this inside AWS. It reproduces very consistently for me within the same EC2 region and when going across EC2 regions, so that's more overhead than a local SSH server, of course. I haven't tried it on my own home LAN.

JonasToth commented 2 years ago

> Thanks for the interest and report. Since there have been multiple reports of this issue, I do believe there is one. Having gone through the code again, I can see where there is potential for a race condition, which sounds like the behaviour here. The difference between the written and total bytes is exactly 2k as well, which is the internal buffer size of libssh2. That can't be an accident.

Yes, that is approximately the number of bytes missing in my case. I uploaded ~2MB tars and they were damaged only at the end; I verified that with a hexdump of the files. And the missing bytes weren't that many; 2K seems about right.

The remote side in my production system uses Dropbear over a mobile connection. Maybe such an environment is more prone to the race condition as well?

We use toxiproxy for some tests; maybe its rate-limiting helps with reproduction.

> I am curious whether you see the same behaviour with copy_file, the SFTP copy. I'd expect not. And you cannot have used SCP with asyncssh, since it does not support it; it uses SFTP.

AFAIK our Dropbear does not support SFTP properly (there were issues). That's why we used SCP.

> Another question: are you able to reproduce this against a local SSH server, or only remote? I still need to be able to reproduce it in a test to confirm a fix works; that would help in doing so. The existing tests do not use such a large file size; I will try again with 100MB.

Maybe you can play around with timings/bandwidth using toxiproxy (https://github.com/shopify/toxiproxy). Local setups might be "too good" under normal circumstances.
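
For example, something along these lines against toxiproxy's HTTP API (a sketch; 8474 is toxiproxy's default API port, and the proxy name, ports and rate are made up):

    # Sketch: route a local sshd through toxiproxy and throttle it, to make
    # the race easier to hit. Proxy name, ports and rate are illustrative.
    import requests

    requests.post("http://localhost:8474/proxies", json={
        "name": "slow_ssh",
        "listen": "127.0.0.1:2222",
        "upstream": "127.0.0.1:22",
    }).raise_for_status()

    requests.post("http://localhost:8474/proxies/slow_ssh/toxics", json={
        "type": "bandwidth",
        "attributes": {"rate": 100},  # KB/s
    }).raise_for_status()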

pkittenis commented 2 years ago

Resolved in 2.9.0. Really resolved in 2.9.1.

It was triggered by the client going out of scope, which caused a race condition if the client is not kept alive.
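
In other words, on affected versions the trigger looked like this: if the client object is not kept alive until the greenlets are joined, the transfer could be cut short. A minimal sketch of the pattern:

    # Sketch: the client must stay referenced until the greenlets are joined.
    from gevent import joinall
    from pssh.clients import ParallelSSHClient

    def send_risky(hosts, local, remote):
        # client can be garbage-collected as soon as this returns,
        # potentially while transfers are still flushing
        return ParallelSSHClient(hosts).scp_send(local, remote)

    def send_safe(hosts, local, remote):
        client = ParallelSSHClient(hosts)
        greenlets = client.scp_send(local, remote)
        joinall(greenlets, raise_error=True)  # client stays alive until done
        return client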

scp_send should also be a lot faster now, 10x or more.

Please reply on this issue if still having problems, will look into it asap.

wynnw commented 2 years ago

Nice fix - thanks!