ParallelSSH / ssh2-python

Python bindings for the libssh2 C library.
https://parallel-ssh.org
GNU Lesser General Public License v2.1

Greatly speed up sftp_write example for non-trivial files #161

Closed JacobCallahan closed 2 years ago

JacobCallahan commented 2 years ago

This change loads the entire file into memory and sends that to the sftp file handler's write. The effect of this is a single call to write instead of many. This greatly speeds up the overall sftp write at the expense of loading the entire file into memory.

In the screenshot below we have two test runs on a 3.8MB file from before the change and two test runs after the change.

[Screenshot from 2021-11-05 09-46-24: transfer timings for the two runs before and after the change]
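The approach in this change can be sketched roughly as below. The helper name `upload_whole_file` is hypothetical, and an in-memory `io.BytesIO` stands in for the SFTP file handle a real ssh2-python example would get from `sftp.open()`, so the sketch runs without a network:

```python
import io
import os
import tempfile

def upload_whole_file(local_path, remote_fh):
    """Read the entire local file into memory, then issue a single write.

    remote_fh stands in for an SFTP file handle; any object with a
    write(bytes) method works for this local demonstration.
    """
    with open(local_path, "rb") as f:
        data = f.read()      # whole file in memory: fast, but risky for
                             # files approaching available RAM
    remote_fh.write(data)    # one write call instead of one per line
    return len(data)

# Local demonstration with a temporary file and an in-memory "remote".
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "payload.txt")
    with open(path, "wb") as f:
        f.write(b"line\n" * 1000)
    remote = io.BytesIO()
    sent = upload_whole_file(path, remote)
    assert remote.getvalue() == b"line\n" * 1000
```

The trade-off discussed in the rest of the thread is exactly the `f.read()` line: one write call, but memory use proportional to file size.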

JacobCallahan commented 2 years ago

Also, I think this is the true resolution of #66 @mflage

mflage commented 2 years ago

Also, I think this is the true resolution of #66 @mflage

I haven't tested, as I've moved on to different ways of implementing this. But wouldn't loading the entire file in memory be an issue if:

1) the file is bigger than the available memory+swap?
2) the file is so big that just the time to read it into memory would be a bottleneck in itself?

Regarding 2. From reading the commit, it seems like this reads the whole file into memory and then connects and starts uploading. So that's basically time wasted.

JacobCallahan commented 2 years ago

I haven't tested, as I've moved on to different ways of implementing this. But wouldn't loading the entire file in memory be an issue if:

  1. the file is bigger than the available memory+swap?

Yes. The person using this example would need to ensure it isn't being used for files large enough to consume all available memory.

  2. the file is so big that just the time to read it into memory would be a bottleneck in itself?

The local read is quick relative to the transfer itself, so in practice it isn't a bottleneck.

Regarding 2. From reading the commit, it seems like this reads the whole file into memory and then connects and starts uploading. So that's basically time wasted.

I'd consider it less wasteful, because it reduces both how long the file handle stays open and how long the connection stays active. That holds not just in principle but in measured performance.

pkittenis commented 2 years ago

Thanks for the interest and PR.

Loading the entire file into memory is not sustainable, nor will it work for all files. A large enough file will eventually exhaust available memory.

There is already a buffering parameter on open. The code should be setting that to a larger buffer size if it wants to read larger sizes. It is a trivial change to the example.

The actual implementation in parallel-ssh already sets a large buffer on open which should be used instead.
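A chunked copy along the lines pkittenis suggests can be sketched as follows; the helper name and the 2MB buffer size are illustrative assumptions, and `io.BytesIO` again stands in for the remote SFTP handle so the sketch runs locally:

```python
import io
import os
import tempfile

CHUNK_SIZE = 2 * 1024 * 1024  # 2MB buffer: few write calls, bounded memory

def upload_chunked(local_path, remote_fh, chunk_size=CHUNK_SIZE):
    """Copy local_path to remote_fh in fixed-size chunks."""
    sent = 0
    with open(local_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:        # empty bytes -> EOF
                break
            remote_fh.write(chunk)
            sent += len(chunk)
    return sent

# Local demonstration: 5MB of random bytes round-tripped through the copy.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "payload.bin")
    payload = os.urandom(5 * 1024 * 1024)
    with open(path, "wb") as f:
        f.write(payload)
    remote = io.BytesIO()
    sent = upload_chunked(path, remote)
```

Unlike the whole-file version, memory use stays at one chunk regardless of file size, which is why it scales to files larger than RAM.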

JacobCallahan commented 2 years ago

Loading the entire file into memory is not sustainable, nor will it work for all files. A large enough file will eventually exhaust available memory.

Correct, and it shouldn't be up to you/ssh2-python to ensure people don't exceed that limit. However, if you don't go with this implementation, then perhaps a note on the performance impact of the current method would help future users.

There is already a buffering parameter on open. The code should be setting that to a larger buffer size if it wants to read larger sizes. It is a trivial change to the example.

Again correct, and again a valid alternative to the changes in this PR. For those concerned about large files, a larger buffer helps when you iterate over the file in chunks of that size. That would still be slower than my version, but more memory-friendly. The current example relies on Python's default line-based iteration, so each write sends roughly one line. When I ran the current example on my test file (3.8MB over 29,888 lines), you can see just how slow the transfer was.
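A local back-of-envelope comparison makes the write-call count concrete; the sizes below are chosen to mimic the shape of that 3.8MB, 29,888-line test file, and no network is involved:

```python
import io

# ~3.8MB of 128-byte lines, mimicking the shape of the test file
data = (b"x" * 127 + b"\n") * 29888

# Line-based iteration: one write per line
line_writes = sum(1 for _ in io.BytesIO(data))

# Chunked iteration with a 2MB buffer
chunk_size = 2 * 1024 * 1024
buf = io.BytesIO(data)
chunk_writes = 0
while buf.read(chunk_size):
    chunk_writes += 1

print(line_writes, chunk_writes)  # 29888 vs 2
```

With each SFTP write carrying per-call overhead, roughly 30,000 calls versus 2 is the gap the screenshot above reflects.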

The actual implementation in parallel-ssh already sets a large buffer on open which should be used instead.

I'm not sure how many people use ssh2-python directly over parallel-ssh (like myself) that wouldn't see that version, but that would again be an improvement over the current example.

Basically, I'm fine with any way you'd like to proceed, as long as the example is updated to make it clear that small buffers result in a drastic performance drop. I love this library and actively push it over paramiko :D [1]. I would just hate for someone to use this example on a file similar to my test file and conclude that it is worse than the transfer speeds they would see with paramiko or other alternative libraries.

[1] - https://www.youtube.com/watch?v=65wMDwsqF6w

pkittenis commented 2 years ago

Use parallel-ssh.

If you really really want to use this library, prepare to write a whole lot of code, and find ways to improve performance beyond what is offered by libssh2 alone.

Will update example separately. This is not the right approach and will create other bugs.