apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
37.32k stars 14.34k forks source link

SFTPHook cannot download large files #16286

Closed randr97 closed 2 years ago

randr97 commented 3 years ago

Apache Airflow version: 2.0.1 (Should apply to previous versions and later ones as well)

Environment:

What happened: In airflow.providers.sftp.hooks.sftp.SFTPHook, when we try to download a file greater than 18 MiB, the download keeps happening forever and never gets completed.

What you expected to happen: The download should have completed in seconds but did not. A file less than 18MiB gets downloaded in few seconds. Looks like this is an underlying issue in the paramiko library. Attaching a bunch of issues on paramiko's git and stackoverflow -

  1. https://github.com/paramiko/paramiko/issues/926
  2. https://stackoverflow.com/questions/12486623/paramiko-fails-to-download-large-files-1gb
  3. https://stackoverflow.com/questions/3459071/paramiko-sftp-hangs-on-get

How to reproduce it:

  1. Create a large file size > 18MiB
  2. Dump it in an SFTP server
  3. Use airflow SFTPHook to download it
  4. You should be able to see the task run forever

Anything else we need to know: I after exploring found a solution to the problem and have fixed it in my project but if the community can dive deep it would be great. Link to the solution is - https://gist.github.com/vznncv/cb454c21d901438cc228916fbe6f070f This gist is by @vznncv and credits to him for coming up with a solution.

potiuk commented 3 years ago

Maybe we should consider rewriting the hook using Twisted? It sems all APIs needed are there https://twistedmatrix.com/documents/current/api/twisted.conch.ssh.filetransfer.FileTransferClient.html, Twisted seems to be a bit more modern with the async approach and the bug in paramiko seems to be around since 2017.

randr97 commented 3 years ago

@potiuk that would be definitely a good place to start. How do you want to go about this? I would definitely want to contribute to this.

potiuk commented 3 years ago

@potiuk that would be definitely a good place to start. How do you want to go about this? I would definitely want to contribute to this.

I do not think there is anything special needed. Just rewrite the Hook/Operator, starting with replacing the twisted library in setup.py deps instead of paramiko/sftp. Then we can relase a major release of SFTP provider with it. That's pretty much it :)

randr97 commented 3 years ago

@potiuk will take some time out and start working on the PR. Thanks

malthe commented 3 years ago

@potiuk what about https://github.com/ParallelSSH/parallel-ssh#sftp – ?

See also https://github.com/ParallelSSH/parallel-ssh#why-this-library:

Because other options are either immature, unstable, lacking in performance or all of the aforementioned.

Certain other self-proclaimed leading Python SSH libraries leave a lot to be desired from a performance and stability point of view, as well as suffering from a lack of maintenance with hundreds of open issues, unresolved pull requests and inherent design flaws.

ashb commented 3 years ago

I'd probably caution against introducing twisted -- with Python 3.6/3.7+ the built in asyncio can do most of twisted without the need for a large external dependency.

potiuk commented 3 years ago

Agree with @ashb . Parallel-ssh looks better as well. Seems rather popular from the number of forks, has very little number of dependencies (gevent, python-ssh2). Not super active (but what activity you'd expect from such library?).

uranusjr commented 3 years ago

Note that parallel-ssh is not small either (mainly because it ships its own copy of libssh and libssh2).

$ pip wheel -w tw twisted
...
$ pip wheel -w ps parallel-ssh
...
$ du -s tw ps
4336    tw
18132   ps

Parallel-ssh is over four times larger than Twisted if dependencies are considered.

potiuk commented 3 years ago

Hmm :( tough choice :)

malthe commented 3 years ago

@uranusjr you can build it from source, against system-provided libraries.

But there's something wrong in the packaging because when I look at the installed files, there are lots of .c source files:

image
randr97 commented 3 years ago

I think this solution Link can be a fix without really changing any libraries... Have been using it for sometime and haven't gotten into any issues till now! What do you guys think?

eladkal commented 2 years ago

Have been using it for sometime and haven't gotten into any issues till now! What do you guys think?

If you have a proposed solution - it's better to open a PR and ask for code review

potiuk commented 2 years ago

Seems that the issue might be closed in the upcoming sftp provider.

agreenburg commented 1 year ago

I'm still encountering this issue with the new provider (which uses paramiko).

I see some workarounds in this thread but the issue is still open: https://github.com/paramiko/paramiko/issues/926