dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0
2.65k stars 176 forks source link

Feat/1730 extend filesystem sftp #1769

Closed donotpush closed 2 months ago

donotpush commented 2 months ago

Description

Extend filesystem source and destination to work with sftp.

As a user, I want to be able to load data from and to sftp server. Probably it can be done already with fsspec

A few implementation hints. Look in fsspec_filesystem:

  1. we use last modified / created timestamp for incremental loading. use MTIME_DISPATCH to define a mapping (I hope sftp fsspec has it - upload time should be available)
  2. check FilesystemConfiguration on how we add credentials for particular filesystems (based on protocol)

tests:

Related Issues

netlify[bot] commented 2 months ago

Deploy Preview for dlt-hub-docs ready!

Name Link
Latest commit 7a43d3769b7d62dac6810d806e109d9eba77216c
Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/66e54a2f7450020008f6e6a8
Deploy Preview https://deploy-preview-1769--dlt-hub-docs.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

donotpush commented 2 months ago

high quality PR! code and sftp specific tests LGTM. what we need to do is to integrate those tests with other filesystem tests. I think we won't need a separate github ci workflow. we'll also reuse a lot of existing general purpose tests that ie. test the glob funtionality

please try the following:

  1. in tests/.dlt/config.toml add bucket url to the test sftp server
  2. add sftp to ALL_FILESYSTEM_DRIVERS and WITH_GDRIVE_BUCKETS (tests/load/utils)

now all tests that test buckets directly will see sftp.

ie.

@pytest.mark.parametrize("write_disposition", ("replace", "append", "merge"))
@pytest.mark.parametrize("layout", TEST_FILE_LAYOUTS)
def test_successful_load(write_disposition: str, layout: str, with_gdrive_buckets_env: str) -> None:

there are plenty of tests that run filesystem as destination. those are also enabled by the above.

you can skip the whole module (I'm referring to test_filesystem_sftp). see how we skip inactive destinations here:

def skip_if_not_active(destination: str) -> None:

and do the same for filesystems.

when this is done you can merge your test_destination_sftp CI workflow with test_local_destinations.

just enable sftp via

ALL_FILESYSTEM_DRIVERS: "[\"memory\", \"file\"]"

start the sftp server and stuff should run

@rudolfix thanks for the review! setting up the sftp server was fun. I've addressed all the requested changes - could you take another look and let me know if I missed anything?