apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] Support for HTTPS Filesystem access #18980

Open asfimport opened 2 years ago

asfimport commented 2 years ago

Thanks for such an amazing project. I've been entirely blown away by the S3 Filesystem access in the latest release, and I'm excited to see other backends like Azure being discussed in the issues. As you know, many HTTPS servers also honor range requests, meaning (I think) that it should be possible to access public data (Parquet, CSV files) over generic HTTPS connections too.
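To make the range-request idea concrete, here is a minimal Python sketch using only the standard library. It is not how arrow implements or would implement this; the helper names `build_range_request` and `fetch_range` and the example URL are hypothetical, chosen for illustration. It shows a client asking for a byte range and checking that the server answered with `206 Partial Content` rather than sending the whole file.

```python
import urllib.request


def build_range_request(url, start, end):
    """Build a GET request for bytes start..end (inclusive) of a remote file.

    Hypothetical helper for illustration; not part of arrow or any library.
    """
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    req = urllib.request.Request(url)
    # HTTP byte ranges are inclusive on both ends.
    req.add_header("Range", f"bytes={start}-{end}")
    return req


def fetch_range(url, start, end, timeout=30):
    """Fetch only the requested bytes, failing if the server ignores Range."""
    req = build_range_request(url, start, end)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # 206 means the range was honored; 200 means the server ignored
        # the header and is sending the entire file instead.
        if resp.status != 206:
            raise RuntimeError("server did not honor the Range header")
        return resp.read()


# Building the request does not touch the network.
# The URL below is a placeholder, not a real dataset.
req = build_range_request("https://example.com/data.parquet", 0, 7)
```

A reader built on this pattern only works against servers that support ranges, which is why checking for the 206 status matters.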

As you probably know, duckdb already has support for HTTPS-based remote file access, e.g. https://github.com/duckdb/duckdb/blob/master/test/sql/copy/parquet/test_parquet_remote.test (though it is not available out of the box in the R client there either).

It would be wonderful to have similar remote filesystem access in arrow that works over HTTPS. (I gather that on the Python side, fsspec already gives access to a wide range of such abstractions, but we're more limited in R so far. The exception is geospatial data, where bindings to GDAL give access to GDAL's rather amazing virtual file systems over HTTPS, S3, FTP, etc., https://gdal.org/user/virtual_file_systems.html, a nice array-data complement to the more database-oriented workflow of arrow.)

Thanks for considering!

Reporter: Carl Boettiger / @cboettig

Note: This issue was originally created as ARROW-14998. Please see the migration documentation for further details.

asfimport commented 2 years ago

David Li / @lidavidm: This seems pretty similar to ARROW-7594 - does that sound right to you?

asfimport commented 2 years ago

Carl Boettiger / @cboettig: I think so, thanks for sharing the link!

Really, though, it depends on how it's implemented! For example, a comment there notes other libraries in R that can take URLs, but some of these download the data to /tmp first, which isn't really the same thing. I see you already have some nice discussion of implementation in that thread. It's a bit over my head, but I think it would be worth comparing notes on how this is already done in other widely used projects like duckdb, fsspec, and GDAL.
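The difference between downloading first and reading remotely can be illustrated with Parquet's file layout. An unencrypted Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`, so a reader can fetch the last 8 bytes, then only the footer metadata, and never touch most of the file. The sketch below is an illustration under that assumption, not a description of how arrow, duckdb, fsspec, or GDAL do it; the function names are hypothetical.

```python
import struct

PARQUET_MAGIC = b"PAR1"  # trailing magic of an unencrypted Parquet file
TAIL_SIZE = 8            # 4-byte footer length + 4-byte magic


def footer_length_from_tail(tail):
    """Decode the footer length from the final 8 bytes of a Parquet file."""
    if len(tail) != TAIL_SIZE:
        raise ValueError("expected exactly the last 8 bytes of the file")
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic; not an unencrypted Parquet file")
    (length,) = struct.unpack("<I", tail[:4])
    return length


def footer_byte_range(file_size, footer_length):
    """Return the inclusive (start, end) byte range holding the footer."""
    start = file_size - TAIL_SIZE - footer_length
    end = file_size - TAIL_SIZE - 1
    # The file also begins with a 4-byte magic, so the footer cannot start
    # before offset 4.
    if footer_length <= 0 or start < 4:
        raise ValueError("footer length is inconsistent with the file size")
    return start, end


# Synthetic tail standing in for the result of a ranged read of the last
# 8 bytes of a 1000-byte file whose footer is 100 bytes long.
tail = struct.pack("<I", 100) + PARQUET_MAGIC
start, end = footer_byte_range(1000, footer_length_from_tail(tail))
```

With the footer range in hand, a remote reader would issue one more ranged read for the metadata and then fetch only the column chunks a query needs, which is what separates this approach from copying the whole file to /tmp.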

asfimport commented 2 years ago

David Li / @lidavidm: I think the comment there is only pointing out that there are analogous things in R/Python; we would not implement it on top of those. If you have ideas or experience with other systems, they would be much appreciated in the other thread. I'm going to leave this one open for any R-specific things that might need to happen and link them.

asfimport commented 2 years ago

Zac Davies: This would be great. As far as I can tell, this is required to access pre-signed S3 URLs as a Dataset without having to download/sync all the files so that the Dataset is on the local filesystem.