Open asfimport opened 2 years ago
David Li / @lidavidm: This seems pretty similar to ARROW-7594 - does that sound right to you?
Carl Boettiger / @cboettig: I think so, thanks for sharing the link!
Really, though, it depends on how it's implemented! For example, a comment there notes other libraries in R that can take URLs, but some of these download the data to /tmp first, which isn't really the same thing. I see you already have some nice discussion over in that thread on implementation – it's a bit over my head, but I think it would be worth comparing notes on how this is already done in other major, widely used projects like duckdb, fsspec, and gdal?
David Li / @lidavidm: I think the comment there is only pointing out that there are analogous things in R/Python; we would not implement it on top of those. If you have ideas or experience with other systems, it would be much appreciated in the other thread. I'm going to leave this one open for any R-specific things that might need to happen and link them.
Zac Davies: This would be great. As far as I can tell, this is required to access pre-signed S3 URLs as a Dataset without needing to download/sync all the files so that the Dataset is on the local filesystem.
Thanks for such an amazing project. I've been entirely blown away by the S3 Filesystem access in the latest release, and I'm excited to see other backends like Azure being discussed in the issues. As you know, many HTTPS clients also permit range requests, meaning (I think) that it should be possible to access public data (Parquet, CSV files) over generic HTTPS connections too.
As you probably know, duckdb already has support for HTTPS-based remote file access, e.g. https://github.com/duckdb/duckdb/blob/master/test/sql/copy/parquet/test_parquet_remote.test
(though it is not available out-of-the-box in the R client there either).
It would be wonderful to have similar remote filesystem access in arrow that could work over HTTPS like that. (I gather that on the Python side, fsspec already gives access to a wide range of such abstractions, but we're more limited in R so far, except for geospatial data, where bindings to GDAL mean we can access GDAL's rather amazing virtual file systems over HTTPS, S3, FTP, etc.: https://gdal.org/user/virtual_file_systems.html – a nice array-data complement to the more database-oriented workflow of arrow...).
Thanks for considering!
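To make the range-request point in the description above a bit more concrete, here is a minimal Python sketch (Python rather than R, purely for illustration) of how a reader could fetch just a Parquet file's footer metadata over HTTPS without downloading the whole file. It relies on the Parquet layout, in which a file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The helper names `http_fetcher`, `bytes_fetcher`, and `read_parquet_footer` are hypothetical; they are not part of arrow, duckdb, or fsspec, and this is not meant to describe how any of those projects implement remote access.

```python
import struct
import urllib.request

PARQUET_MAGIC = b"PAR1"


def http_fetcher(url):
    """Return a function that fetches one byte range of `url` (hypothetical helper)."""
    def fetch(range_spec):
        # range_spec is the part after "bytes=", e.g. "-8" for the last 8 bytes.
        req = urllib.request.Request(url, headers={"Range": f"bytes={range_spec}"})
        with urllib.request.urlopen(req) as resp:
            # 206 Partial Content means the server honored the range;
            # a plain 200 means it ignored the header and sent everything.
            if resp.status != 206:
                raise RuntimeError("server did not honor the Range request")
            return resp.read()
    return fetch


def bytes_fetcher(data):
    """In-memory stand-in for http_fetcher; handles suffix ranges ("-N") only."""
    def fetch(range_spec):
        n = int(range_spec[1:])
        return data[-n:]
    return fetch


def read_parquet_footer(fetch):
    """Fetch only the serialized footer metadata, using two suffix-range requests."""
    # Request 1: the last 8 bytes hold the footer length and the trailing magic.
    tail = fetch("-8")
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file (bad trailing magic)")
    (footer_len,) = struct.unpack("<I", tail[:4])
    # Request 2: footer plus the 8-byte tail; slice off the tail.
    return fetch(f"-{footer_len + 8}")[:footer_len]


# Offline demonstration with a fake file that only mimics the Parquet trailer layout.
fake = b"PAR1" + b"row-group-bytes" + b"FOOTER" + struct.pack("<I", 6) + b"PAR1"
footer = read_parquet_footer(bytes_fetcher(fake))
```

Against a real server, `read_parquet_footer(http_fetcher(url))` would transfer only the footer and the 8-byte trailer, provided the server supports range requests. The footer would then tell a reader which further byte ranges (row groups, column chunks) to request, which is what would make a Dataset-style scan over plain HTTPS feasible in principle.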
Reporter: Carl Boettiger / @cboettig
Note: This issue was originally created as ARROW-14998. Please see the migration documentation for further details.