Open asfimport opened 4 years ago
Antoine Pitrou / @pitrou:
On the C++ side they will definitely stay more low-level. On the Python side, I have no preference. I suppose it could be useful to write open_csv("s3://...").
Krisztian Szucs / @kszucs: Supporting remote URIs sounds like a nice feature.
Neal Richardson / @nealrichardson: FTR I'm doing this in R in ARROW-9854, in case you want to see what this looks like in practice (https://github.com/apache/arrow/pull/8058)
Hendrik Makait: Unless someone is already working on this, I'd love to get started on putting together a PR for this. Since it will be my first contribution, I might ask for guidance in the process. As a first question: Should I split this into multiple PRs per format (i.e. one PR for csv, feather, json, respectively) or combine them into one larger PR?
Joris Van den Bossche / @jorisvandenbossche: Hi [~hendrik.makait] I don't think someone started on this, so a contribution would be very welcome! And happy to give some guidance where needed.
> As a first question: Should I split this into multiple PRs per format (i.e. one PR for csv, feather, json, respectively) or combine them into one larger PR?
I would in any case start with a single format, and opening a PR for that, check if the approach is good, etc. Then we can still decide whether we want to add it for the other formats in the same PR or as separate PRs.
Antoine Pitrou / @pitrou: cc @milesgranger, @AlenkaF. Nothing earth-shattering but perhaps a nice usability feature?
In the parquet IO functions, we support reading/writing files from non-local filesystems directly (in addition to passing a buffer) by:

- passing a URI (e.g. pq.read_parquet("s3://bucket/data.parquet"))
- specifying the filesystem keyword (e.g. pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...)))

On the other hand, for other file formats such as feather, we only support local files or buffers. So for those, you need to do something more manual (I suppose this works?):
So I think the question comes up: do we want to extend this filesystem support to other file formats (feather, csv, json) and make this more uniform across pyarrow, or do we prefer to keep the plain readers more low-level (and people can use the datasets API for more convenience)?
cc @pitrou @kszucs
Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Miles Granger / @milesgranger
Note: This issue was originally created as ARROW-9938. Please see the migration documentation for further details.