Open asfimport opened 4 years ago
Antoine Pitrou / @pitrou:
On the C++ side they will definitely stay more low-level. On the Python side, I have no preference. I suppose it could be useful to write open_csv("s3://...").
Krisztian Szucs / @kszucs: Supporting remote URIs sounds like a nice feature.
Neal Richardson / @nealrichardson: FTR I'm doing this in R in ARROW-9854, in case you want to see what this looks like in practice (https://github.com/apache/arrow/pull/8058)
Hendrik Makait: Unless someone is already working on this, I'd love to get started on putting together a PR for this. Since it will be my first contribution, I might ask for guidance in the process. As a first question: Should I split this into multiple PRs per format (i.e. one PR for csv, feather, json, respectively) or combine them into one larger PR?
Joris Van den Bossche / @jorisvandenbossche: Hi [~hendrik.makait] I don't think someone started on this, so a contribution would be very welcome! And happy to give some guidance where needed.
> As a first question: Should I split this into multiple PRs per format (i.e. one PR for csv, feather, json, respectively) or combine them into one larger PR?
I would in any case start with a single format, and opening a PR for that, check if the approach is good, etc. Then we can still decide whether we want to add it for the other formats in the same PR or as separate PRs.
Antoine Pitrou / @pitrou: cc @milesgranger, @AlenkaF. Nothing earth-shattering but perhaps a nice usability feature?
In the parquet IO functions, we support reading/writing files from non-local filesystems directly (in addition to passing a buffer) by:

- passing a URI (e.g. pq.read_parquet("s3://bucket/data.parquet"))
- specifying the filesystem keyword (e.g. pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...)))

On the other hand, for other file formats such as feather, we only support local files or buffers. So for those, you need to do something more manual (I suppose this works?):
So I think the question comes up: do we want to extend this filesystem support to other file formats (feather, csv, json) and make this more uniform across pyarrow, or do we prefer to keep the plain readers more low-level (and people can use the datasets API for more convenience)?
cc @pitrou @kszucs
Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Miles Granger / @milesgranger
Note: This issue was originally created as ARROW-9938. Please see the migration documentation for further details.