asfimport opened 2 years ago
Antoine Pitrou / @pitrou: Which "optimization" is that?
Jacob Wujciak / @assignUser: Things like readahead and metadata caching; cc @lidavidm for details.
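For reference, s3fs exposes its readahead behavior through fsspec's caching options. A minimal sketch of what that looks like (the bucket/key path is a placeholder, and the exact defaults may vary across s3fs versions):

```python
import s3fs

# s3fs buffers reads through fsspec's caching layer; the "readahead"
# cache fetches a block past the requested range, so many small
# sequential reads are served from memory rather than as separate
# HTTP requests.
fs = s3fs.S3FileSystem(
    default_block_size=5 * 2**20,    # read in 5 MiB blocks
    default_cache_type="readahead",  # fsspec readahead cache
)

# "my-bucket/data.parquet" is a placeholder path.
with fs.open("my-bucket/data.parquet", "rb") as f:
    header = f.read(4)  # small read, served from the readahead block
```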
David Li / @lidavidm:
@westonpace probably has better context here, but from what I understand, s3fs does readahead by default; PyArrow's filesystems do not. And since I don't think we enable pre-buffering by default, and the Parquet reader issues a separate I/O call for each column chunk, that's O(row groups * columns) read operations, which presumably get absorbed by s3fs's readahead, but which lead to individual HTTP requests on the PyArrow filesystem. (This is mostly an educated guess; I haven't actually sat down and profiled.)
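To illustrate the pre-buffering being discussed, here is a minimal sketch of opting into it from Python (the region and path are placeholders, not from the original report):

```python
import pyarrow.fs as fs
import pyarrow.parquet as pq

s3 = fs.S3FileSystem(region="us-east-1")  # placeholder region

# With pre_buffer=True, the Parquet reader coalesces the byte ranges
# it needs for a row group into fewer, larger requests, instead of
# issuing one I/O call per column chunk. That matters most on
# high-latency filesystems like S3.
table = pq.read_table(
    "my-bucket/data.parquet",  # placeholder path
    filesystem=s3,
    pre_buffer=True,
)
```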
David Li / @lidavidm: That said, assuming that is the cause, I don't think we necessarily want to implement readahead; I think we just need better integration between the file readers and the I/O layer. Somewhat related: ARROW-17917 and ARROW-17913 are similar issues, where the I/O strategy needs to depend on the characteristics of the underlying filesystem.
Weston Pace / @westonpace: I think David's right. If you know you're going to read the entire Parquet file, then you can be more efficient about it. If the file is only 10 KB, then for peak performance you should issue only one read request.
However, this will use much more RAM if you have large files (e.g. multiple GBs) and will perform worse if you only want to read parts of those large files (e.g. with column selection).
So I agree there is room for optimization. It's just not going to be clear-cut and simple.
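To make that trade-off concrete: when only a few columns of a wide file are needed, a column-selecting read fetches just those column chunks' byte ranges, so buffering the whole object would waste bandwidth and RAM. A sketch, with hypothetical column names and paths:

```python
import pyarrow.fs as fs
import pyarrow.parquet as pq

s3 = fs.S3FileSystem(region="us-east-1")  # placeholder region

# Reading two columns of a wide, multi-GB file: the reader only
# fetches those columns' chunks. A whole-file readahead strategy
# would download data that is never used.
table = pq.read_table(
    "my-bucket/big.parquet",    # placeholder path
    filesystem=s3,
    columns=["user_id", "ts"],  # hypothetical column names
)
```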
Joris Van den Bossche / @jorisvandenbossche: We (somewhat inconsistently) enabled pre-buffer by default in pq.read_table, but not in ds.dataset. So I suppose that is also the difference you see between those (from 25s to 10s using pyarrow.fs.S3FileSystem), but even with pre-buffer enabled, it's still a bit slower than s3fs (for this specific case, of course).
I found large differences in loading time when loading data from AWS S3 using pyarrow.fs.S3FileSystem compared to s3fs.S3FileSystem. See the example below. The difference comes from an s3fs optimization that pyarrow.fs is not (yet) using.
Reporter: Volker Lorrmann
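A minimal sketch of the comparison the report describes (the path, region, and any resulting timings are placeholders, not the reporter's original example):

```python
import time

import pyarrow.dataset as ds
import pyarrow.fs
import s3fs

path = "my-bucket/data.parquet"  # placeholder path

# Read through s3fs (fsspec): benefits from its readahead caching.
fs_s3fs = s3fs.S3FileSystem()
t0 = time.perf_counter()
table = ds.dataset(path, filesystem=fs_s3fs).to_table()
print("s3fs:", time.perf_counter() - t0)

# Read through PyArrow's native S3 filesystem: no readahead, so
# per-column-chunk reads become individual HTTP requests.
fs_native = pyarrow.fs.S3FileSystem(region="us-east-1")  # placeholder region
t0 = time.perf_counter()
table = ds.dataset(path, filesystem=fs_native).to_table()
print("pyarrow.fs:", time.perf_counter() - t0)
```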
Externally tracked issue: https://github.com/apache/arrow/issues/14336
Note: This issue was originally created as ARROW-17961. Please see the migration documentation for further details.