
[C++] Add read/write optimization for pyarrow.fs.S3FileSystem #33169

Open asfimport opened 2 years ago

asfimport commented 2 years ago

I found large differences in loading time when loading data from AWS S3 using pyarrow.fs.S3FileSystem compared to s3fs.S3FileSystem. See the example below.

The difference comes from optimizations in s3fs that pyarrow.fs is not (yet) using.


import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pyarrow.fs as pafs
import s3fs
from load_credentials import load_credentials  # user-defined helper returning AWS credentials

credentials = load_credentials()
path = "path/to/data" # folder with about 300 small (~10kb) files

fs1 = s3fs.S3FileSystem(
    anon=False,
    key=credentials["accessKeyId"],
    secret=credentials["secretAccessKey"],
    token=credentials["sessionToken"],
)

fs2 = pafs.S3FileSystem(
    access_key=credentials["accessKeyId"],
    secret_key=credentials["secretAccessKey"],
    session_token=credentials["sessionToken"],
)

_ = ds.dataset(path, filesystem=fs1).to_table() # takes about 5 seconds

_ = ds.dataset(path, filesystem=fs2).to_table() # takes about 25 seconds

_ = pq.read_table(path, filesystem=fs1) # takes about 5 seconds

_ = pq.read_table(path, filesystem=fs2) # takes about 10 seconds

 

Reporter: Volker Lorrmann

Externally tracked issue: https://github.com/apache/arrow/issues/14336

Note: This issue was originally created as ARROW-17961. Please see the migration documentation for further details.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Which "optimization" is that?

asfimport commented 2 years ago

Jacob Wujciak / @assignUser: Things like readahead and metadata caching; cc @lidavidm for details.

asfimport commented 2 years ago

David Li / @lidavidm: @westonpace probably has better context here, but from what I understand, s3fs does readahead by default; PyArrow's filesystems do not. And since I don't think we enable pre-buffering by default, the Parquet reader issues a separate I/O call for each column chunk; that's O(row groups * columns) read operations, which presumably get absorbed by s3fs's readahead but lead to individual HTTP requests on the PyArrow filesystem. (This is mostly an educated guess; I haven't actually sat down and profiled.)
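
To make the scale of that concrete, here is a rough sketch of inspecting a single file's metadata; it assumes the fs2 filesystem from the original example and a hypothetical object key. Without pre-buffering, the reader issues roughly one ranged read per column chunk, i.e. about num_row_groups * num_columns requests per file.

import pyarrow.parquet as pq

# Rough sketch: hypothetical key, `fs2` as defined in the original example.
# Without pre-buffering, reads are roughly one per column chunk,
# i.e. about num_row_groups * num_columns requests for this file.
with fs2.open_input_file("bucket/path/to/data/part-0.parquet") as f:
    md = pq.ParquetFile(f).metadata
    print(md.num_row_groups, md.num_columns, md.num_row_groups * md.num_columns)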

asfimport commented 2 years ago

David Li / @lidavidm: That said, assuming that is the cause, I don't think we necessarily want to implement readahead; I think we just need better integration between the file readers and the I/O layer. Somewhat related, ARROW-17917 and ARROW-17913 are similar issues, where the I/O strategy needs to depend on the characteristics of the underlying filesystem.

asfimport commented 2 years ago

Weston Pace / @westonpace: I think David's right. If you know you're going to read the entire parquet file then you can be more efficient about it. If the file is only 10kb then for peak performance you should only issue one read request.

However, this will use much more RAM if you have large files (e.g. multiple GBs) and will have worse performance if you only want to read parts of those large files (e.g. column selection).

So I agree there is room for optimization. It's just not going to be clear-cut and simple.
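
For small files, that single-request approach can be approximated today by fetching the whole object first and parsing it from memory. A minimal sketch, assuming fs2 from the original example and a hypothetical object key:

import pyarrow as pa
import pyarrow.parquet as pq

# Sketch: hypothetical key, `fs2` as defined in the original example.
# Fetch the whole small object with one request, then parse it in memory,
# instead of letting the reader issue many small ranged reads.
with fs2.open_input_stream("bucket/path/to/data/part-0.parquet") as f:
    payload = f.read()  # single GET for the entire object
table = pq.read_table(pa.BufferReader(payload))

The trade-off described above applies directly: this is only attractive when files are small enough to hold in memory and you want all of their columns.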

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: We (somewhat inconsistently) enabled pre-buffer by default in pq.read_table, but not in ds.dataset. So I suppose that also explains the difference you see between those two (from 25s to 10s using pyarrow.fs.S3FileSystem), but even with pre-buffer enabled, it's still a bit slower than s3fs (for this specific case, of course).
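
For reference, pre-buffering can also be enabled explicitly on the dataset path; a sketch, assuming the path and fs2 variables from the original example:

import pyarrow.dataset as ds

# Sketch using `path` and `fs2` from the original example: opt the dataset
# scan into Parquet pre-buffering, which coalesces column-chunk reads into
# fewer, larger S3 requests.
parquet_format = ds.ParquetFileFormat(
    default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
)
_ = ds.dataset(path, filesystem=fs2, format=parquet_format).to_table()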