apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Push down projection and selection to S3 Select #18506

Open asfimport opened 3 years ago

asfimport commented 3 years ago

Amazon S3 Select [1], an S3 feature generally available since April 2018 [2], can improve S3 read performance by allowing S3 clients to use a limited subset of SQL to specify projection and selection [3] on data in some formats [4]. It would be interesting to try using this in Arrow and to measure its effects on S3 read performance under various conditions.

[1] https://aws.amazon.com/blogs/aws/s3-glacier-select/

[2] https://aws.amazon.com/about-aws/whats-new/2018/04/amazon-s3-select-is-now-generally-available/

[3] https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html

[4] https://docs.aws.amazon.com/cli/latest/reference/s3api/select-object-content.html
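For illustration, here is a minimal sketch of what pushing a projection and a selection down to S3 Select might look like from a client's point of view. It only builds the keyword arguments for boto3's `select_object_content` call; the helper name `build_select_request`, the bucket, the key, and the column names are all hypothetical, and the input is assumed to be an uncompressed CSV object with a header row.

```python
def build_select_request(bucket, key, columns, predicate=None):
    """Build keyword arguments for boto3's S3 client select_object_content().

    Hypothetical helper: column names are quoted naively and are assumed
    not to contain double quotes.
    """
    # Projection: only the named columns are returned by S3.
    projection = ", ".join(f's."{name}"' for name in columns)
    expression = f"SELECT {projection} FROM S3Object s"
    # Selection: an optional WHERE clause filters rows on the S3 side.
    if predicate:
        expression += f" WHERE {predicate}"
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Assumes a CSV object whose first line holds the column names.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


# Example values below are made up for illustration.
request = build_select_request(
    "example-bucket",
    "data/trips.csv",
    ["city", "fare"],
    'CAST(s."fare" AS FLOAT) > 50',
)
print(request["Expression"])
```

With boto3 installed and credentials configured, the resulting dictionary could be passed as `boto3.client("s3").select_object_content(**request)`.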

Reporter: Ian Cook / @ianmcook

Note: This issue was originally created as ARROW-11558. Please see the migration documentation for further details.

asfimport commented 3 years ago

Will Jones / @wjones127: I am also interested to see in which situations S3 Select is beneficial.

However, I wonder which parts you think should be part of the Arrow library.

From what I can tell, the S3 Select endpoints give back a stream of JSON or CSV, which you could probably deserialize with the existing Arrow JSON and CSV readers. So this might be functionality you could build using Arrow, rather than something that needs to be built into Arrow. In fact, much of this might be more appropriate to have in AWS Data Wrangler, which already uses the Arrow library for reading Parquet from S3.
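As a rough sketch of that idea (not something that exists in Arrow): the event stream returned by boto3's `select_object_content` yields dictionaries, some of which carry a `Records` payload of raw bytes. Those bytes can be concatenated and handed to `pyarrow.csv.read_csv`. The helper names below are hypothetical, the event list is a hand-written stand-in for `response["Payload"]`, and the column names are assumed to be known to the caller, since S3 Select CSV output carries no header row.

```python
import io


def collect_record_bytes(events):
    """Concatenate the payloads of all Records events in a select stream."""
    chunks = []
    for event in events:
        records = event.get("Records")
        if records is not None:
            chunks.append(records["Payload"])
    # A chunk may end in the middle of a row, so join before parsing.
    return b"".join(chunks)


def csv_bytes_to_table(data, column_names):
    """Parse header-less CSV bytes into a pyarrow.Table."""
    # Imported lazily so the byte-collection step has no pyarrow dependency.
    import pyarrow.csv as pacsv

    read_options = pacsv.ReadOptions(column_names=column_names)
    return pacsv.read_csv(io.BytesIO(data), read_options=read_options)


# Hand-written stand-in for response["Payload"]; the values are made up.
fake_events = [
    {"Records": {"Payload": b"Boston,61.5\nChi"}},
    {"Records": {"Payload": b"cago,72.0\n"}},
    {"Stats": {"Details": {"BytesReturned": 25}}},
    {"End": {}},
]

raw = collect_record_bytes(fake_events)
# With pyarrow installed: table = csv_bytes_to_table(raw, ["city", "fare"])
```

Buffering the whole result before parsing keeps the sketch short; for large results, a streaming reader would presumably be a better fit.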

asfimport commented 3 years ago

Ian Cook / @ianmcook: Thanks @wjones127. Yep, apparently the S3 Select output serialization formats are currently limited to CSV and JSON. I followed this chain of links to confirm this:

  1. S3 Select user guide: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html
  2. SelectObjectContent API reference page: https://docs.aws.amazon.com/AmazonS3/latest/API/API_SelectObjectContent.html
  3. OutputSerialization API reference page: https://docs.aws.amazon.com/AmazonS3/latest/API/API_OutputSerialization.html (see only CSV and JSON listed there)

This combined with the limited set of object data file formats, encodings, and compression formats that S3 Select supports certainly makes the practical applications of S3 Select within Arrow fairly narrow. However, it might still be worth considering whether S3 Select could improve the speed and reduce the cost of retrieving data from S3 when Arrow is running outside AWS, for example when the user wants to use Arrow to select very small numbers of records/fields from very large sets of data files. But it might be that the complexity of implementing this in Arrow is not warranted given the narrow range of practical applications.

asfimport commented 3 years ago

Ian Cook / @ianmcook: Amazon S3 Object Lambda (announced today at https://aws.amazon.com/blogs/aws/introducing-amazon-s3-object-lambda-use-your-code-to-process-data-as-it-is-being-retrieved-from-s3/) seems like a better way to achieve the goals described here.

deepzliu commented 1 year ago

hi, any progress on this issue?

westonpace commented 1 year ago

> hi, any progress on this issue?

No, as @ianmcook summarized:

> This combined with the limited set of object data file formats, encodings, and compression formats that S3 Select supports certainly makes the practical applications of S3 Select within Arrow fairly narrow

Given this, I don't think there has been much motivation to try it.