apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Implement vectored IO in parquet file format #2703

Closed: asfimport closed this issue 7 months ago

asfimport commented 2 years ago

We recently added a new feature called vectored IO to Hadoop to improve read performance for seek-heavy readers. Spark jobs and other workloads that use Parquet will benefit greatly from this API. Details can be found here:

https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5

https://issues.apache.org/jira/browse/HADOOP-18103

https://issues.apache.org/jira/browse/HADOOP-11867
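
For readers new to the API, here is a minimal sketch of how a seek-heavy reader would consume it, based on the interfaces introduced by HADOOP-18103 (FileRange and PositionedReadable.readVectored). The method names follow the Hadoop 3.3.5 line and may differ in other releases, and the offsets/lengths below are purely illustrative:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VectoredReadSketch {
  public static void main(String[] args) throws Exception {
    Path file = new Path(args[0]);
    FileSystem fs = file.getFileSystem(new Configuration());

    // The ranges a columnar reader would want, e.g. several column chunks
    // scattered through a row group (offsets/lengths here are illustrative).
    List<FileRange> ranges = Arrays.asList(
        FileRange.createFileRange(4L, 1024),
        FileRange.createFileRange(1_048_576L, 8192),
        FileRange.createFileRange(4_194_304L, 16384));

    try (FSDataInputStream in = fs.open(file)) {
      // One call hands all ranges to the filesystem at once.
      in.readVectored(ranges, ByteBuffer::allocate);

      // Each range completes independently; block on the futures as needed.
      for (FileRange range : ranges) {
        ByteBuffer data = range.getData().get();
        System.out.println("offset " + range.getOffset()
            + " read " + data.remaining() + " bytes");
      }
    }
  }
}
```

The point of the single readVectored call is that an object store connector can coalesce nearby ranges and fetch them in parallel, instead of seeing a sequence of unrelated seek-plus-read calls.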

Reporter: Mukund Thakur
Assignee: Steve Loughran / @steveloughran

Related issues:

Note: This issue was originally created as PARQUET-2171. Please see the migration documentation for further details.

asfimport commented 2 years ago

Mukund Thakur: CC [~stevel@apache.org] 

asfimport commented 2 years ago

Timothy Miller / @theosib-amazon: This might synergize well with the bulk I/O features I've been adding to ParquetMR. Some of the initial work is already in some PRs, and the rest of the plan can be found at https://docs.google.com/document/d/1fBGpF_LgtfaeHnPD5CFEIpA2Ga_lTITmFdFIcO9Af-g/edit?usp=sharing

I determined what to optimize from profiling, and I have run experiments on the new implementation. I glanced through your Hadoop commits, and I noticed that you use ByteBuffer a lot. I have found ByteBuffer to impose a nontrivial amount of overhead, and you might want to consider providing array-based methods as well.
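
To make that suggestion concrete, here is a hypothetical shape such an API could take; the RangeReader name and methods below are illustrative only, not an existing Parquet or Hadoop interface. The idea is a ByteBuffer method plus an array-based twin, so callers that already hold byte arrays can skip the buffer wrapping:

```java
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical reader interface sketching the suggestion above: offer an
// array-based method next to the ByteBuffer one so array-backed callers
// can avoid ByteBuffer bookkeeping. Not an existing API.
public interface RangeReader {

  /** Read fully into the buffer between its position and limit. */
  void readFully(long offset, ByteBuffer buffer) throws IOException;

  /** Array-based twin: read len bytes at offset into dst starting at dstOff. */
  default void readFully(long offset, byte[] dst, int dstOff, int len)
      throws IOException {
    // Fallback for ByteBuffer-only implementations; a dedicated array path
    // in the implementation can skip this wrapping entirely.
    readFully(offset, ByteBuffer.wrap(dst, dstOff, len));
  }
}
```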

asfimport commented 2 years ago

Steve Loughran / @steveloughran:

> I have found ByteBuffer to impose a nontrivial amount of overhead, and you might want to consider providing array-based methods as well.

Mixed feelings. It's hard to work with, but some libraries (Parquet...) love it, which partly drove our use of it. If you use on-heap buffers, it's just arrays with more hassle.

FWIW, I was looking at some of the Parquet read code and concluded that the s3a FS should implement read(ByteBuffer) as a single vectored IO read. Currently the base class implementation reads into a temporary byte array and so breaks prefetching: the s3a FS only sees the read(bytes) of the shorter array, not the full amount wanted.
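
As an illustration of the problem described above (simplified, not the actual Hadoop or s3a code): a generic read(ByteBuffer) fallback that copies through a temporary array only ever presents small read(byte[]) calls to the underlying stream, so the filesystem never sees the full range the caller wants and cannot turn it into one coalesced or prefetched request.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Simplified illustration (not the real Hadoop classes) of the fallback
// being described: a generic read(ByteBuffer) that bounces through a
// temporary byte[] only asks the underlying stream for small chunks, so an
// object-store client never learns the full range the caller wants.
class ByteArrayFallbackReader {

  private final InputStream in; // stands in for the concrete FS stream

  ByteArrayFallbackReader(InputStream in) {
    this.in = in;
  }

  int read(ByteBuffer buf) throws IOException {
    // Temporary array capped at some internal chunk size: the FS sees a
    // sequence of small read(byte[]) calls instead of one big request.
    byte[] tmp = new byte[Math.min(buf.remaining(), 8192)];
    int n = in.read(tmp, 0, tmp.length);
    if (n > 0) {
      buf.put(tmp, 0, n);
    }
    return n;
  }
}
```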

asfimport commented 2 years ago

Timothy Miller / @theosib-amazon: The Parquet reader has two phases of reading. The first does the raw I/O and decompression; someone is working on an asynchronous implementation of this, which should help a lot. The second phase works on the output of the first, providing higher-level data types. My PRs improve the second phase by eliminating LittleEndianInputStream, which was very inefficient, and by making some other improvements in the most critical paths. All of these improvements are incremental, of course, and we're happy to get contributions that improve on this further.
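
For context on the kind of overhead being removed, here is a generic illustration (not parquet-mr's actual classes): assembling each little-endian value from individual single-byte stream reads versus bulk-decoding through a little-endian ByteBuffer view.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Generic illustration (not parquet-mr's actual code) of the overhead class
// being discussed: building each little-endian int from four separate
// single-byte stream reads versus decoding in bulk from a ByteBuffer view.
class LittleEndianDecodeExample {

  // Stream style: four read() calls, an EOF check, and manual shifting
  // for every single value decoded.
  static int readIntFromStream(InputStream in) throws IOException {
    int b0 = in.read(), b1 = in.read(), b2 = in.read(), b3 = in.read();
    if ((b0 | b1 | b2 | b3) < 0) {
      throw new EOFException();
    }
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
  }

  // Bulk style: one little-endian view over the bytes, one getInt per value.
  static int[] readIntsBulk(byte[] src, int count) {
    ByteBuffer buf = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN);
    int[] out = new int[count];
    for (int i = 0; i < count; i++) {
      out[i] = buf.getInt();
    }
    return out;
  }
}
```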

asfimport commented 1 year ago

Steve Loughran / @steveloughran: Mukund, is there a PR up for this? Even though it's not going to be merged, it needs to be shared so that others can pick it up.

asfimport commented 1 year ago

Mukund Thakur: Not yet Steve. I plan to do it soon.