Closed by asfimport 7 months ago
Mukund Thakur: CC [~stevel@apache.org]
Timothy Miller / @theosib-amazon: This might synergize well with the bulk I/O features I've been adding to ParquetMR. Some of the initial work is already in some PRs, and the rest of the plan can be found at https://docs.google.com/document/d/1fBGpF_LgtfaeHnPD5CFEIpA2Ga_lTITmFdFIcO9Af-g/edit?usp=sharing
I determined what to optimize from profiling, and I have run experiments on the new implementation. I glanced through your Hadoop commits, and I noticed that you use ByteBuffer a lot. I have found ByteBuffer to impose a nontrivial amount of overhead, and you might want to consider providing array-based methods as well.
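The overhead point can be illustrated with a small self-contained comparison (a hypothetical sketch, not code from the PRs mentioned above): decoding the same little-endian ints through a heap `ByteBuffer` versus directly from its backing array.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BufferVsArray {

    // Decode little-endian ints through a heap ByteBuffer; every getInt()
    // goes through the buffer's position/bounds checking machinery.
    public static long sumViaBuffer(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        long sum = 0;
        while (buf.remaining() >= Integer.BYTES) {
            sum += buf.getInt();
        }
        return sum;
    }

    // Decode the same ints directly from the backing array: plain array
    // indexing, which the JIT can typically optimize more aggressively.
    public static long sumViaArray(byte[] data) {
        long sum = 0;
        for (int i = 0; i + Integer.BYTES <= data.length; i += Integer.BYTES) {
            sum += (data[i] & 0xFF)
                    | (data[i + 1] & 0xFF) << 8
                    | (data[i + 2] & 0xFF) << 16
                    | (data[i + 3] & 0xFF) << 24;
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1024];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        // Both paths decode identical values; the array path simply avoids
        // the per-call ByteBuffer overhead.
        System.out.println(sumViaBuffer(data) == sumViaArray(data)); // true
    }
}
```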
Steve Loughran / @steveloughran:
I have found ByteBuffer to impose a nontrivial amount of overhead, and you might want to consider providing array-based methods as well.
Mixed feelings. It's hard to work with, but some libraries (parquet...) love it, which partly drove our use of it. If you use on-heap buffers, it's just arrays with more hassle.
FWIW, I was looking at some of the parquet read code and concluded that the S3A FS should implement read(ByteBuffer) as a single vectored IO read. Currently the base class implementation reads into a temporary byte array and so breaks prefetching: the S3A FS only sees the read(byte[]) of the shorter array, not the full amount wanted.
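The fallback behavior described here can be sketched in a few lines (a simplified illustration, not the actual Hadoop base-class code): a default read(ByteBuffer) that copies through a fixed-size temporary array, so the underlying stream only ever sees a short read(byte[]) request.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

public class FallbackRead {

    // Simplified sketch of a default read(ByteBuffer) that copies through a
    // temporary array. The wrapped stream never sees the full buffer request,
    // only the size of the temporary array, so it cannot prefetch or issue
    // one large (or vectored) read for the whole range.
    public static int read(InputStream in, ByteBuffer dst, int tmpSize)
            throws IOException {
        byte[] tmp = new byte[Math.min(tmpSize, dst.remaining())];
        int n = in.read(tmp, 0, tmp.length); // stream only sees tmp.length
        if (n > 0) {
            dst.put(tmp, 0, n);
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[8192]);
        ByteBuffer dst = ByteBuffer.allocate(8192);
        int n = read(in, dst, 1024); // caller wanted 8192 bytes...
        System.out.println(n);       // ...but one call moves at most 1024
    }
}
```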
Timothy Miller / @theosib-amazon: The parquet reader has two phases of reading. One does the raw I/O and decompression. Someone is working on an asynchronous implementation of this, which should help a lot. The second phase works on the output of that, providing higher-level data types. My PRs improve on this by eliminating LittleEndianInputStream, which was super inefficient, plus some other improvements in the most critical paths. All of these improvements are incremental, of course, and we're happy to get contributions that improve on this further.
Steve Loughran / @steveloughran: Mukund, is there a PR up for this? Even though it's not going to be merged, it needs to be shared for others to pick up.
Mukund Thakur: Not yet, Steve. I plan to do it soon.
We recently added a new feature called vectored IO in Hadoop for improving read performance for seek-heavy readers. Spark jobs and other workloads which use Parquet will greatly benefit from this API. Details can be found here:
https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5
https://issues.apache.org/jira/browse/HADOOP-18103
https://issues.apache.org/jira/browse/HADOOP-11867
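One core idea behind vectored IO is that nearby ranges requested by a seek-heavy reader can be coalesced into fewer, larger reads. The sketch below is a hypothetical, self-contained illustration of that coalescing step, not the Hadoop implementation; the `Range` record stands in for the FileRange objects the real API works with.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CoalesceRanges {

    // A file range: offset + length. A simplified stand-in for the range
    // objects a vectored-read API would accept.
    record Range(long offset, int length) {
        long end() { return offset + length; }
    }

    // Merge ranges whose gap is at most maxGap, so a seek-heavy read pattern
    // turns into fewer, larger requests (e.g. fewer object-store GETs).
    static List<Range> coalesce(List<Range> ranges, long maxGap) {
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong(Range::offset));
        List<Range> out = new ArrayList<>();
        for (Range r : sorted) {
            if (!out.isEmpty()
                    && r.offset() - out.get(out.size() - 1).end() <= maxGap) {
                Range last = out.remove(out.size() - 1);
                long end = Math.max(last.end(), r.end());
                out.add(new Range(last.offset(), (int) (end - last.offset())));
            } else {
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Range> reads = List.of(
                new Range(0, 100), new Range(150, 100), new Range(10_000, 200));
        // The gap between the first two ranges is 50 bytes, within maxGap=64,
        // so they merge into one read; the range at 10_000 stays separate.
        System.out.println(coalesce(reads, 64));
    }
}
```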
Reporter: Mukund Thakur
Assignee: Steve Loughran / @steveloughran
Note: This issue was originally created as PARQUET-2171. Please see the migration documentation for further details.