apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] Hardware optimizations for dictionary / RLE encoding/decoding #42513

Open asfimport opened 7 years ago

asfimport commented 7 years ago

See discussion in

https://github.com/apache/parquet-cpp/pull/140

and experiments from Daniel Lemire in

https://github.com/lemire/dictionary

Reporter: Wes McKinney / @wesm Assignee: Deepak Majeti / @majetideepak

Note: This issue was originally created as PARQUET-684. Please see the migration documentation for further details.

asfimport commented 7 years ago

Daniel Lemire: Relevant blog post:

https://github.com/lemire/dictionary

asfimport commented 7 years ago

Deepak Majeti / @majetideepak: I looked at the code and the blog briefly. The current implementation works for dictionary indices that are bit-packed. It will have to be extended to support the RLE/bit-packed hybrid encoding currently used by parquet-cpp to encode dictionary index values. Encoding details here: https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/rle-encoding.h#L33

I guess the RLE encoding of indices will further improve performance, since an RLE run needs only one dictionary lookup for the whole run and so does not require the costly gather instruction.
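
To make the point about RLE runs concrete, here is a minimal scalar sketch of decoding hybrid-encoded dictionary indices. It is not the parquet-cpp implementation: `DecodeDictHybrid` is a hypothetical helper, and it assumes the run layout described in the Parquet format (a varint header whose low bit selects bit-packed or RLE, bit-packed values stored LSB first in groups of 8, and the RLE value stored in ceil(bit_width / 8) little-endian bytes). The RLE branch does a single dictionary lookup followed by a fill, while the bit-packed branch does one lookup per value, which is the part a SIMD version would turn into a gather.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a parquet-cpp API): decodes RLE/bit-packed hybrid
// dictionary indices from `buf` and materializes up to `num_values` values
// from `dict`. Stops early if the buffer is truncated.
template <typename T>
std::vector<T> DecodeDictHybrid(const std::vector<uint8_t>& buf, int bit_width,
                                const std::vector<T>& dict, size_t num_values) {
  assert(bit_width >= 1 && bit_width <= 32);
  const uint64_t mask = (uint64_t{1} << bit_width) - 1;
  std::vector<T> out;
  out.reserve(num_values);
  size_t pos = 0;

  while (out.size() < num_values && pos < buf.size()) {
    // Run header: unsigned LEB128 varint.
    uint32_t header = 0;
    int shift = 0;
    while (pos < buf.size() && shift < 32) {
      const uint8_t b = buf[pos++];
      header |= static_cast<uint32_t>(b & 0x7F) << shift;
      if ((b & 0x80) == 0) break;
      shift += 7;
    }

    if (header & 1) {
      // Bit-packed run: (header >> 1) groups of 8 indices, LSB first.
      const size_t count = static_cast<size_t>(header >> 1) * 8;
      uint64_t bits = 0;
      int nbits = 0;
      for (size_t i = 0; i < count; ++i) {
        while (nbits < bit_width && pos < buf.size()) {
          bits |= static_cast<uint64_t>(buf[pos++]) << nbits;
          nbits += 8;
        }
        if (nbits < bit_width) return out;  // truncated input
        const uint32_t idx = static_cast<uint32_t>(bits & mask);
        bits >>= bit_width;
        nbits -= bit_width;
        // One dictionary lookup per value; trailing padding is skipped.
        if (out.size() < num_values) out.push_back(dict.at(idx));
      }
    } else {
      // RLE run: (header >> 1) repetitions of a single index.
      const size_t count = header >> 1;
      const int nbytes = (bit_width + 7) / 8;
      if (pos + nbytes > buf.size()) return out;  // truncated input
      uint32_t idx = 0;
      for (int i = 0; i < nbytes; ++i) {
        idx |= static_cast<uint32_t>(buf[pos++]) << (8 * i);
      }
      // One dictionary lookup for the whole run, then a plain fill.
      const T value = dict.at(idx);
      out.insert(out.end(), std::min(count, num_values - out.size()), value);
    }
  }
  return out;
}
```

For example, with a 4-entry dictionary and `bit_width = 2`, the bytes `{0x0A, 0x02, 0x03, 0xE4, 0x1B}` encode an RLE run of five copies of index 2 followed by one bit-packed group holding indices 0, 1, 2, 3, 3, 2, 1, 0.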