apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.48k stars 3.52k forks source link

[C++][Parquet] Implement and test BIT_PACKED level encoding / decoding #42345

Closed asfimport closed 8 years ago

asfimport commented 8 years ago

While RLE is the preferred encoding format (and BIT_PACKED is deprecated in Parquet 2.0), we will need to support this encoding format for legacy Parquet files that use it. As part of this JIRA we will verify round-tripping levels to this encoding format.

See also PARQUET-462

Reporter: Wes McKinney / @wesm Assignee: Deepak Majeti / @majetideepak

Note: This issue was originally created as PARQUET-467. Please see the migration documentation for further details.

asfimport commented 8 years ago

Wes McKinney / @wesm: Per PARQUET-462 we can go ahead and implement this level encoding. Will leave this issue open until it's been tested sufficiently

asfimport commented 8 years ago

Deepak Majeti / @majetideepak: Doesn't the rle decoder from Impala also handle bit_packed ?

asfimport commented 8 years ago

Wes McKinney / @wesm: No, not the BIT_PACKED encoding as specified. The hybrid RLE / bit-packed encoding has indicator values prefixing each RLE or literal run telling you which decoding code path to use for each chunk of levels. In the case of BIT_PACKED there are no such indicator values IIUC.

asfimport commented 8 years ago

Deepak Majeti / @majetideepak: I thought the Impala code uses a short cut to read bit-packed values via the BitReader instance in rle decoder. Can you please verify ? https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/hdfs-parquet-scanner.cc#L1247 https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/hdfs-parquet-scanner.cc#L1269 Does the BIT_PACKED referenced above have a different spec ?

asfimport commented 8 years ago

Wes McKinney / @wesm: Please see https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/hdfs-parquet-scanner.cc#L195 and let me know if not clear

asfimport commented 8 years ago

Deepak Majeti / @majetideepak: I still don't get it. Can't we use the BitReader to decode BIT_PACKED values? It looks like even the boolean decoder uses the same. https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/hdfs-parquet-scanner.cc#L860 I believe PARQUET-454 can be fixed using BitReader as well.

asfimport commented 8 years ago

Wes McKinney / @wesm: Yes:

The Impala code has a LevelDecoder class extending RleDecoder which calls into BitReader where appropriate. I would prefer to create a clean abstraction (i.e. do not use inheritance) that doesn't conflate the two encoding styles (which are different: BIT_PACKED is not RLE encoding, even though the RLE encoding uses bitpacking to save space). I would be happy to take on these JIRAs if it is not clear, please let me know.

asfimport commented 8 years ago

Deepak Majeti / @majetideepak: I am sorry, but I am confused with the scope of this JIRA. If we already have the necessary bits for BIT_PACKED encoding/ decoding, what are we planning to implement here?

asfimport commented 8 years ago

Wes McKinney / @wesm: If you try to read a file that uses this encoding, BIT_PACKED is not implemented. There are also no tests to verify whether or not it works.

asfimport commented 8 years ago

Deepak Majeti / @majetideepak: The word "implement" threw me off track and for a moment I assumed you were re-implementing BitReader. Can I support BIT_PACKED in PARQUET-169 and use this JIRA to add the test cases ?

asfimport commented 8 years ago

Wes McKinney / @wesm: Yes (both JIRAs need test cases, PARQUET-485 will make that easier/possible).

asfimport commented 8 years ago

Julien Le Dem / @julienledem: We expect the tests for a given implementation to be submitted together with the implementation. "and test" is implicit.

asfimport commented 8 years ago

Deepak Majeti / @majetideepak: Issue resolved by pull request 30 https://github.com/apache/parquet-cpp/pull/30