Open asfimport opened 2 years ago
Balaji K: I now see this extra text under Dictionary encoding:
"Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32). Followed by the values encoded using RLE/Bit packed"
However, I think that prefixing the data with the length, as the hybrid RLE section describes, would not produce a correct or readable data page. The implementation writes the bit width followed by the data.
Would appreciate some clarity on this subject and I can help update the docs :)
The [spec for RLE Dictionary|https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8] encoding says that the "length of the encoded-data" is placed before the "encoded-data". Reproducing the first 3 lines here:
However, this does not match the implementation. Parquet-MR does not write the length in front of the data; it writes the bit width (bitWidth) as 1 byte, followed by the encoded data. See the [implementation|https://github.com/apache/parquet-mr/blob/01a5d074829ad4cf4de1f662d54fe7bceb4bef63/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L173].
I'm proposing the spec be updated to state the above clearly.
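To make the layout in question concrete, here is a minimal Python sketch of a dictionary-encoded data page body as described above: 1 byte for the bit width, followed directly by RLE/bit-packed hybrid runs, with no length prefix. The function names are hypothetical and are not taken from Parquet-MR. For brevity the encoder emits only RLE runs (a valid subset of the hybrid encoding), while the decoder handles both run types.

```python
def write_uvarint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_uvarint(buf: bytes, pos: int):
    """Decode an unsigned LEB128 varint; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_dictionary_ids(ids, bit_width: int) -> bytes:
    """Hypothetical helper: 1-byte bit width, then RLE runs, no length prefix."""
    value_bytes = (bit_width + 7) // 8  # repeated value is padded to whole bytes
    out = bytearray([bit_width])
    i = 0
    while i < len(ids):
        j = i
        while j < len(ids) and ids[j] == ids[i]:
            j += 1
        out += write_uvarint((j - i) << 1)  # low bit 0 marks an RLE run
        out += ids[i].to_bytes(value_bytes, "little")
        i = j
    return bytes(out)


def decode_dictionary_ids(buf: bytes, num_values: int):
    """Hypothetical helper: read the bit width byte, then hybrid runs."""
    bit_width = buf[0]
    value_bytes = (bit_width + 7) // 8
    mask = (1 << bit_width) - 1
    pos = 1
    ids = []
    while len(ids) < num_values:
        header, pos = read_uvarint(buf, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values, packed LSB first.
            groups = header >> 1
            nbytes = groups * bit_width
            bits = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            for k in range(groups * 8):
                ids.append((bits >> (k * bit_width)) & mask)
        else:
            # RLE run: one value repeated (header >> 1) times.
            value = int.from_bytes(buf[pos:pos + value_bytes], "little")
            pos += value_bytes
            ids.extend([value] * (header >> 1))
    return ids[:num_values]  # a bit-packed run may carry padding values


page = encode_dictionary_ids([0, 0, 0, 1, 1, 2], bit_width=2)
print(page.hex())  # the first byte is the bit width, not a length
```

Note that the decoder needs the value count from outside the buffer (in Parquet it comes from the page header), which is why the body can be read without a length prefix.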
See the discussion here:
https://lists.apache.org/thread/p45tpjd5r03qbswtpr7xfy072josnjxs
Reporter: Balaji K
Note: This issue was originally created as PARQUET-2108. Please see the migration documentation for further details.