apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Specification for RLEDictionary encoding is incorrect. #2660

Open asfimport opened 2 years ago

asfimport commented 2 years ago

The [spec for RLE Dictionary](https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8) encoding says the "length of the encoded-data" is placed before the "encoded-data". Reproducing the first 3 lines here:


rle-bit-packed-hybrid: <length> <encoded-data>

length := length of the <encoded-data> in bytes stored as 4 bytes little endian (unsigned int32)

encoded-data := <run>*

However, this does not match the implementation. For dictionary-encoded data, Parquet-MR does not write the length in front of the data; it writes the bit width as 1 byte instead. See the [implementation](https://github.com/apache/parquet-mr/blob/01a5d074829ad4cf4de1f662d54fe7bceb4bef63/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L173).
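To make the difference concrete, here is a minimal Python sketch of the two layouts for the same encoded runs. It is an illustration only, not Parquet-MR code: the helper names (`encode_varint`, `encode_rle_run`, `length_prefixed`, `bit_width_prefixed`) are hypothetical, and it assumes the run format from Encodings.md, where an RLE run is a varint header of `count << 1` followed by the repeated value in `ceil(bit_width / 8)` little-endian bytes.

```python
import struct


def encode_varint(n: int) -> bytes:
    """Unsigned LEB128 varint, as assumed for hybrid run headers."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_rle_run(value: int, count: int, bit_width: int) -> bytes:
    """One RLE run: header = count << 1 (low bit 0), then the repeated value."""
    value_bytes = (bit_width + 7) // 8
    return encode_varint(count << 1) + value.to_bytes(value_bytes, "little")


def length_prefixed(encoded: bytes) -> bytes:
    """Layout from the quoted grammar: <length> <encoded-data>."""
    return struct.pack("<I", len(encoded)) + encoded


def bit_width_prefixed(bit_width: int, encoded: bytes) -> bytes:
    """Layout reported in this issue: <bit-width as 1 byte> <encoded-data>."""
    return bytes([bit_width]) + encoded


# Ten copies of dictionary id 5, using a bit width of 3.
runs = encode_rle_run(value=5, count=10, bit_width=3)
print(length_prefixed(runs).hex())        # 020000001405
print(bit_width_prefixed(3, runs).hex())  # 031405
```

A reader that expects one layout and receives the other would interpret the first byte differently (low byte of a length versus a bit width), which is why the spec wording matters here.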

I'm proposing the spec be updated to state the above clearly.

See the discussion here:

https://lists.apache.org/thread/p45tpjd5r03qbswtpr7xfy072josnjxs

 

Reporter: Balaji K

Note: This issue was originally created as PARQUET-2108. Please see the migration documentation for further details.

asfimport commented 2 years ago

Balaji K: I now see this extra text under Dictionary encoding:

"Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32). Followed by the values encoded using RLE/Bit packed" 

However, I think prepending the length as described for the hybrid RLE algorithm would not produce a correct or readable data page. The implementation writes the bit width followed by the data.
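As a rough illustration of the "bit width + data" layout from the reader's side, the following Python sketch decodes dictionary ids from a page body that starts with a 1-byte bit width and continues directly with hybrid runs, with no 4-byte length in between. The function names are hypothetical, `num_values` is assumed to come from the page header, and the run format (varint header, low bit selecting RLE or bit-packed, bit-packed values stored least significant bit first in groups of 8) is taken from Encodings.md rather than from the Parquet-MR source.

```python
def read_varint(buf: bytes, pos: int):
    """Read an unsigned LEB128 varint; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_dictionary_ids(page: bytes, num_values: int) -> list:
    """Decode ids from <bit-width (1 byte)> <run>* with no length prefix."""
    bit_width = page[0]
    pos = 1
    value_bytes = (bit_width + 7) // 8
    mask = (1 << bit_width) - 1
    ids = []
    while len(ids) < num_values:
        header, pos = read_varint(page, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values each.
            groups = header >> 1
            nbytes = groups * bit_width
            bits = int.from_bytes(page[pos:pos + nbytes], "little")
            pos += nbytes
            for i in range(groups * 8):
                ids.append((bits >> (i * bit_width)) & mask)
        else:
            # RLE run: one value repeated (header >> 1) times.
            count = header >> 1
            value = int.from_bytes(page[pos:pos + value_bytes], "little")
            pos += value_bytes
            ids.extend([value] * count)
    # A bit-packed run may be padded up to a multiple of 8 values.
    return ids[:num_values]


# Bit width 3, then an RLE run (ten 5s), then one bit-packed group (0..7).
page = bytes([0x03, 0x14, 0x05, 0x03, 0x88, 0xC6, 0xFA])
print(decode_dictionary_ids(page, 18))
```

If a 4-byte length were inserted after the bit width, this reader would treat its first byte as a run header and decode garbage, which matches the concern raised above.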

Would appreciate some clarity on this subject and I can help update the docs :)