apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Improve the RLE encoding for Parquet Dictionary IDs #2073

Open asfimport opened 7 years ago

asfimport commented 7 years ago

Parquet dictionary encoding writes its dictionary IDs with RunLengthBitPackingHybridEncoder, which handles only repeated runs and bit-packing. We should improve it with a method like the one used by DeltaBinaryPackingWriter.

Reporter: Dapeng Sun / @sundapeng

Note: This issue was originally created as PARQUET-1059. Please see the migration documentation for further details.

asfimport commented 7 years ago

Uwe Korn / @xhochy: Can you describe a workload where this would bring a significant difference? A need for delta encoding in the dictionary indices rather suggests that the column has many distinct values.

asfimport commented 7 years ago

Dapeng Sun / @sundapeng: Hi @xhochy,

"Can you describe a workload where this would bring a significant difference?"

In my case, the column values may be increasing or decreasing, but the difference between adjacent values is very small, so the dictionary IDs may also be adjacent or close together. If the IDs were delta-encoded, I think it would save disk space.
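To make the workload above concrete, here is a rough sketch that estimates the payload size of fixed-width bit-packing versus a simplified delta packing for dictionary IDs that drift slowly upward. It is not the parquet-java implementation: the class `DictIdSizeSketch` and its methods are hypothetical, and the estimate ignores run-length runs, headers, and the block/miniblock layout that the real encoders use.

```java
// Hypothetical sketch, not parquet-java code: compares rough payload sizes only.
final class DictIdSizeSketch {

    // Number of bits needed to store a non-negative value (0 needs 0 bits).
    static int bitWidth(long value) {
        return 64 - Long.numberOfLeadingZeros(value);
    }

    // Fixed-width bit-packing: every ID costs as many bits as the largest ID.
    static long bitPackedBits(int[] ids) {
        int max = 0;
        for (int id : ids) {
            max = Math.max(max, id);
        }
        return (long) ids.length * bitWidth(max);
    }

    // Simplified delta packing (assumption: one block for the whole input):
    // 32 bits for the first value, 32 bits for the minimum delta, then each
    // (delta - minDelta) stored at the width of the largest adjusted delta.
    static long deltaPackedBits(int[] ids) {
        if (ids.length < 2) {
            return 32L * ids.length;
        }
        long minDelta = Long.MAX_VALUE;
        long maxDelta = Long.MIN_VALUE;
        for (int i = 1; i < ids.length; i++) {
            long delta = (long) ids[i] - ids[i - 1];
            minDelta = Math.min(minDelta, delta);
            maxDelta = Math.max(maxDelta, delta);
        }
        int width = bitWidth(maxDelta - minDelta);
        return 32L + 32L + (long) (ids.length - 1) * width;
    }

    // Illustrative input: IDs that rise slowly, with deltas cycling 0, 1, 2.
    static int[] nearlySortedIds(int count) {
        int[] ids = new int[count];
        for (int i = 0; i < count; i++) {
            ids[i] = i + (i % 3 == 0 ? 1 : 0);
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] ids = nearlySortedIds(1000);
        System.out.println("bit-packed bits: " + bitPackedBits(ids));
        System.out.println("delta-packed bits: " + deltaPackedBits(ids));
    }
}
```

For this input the sketch reports 10000 bits for bit-packing and 2062 bits for the simplified delta packing. The gap shrinks or reverses when IDs repeat in long runs (where run-length encoding wins) or jump around randomly, which is the concern raised in the earlier question.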

asfimport commented 7 years ago

Wes McKinney / @wesm: Since there are so many implementations of encoding and decoding the dictionary indices, changing the behavior of the encoder would likely not be possible without either introducing a new encoding type or reserving the change for a future major Parquet version.

asfimport commented 7 years ago

Dapeng Sun / @sundapeng: Hi @wesm, thank you for your comments. How about creating a new writer version, such as PARQUET_3_0 or PARQUET_2_1? I think this optimization would be easy to put into a new WRITE_VERSION.
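As a minimal sketch of the proposal above, the following shows how a writer version could gate the choice of ID encoding. Everything here is hypothetical: `WriterVersionSketch`, `IdEncoding`, and `SketchWriterVersion` are illustration-only names, `PARQUET_2_1` is the name suggested in this thread rather than an existing version, and none of this is the parquet-java API.

```java
// Hypothetical sketch: gating the dictionary-ID encoding on a writer version.
final class WriterVersionSketch {

    // Illustration-only names for the two candidate ID encodings.
    enum IdEncoding {
        RLE_BIT_PACKED_HYBRID,
        DELTA_BINARY_PACKED
    }

    // Illustration-only writer versions; PARQUET_2_1 is the proposed addition.
    enum SketchWriterVersion {
        PARQUET_1_0(IdEncoding.RLE_BIT_PACKED_HYBRID),
        PARQUET_2_0(IdEncoding.RLE_BIT_PACKED_HYBRID),
        PARQUET_2_1(IdEncoding.DELTA_BINARY_PACKED);

        private final IdEncoding dictionaryIdEncoding;

        SketchWriterVersion(IdEncoding dictionaryIdEncoding) {
            this.dictionaryIdEncoding = dictionaryIdEncoding;
        }

        // Existing versions keep their current behavior; only the new one changes.
        IdEncoding dictionaryIdEncoding() {
            return dictionaryIdEncoding;
        }
    }

    public static void main(String[] args) {
        for (SketchWriterVersion version : SketchWriterVersion.values()) {
            System.out.println(version + " -> " + version.dictionaryIdEncoding());
        }
    }
}
```

Keeping the existing versions unchanged means files written with them stay readable everywhere. Readers in every implementation would still need to recognize the new encoding before files written with the new version could be decoded, which is the compatibility concern raised above.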