Open asfimport opened 7 years ago
Uwe Korn / @xhochy: Can you describe a workload where this would bring a significant difference? The need for delta encoding of the dictionary indices rather suggests that the column has many distinct values.
Dapeng Sun / @sundapeng: Hi @xhochy,
"Can you describe a workload where this would bring a significant difference?"
In my case, the column values may be increasing or decreasing, but the difference between adjacent values is very small, so the dictionary IDs may also be adjacent or close together. If the IDs were encoded with delta encoding, I think it would save more disk space.
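A rough way to see the potential saving is to compare the bits needed per ID under plain bit-packing with the bits needed per delta. The sketch below is only an illustration under stated assumptions: it assumes dictionary IDs are assigned in order of first occurrence, it uses a simplified delta scheme (first value and minimum delta stored once in 32 bits each, every delta stored relative to the minimum delta), and the helper names `bitpacked_bits` and `delta_bits` are hypothetical, not part of any Parquet library.

```python
def bit_width(n: int) -> int:
    """Number of bits needed to represent the non-negative integer n."""
    return n.bit_length()


def bitpacked_bits(ids):
    """Total bits when every ID is packed at the width of the largest ID."""
    return len(ids) * bit_width(max(ids))


def delta_bits(ids):
    """Total bits for a simplified delta scheme (an assumption, not the
    exact Parquet layout): 32 bits for the first value, 32 bits for the
    minimum delta, then each delta stored as an offset from that minimum."""
    deltas = [b - a for a, b in zip(ids, ids[1:])]
    if not deltas:
        return 32
    min_delta = min(deltas)
    width = bit_width(max(deltas) - min_delta)
    return 32 + 32 + len(deltas) * width


# Hypothetical column: 1000 distinct, slowly increasing values, so the
# dictionary IDs (assigned in order of first occurrence) are 0..999.
ids = list(range(1000))

plain = bitpacked_bits(ids)   # 1000 IDs * 10 bits each
delta = delta_bits(ids)       # every delta is 1, so offsets need 0 bits

print(plain, delta)
```

In this idealized case the bit-packed IDs take 10,000 bits while the delta form takes 64. Real data with less regular steps would fall somewhere in between.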
Wes McKinney / @wesm: Since there are so many implementations of encoding and decoding the dictionary indices, changing the behavior of the encoder would likely not be possible without either introducing a new encoding type or reserving the change for a future major Parquet version.
Dapeng Sun / @sundapeng:
Hi @wesm, thank you for your comments. How about creating a new write version, such as PARQUET_3_0 or PARQUET_2_1? I think this optimization would be easy to put into a new WRITE_VERSION.
The IDs in Parquet dictionary encoding are written with RunLengthBitPackingHybridEncoder. RunLengthBitPackingHybridEncoder handles encoding with repeats (run-length) and bit-packing; we should improve it with a method like DeltaBinaryPackingWriter.
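To illustrate why the run-length/bit-packing hybrid gains little on adjacent IDs, here is a minimal sketch of how such an encoder might split its input. It is not the parquet-mr implementation: the function `split_runs` is hypothetical, and the threshold of 8 repeats before switching to a run-length segment is an assumption made for the example.

```python
from itertools import groupby

MIN_RLE_RUN = 8  # assumed threshold for emitting a run-length segment


def split_runs(ids):
    """Split IDs into ("rle", value, count) and ("packed", values) segments.

    Simplified model: a value repeated at least MIN_RLE_RUN times in a row
    becomes a run-length segment; everything else is collected for
    bit-packing at full width.
    """
    segments = []
    pending = []  # values waiting to be bit-packed
    for value, group in groupby(ids):
        count = len(list(group))
        if count >= MIN_RLE_RUN:
            if pending:
                segments.append(("packed", pending))
                pending = []
            segments.append(("rle", value, count))
        else:
            pending.extend([value] * count)
    if pending:
        segments.append(("packed", pending))
    return segments


repeated = [3] * 20 + [4] * 12    # long repeats: run-length works well
increasing = list(range(32))      # adjacent IDs: no repeats at all

print(split_runs(repeated))
print(split_runs(increasing))
```

With repeated IDs the input collapses into two run-length segments, while the increasing IDs end up in a single bit-packed segment at full width. That second case is the one where a delta-based writer could help.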
Reporter: Dapeng Sun / @sundapeng
Note: This issue was originally created as PARQUET-1059. Please see the migration documentation for further details.