apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

bzip2 compression #2062

Open asfimport opened 7 years ago

asfimport commented 7 years ago

Hi,

I have a requirement to write Parquet with bzip2 compression because bzip2 is splittable. Right now, bzip2 is not available as an option in Pig:

SET parquet.compression none/gzip/SNAPPY;

Is there any way to use bzip2 compression with Parquet?

Reporter: Rajasekhar Konda

Note: This issue was originally created as PARQUET-1011. Please see the migration documentation for further details.

asfimport commented 7 years ago

Uwe Korn / @xhochy: We can add bzip2 to Parquet, but this would only change the compression; it would have no effect on splittability. By design, Parquet files are always splittable, independent of the compression algorithm used. In particular, GZIP-compressed Parquet files are splittable too. In your case, it is probably easier to stick with GZIP than to implement bzip2 in Parquet.
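
To illustrate why the codec does not affect splittability: Parquet applies compression to the pages inside each row group rather than to the file as a whole, so a reader can locate a row group from the footer metadata and decompress only that part. The following is a minimal Python sketch of that idea only, not Parquet's actual file layout. The names `write_blocks` and `read_block` and the `(offset, length)` index are hypothetical stand-ins for row groups and footer metadata, and gzip is used simply because it is not splittable as a whole-file stream.

```python
import gzip


def write_blocks(blocks):
    """Compress each block on its own and record where it lands.

    Returns the concatenated compressed bytes plus an index of
    (offset, length) pairs, a stand-in for footer metadata.
    """
    body = bytearray()
    index = []
    for raw in blocks:
        compressed = gzip.compress(raw)
        index.append((len(body), len(compressed)))
        body.extend(compressed)
    return bytes(body), index


def read_block(body, index, i):
    """Decompress block i alone, using only its index entry."""
    offset, length = index[i]
    return gzip.decompress(body[offset:offset + length])


if __name__ == "__main__":
    blocks = [f"row group {n} ".encode() * 100 for n in range(3)]
    body, index = write_blocks(blocks)
    # A worker assigned only block 1 never has to read blocks 0 or 2.
    assert read_block(body, index, 1) == blocks[1]
```

Because every block is compressed independently, the split boundaries come from the index, not from the codec, which is why swapping gzip for bzip2 here would not change which splits are possible.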

Still, it would be nice to see how bzip2 performs against the currently implemented GZIP/Snappy/Brotli codecs.

asfimport commented 7 years ago

Ryan Blue / @rdblue: We're trying to keep the number of codecs to a minimum, and I don't think the performance of bzip2 justifies adding it to the small set of codecs in the spec, compared to newer codecs like Brotli and zstd.