apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Allow for custom compression codecs #1988

Open asfimport opened 8 years ago

asfimport commented 8 years ago

I understand that the list of accepted compression codecs is explicitly limited to uncompressed, snappy, gzip, and lzo. (See parquet.hadoop.metadata.CompressionCodecName.java) Is there a reason for this? Or is there an easy workaround? On the surface it seems like an unnecessary restriction.
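To illustrate the restriction described above, here is a minimal sketch of how a closed enum of codec names rejects anything not listed in it. This is not the actual parquet-java source: `ClosedCodecEnumSketch`, `CodecName`, `fromConf`, `isSupported`, and the `example.codec.*` class names are hypothetical placeholders chosen for illustration.

```java
import java.util.Locale;

public class ClosedCodecEnumSketch {

    // Hypothetical stand-in for a closed codec enum. The class-name strings
    // are placeholders, not the names used by parquet-java or Hadoop.
    public enum CodecName {
        UNCOMPRESSED(null),
        SNAPPY("example.codec.SnappyCodec"),
        GZIP("example.codec.GzipCodec"),
        LZO("example.codec.LzoCodec");

        private final String codecClassName;

        CodecName(String codecClassName) {
            this.codecClassName = codecClassName;
        }

        public String codecClassName() {
            return codecClassName;
        }

        // Resolve a user-supplied codec name. Because the lookup goes through
        // the enum, a name outside the fixed list throws IllegalArgumentException.
        public static CodecName fromConf(String name) {
            if (name == null) {
                return UNCOMPRESSED;
            }
            return valueOf(name.toUpperCase(Locale.ROOT));
        }
    }

    // Returns true only for names that map onto one of the enum constants.
    public static boolean isSupported(String name) {
        try {
            CodecName.fromConf(name);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isSupported("gzip"));                // prints "true"
        System.out.println(isSupported("my-encrypting-codec")); // prints "false"
    }
}
```

In a design like this, supporting a new codec means adding an enum constant and shipping a new release, which is why a user-written codec cannot simply be plugged in through configuration.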

I ask because I have written a custom codec to implement encryption and I'm unable to use it with Parquet, which is a real shame because it is the main storage format I was hoping to use.

Other thoughts on how to implement encryption in Parquet with this limitation?

Reporter: Steven Anton

Note: This issue was originally created as PARQUET-678. Please see the migration documentation for further details.

asfimport commented 7 years ago

Ashish Singh / @SinghAsDev: This seems like a reasonable thing to have. I am planning to evaluate zstd as a compression codec for Parquet. @rdblue @julienledem any thoughts?

asfimport commented 7 years ago

Cotton Seed: We find lz4 gives similar compression and is about 20% faster for our application. In addition to zstd, I'm sure there is interest in other new compression algorithms, like brotli. It would seem natural for Parquet to work with any Hadoop compression codec. I can work up a patch if there would be interest in accepting it.

asfimport commented 7 years ago

Wes McKinney / @wesm: The format also provides for Brotli compression: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L331

I am sure that LZ4 and zstd would be welcome additions – at least on the C++ side, adding these would not cause us much hardship (we have added Brotli support already).

asfimport commented 7 years ago

Uwe Korn / @xhochy: Adding them to parquet-cpp and parquet-format is easy. The only thing that looks a bit harder from my side is adding them to Hadoop as codecs so they can be used in parquet-mr. At least for Zstd, this seems to be done already: https://issues.apache.org/jira/browse/HADOOP-13578

asfimport commented 7 years ago

Uwe Korn / @xhochy: [~cotton] A patch would be very welcome. I can help with that on the C++ side once we have a Java patch available.

asfimport commented 6 years ago

Ryan Blue / @rdblue: I think custom codecs are a bad idea. Supporting arbitrary codecs will only cause compatibility issues, so I recommend we implement a small set, probably just adding brotli and zstd.