Open asfimport opened 8 years ago
Ashish Singh / @SinghAsDev: This seems like a reasonable thing to have. I am planning to evaluate zstd as a compression codec for Parquet. @rdblue @julienledem any thoughts?
Cotton Seed: We find lz4 gives similar compression and is about 20% faster for our application. In addition to zstd, I'm sure there is interest in other new compression algorithms, like brotli. It would seem natural for Parquet to work with any Hadoop compression codec. I can work up a patch if there would be interest in accepting it.
Wes McKinney / @wesm: The format also provides for Brotli compression: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L331
I am sure that LZ4 and zstd would be welcome additions – at least on the C++ side adding these would not cause us much hardship (we have added Brotli support already)
Uwe Korn / @xhochy: Adding them to parquet-cpp and parquet-format is easy; the harder part from my side is adding them to Hadoop as codecs so they can be used in parquet-mr. At least for zstd, this seems to be done already: https://issues.apache.org/jira/browse/HADOOP-13578
I understand that the list of accepted compression codecs is explicitly limited to uncompressed, snappy, gzip, and lzo. (See parquet.hadoop.metadata.CompressionCodecName.java) Is there a reason for this? Or is there an easy workaround? On the surface it seems like an unnecessary restriction.
I ask because I have written a custom codec to implement encryption and I'm unable to use it with Parquet, which is a real shame because it is the main storage format I was hoping to use.
Other thoughts on how to implement encryption in Parquet with this limitation?
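To illustrate the restriction being discussed: parquet-mr resolves codecs through a fixed enum rather than an open registry, so any codec class outside the enum is rejected. The sketch below is a simplified, hypothetical model of that lookup (the enum values match the codecs listed above, but the class shape and `fromClassName` helper are assumptions, not the actual parquet-mr source).

```java
// Simplified model of a closed codec enum, loosely patterned on
// parquet.hadoop.metadata.CompressionCodecName (names here are illustrative).
enum CodecName {
    UNCOMPRESSED(null),
    SNAPPY("org.apache.hadoop.io.compress.SnappyCodec"),
    GZIP("org.apache.hadoop.io.compress.GzipCodec"),
    LZO("com.hadoop.compression.lzo.LzoCodec");

    private final String hadoopClass;

    CodecName(String hadoopClass) {
        this.hadoopClass = hadoopClass;
    }

    // Lookup by Hadoop codec class name: anything not in the enum fails,
    // which is why a custom (e.g. encrypting) codec cannot be plugged in.
    static CodecName fromClassName(String cls) {
        for (CodecName c : values()) {
            if (cls != null && cls.equals(c.hadoopClass)) {
                return c;
            }
        }
        throw new IllegalArgumentException("Unknown codec: " + cls);
    }
}

public class CodecLookup {
    public static void main(String[] args) {
        // A known Hadoop codec maps to an enum value.
        System.out.println(CodecName.fromClassName(
            "org.apache.hadoop.io.compress.GzipCodec"));
        // A custom codec class (hypothetical name) is rejected.
        try {
            CodecName.fromClassName("com.example.MyEncryptingCodec");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected");
        }
    }
}
```

Because the enum is part of the file metadata contract (the codec id is written into the Parquet footer), opening it up would also require a format-level change, which is presumably why the list is closed.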
Reporter: Steven Anton
Note: This issue was originally created as PARQUET-678. Please see the migration documentation for further details.