apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Avoid Hadoop interfaces and classes in codecs #2818

Open asfimport opened 9 months ago

asfimport commented 9 months ago

Currently, the codecs implemented by Parquet implement the Hadoop Configurable and CompressionCodec interfaces. As part of the effort to decouple from Hadoop, there need to be alternatives to these Hadoop-based implementations so that users are not forced to load Hadoop classes at runtime just to use a codec.

Reporter: Atour Mousavi Gourabi / @amousavigourabi

Note: This issue was originally created as PARQUET-2353. Please see the migration documentation for further details.

asfimport commented 9 months ago

Fokko Driesprong / @Fokko: Can you double check if this is still the case with the latest Parquet release? I did some relevant work a while ago: https://github.com/apache/parquet-mr/pull/1074

asfimport commented 9 months ago

Atour Mousavi Gourabi / @amousavigourabi: Hi Fokko, as far as I'm aware, https://github.com/apache/parquet-mr/pull/1074 makes it possible to avoid directly instantiating a Hadoop-based CompressionCodecFactory when reading, iff the user passes their own factory. Currently, however, we do not have any Hadoop-free CompressionCodecFactory implementations AFAIK (both CodecFactory and DirectCodecFactory will have to deal with a Hadoop CompressionCodec at some point). As for the specific codecs, CompressionCodecName refers to 4 codecs from Hadoop itself and 3 that are implemented in Parquet but still implement both the Configurable and CompressionCodec interfaces from Hadoop. As I see it, this means the user would have to implement quite a bit of this themselves, which is a pretty big ask. If nobody minds, I'd like to work on this after https://github.com/apache/parquet-mr/pull/1141 is taken care of.
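
For illustration only, here is a rough sketch of what a codec with no Hadoop types on its surface could look like. It is not Parquet's API and not the design proposed in this issue: the names PlainCodec, JdkGzipCodec, and CodecSketch are hypothetical, and GZIP via java.util.zip is used simply because it ships with the JDK. The point is that neither Configurable nor CompressionCodec needs to be loaded for a round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Small driver showing a compress/decompress round trip. */
class CodecSketch {
  public static void main(String[] args) {
    PlainCodec codec = new JdkGzipCodec();
    byte[] input = "parquet page bytes".getBytes(StandardCharsets.UTF_8);
    byte[] restored = codec.decompress(codec.compress(input));
    System.out.println(Arrays.equals(input, restored));
  }
}

/** Hypothetical Hadoop-free codec contract (assumption, not an existing Parquet interface). */
interface PlainCodec {
  byte[] compress(byte[] raw);

  byte[] decompress(byte[] compressed);
}

/** GZIP codec built only on java.util.zip, so no Hadoop classes are referenced. */
final class JdkGzipCodec implements PlainCodec {
  @Override
  public byte[] compress(byte[] raw) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    // Closing the GZIP stream writes the trailer, so read the buffer only afterwards.
    try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
      gzip.write(raw);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  @Override
  public byte[] decompress(byte[] compressed) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      byte[] chunk = new byte[4096];
      int read;
      // Copy until the stream reports end of data (-1).
      while ((read = gzip.read(chunk)) != -1) {
        buffer.write(chunk, 0, read);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }
}
```

A real replacement would still need to plug into whatever factory the reader and writer accept, and would have to cover every entry in CompressionCodecName; this sketch only shows the shape of a single codec.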