apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

Change the default CompressionCodec.Factory to leverage compression support transparently #43469

Closed: ccciudatu closed this issue 3 months ago

ccciudatu commented 3 months ago

Describe the enhancement requested

Application code currently has to choose upfront between handling compressed and uncompressed data by specifying one of two mutually exclusive CompressionCodec.Factory implementations: NoCompressionCodec.Factory or CommonsCompressionFactory.

While this is acceptable (or even required) on the write path (e.g. ArrowWriter), it makes supporting compression on the read path really tedious: a reader such as an Arrow Flight client app cannot reasonably commit to uncompressed-only or compressed-only data, since it has to decode whatever the stream contains. As already reported in https://github.com/apache/arrow/issues/41457, the Java FlightClient currently fails with the following error when trying to decode a compressed stream:

java.lang.IllegalArgumentException: Please add arrow-compression module to use CommonsCompressionFactory for LZ4_FRAME
    at org.apache.arrow.vector.compression.NoCompressionCodec$Factory.createCodec(NoCompressionCodec.java:63)
    at org.apache.arrow.vector.compression.CompressionCodec$Factory$1.createCodec(CompressionCodec.java:91)
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:79)
    at org.apache.arrow.flight.FlightStream.next(FlightStream.java:275)

The FlightStream class does not explicitly pass a compression codec factory when creating a VectorLoader, which then uses the default NoCompressionCodec.Factory. Changing the default to CommonsCompressionFactory is not an option because:

  1. CommonsCompressionFactory does not support uncompressed data
  2. arrow-compression is not a dependency for arrow-vector

Instead of challenging these two design decisions, the proposed solution (upcoming PR) is to make the default CompressionCodec.Factory use a ServiceLoader to gather all the available implementations and combine them to support as many CodecTypes as possible, falling back to NoCompressionCodec.Factory.INSTANCE (i.e. the same default as today).
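To make the idea concrete, here is a minimal, self-contained sketch of such a combining factory. It does not use the real Arrow classes: CodecType, Codec, CodecFactory, NO_COMPRESSION_ONLY and COMPRESSED_ONLY are hypothetical stand-ins, and probing a factory by calling createCodec and catching IllegalArgumentException is an assumption made for illustration. The actual PR may combine the factories differently.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.ServiceLoader;

/** Sketch only: simplified stand-ins, not the real Arrow classes. */
class DefaultFactorySketch {

  /** Stand-in for Arrow's codec type enum. */
  enum CodecType { NO_COMPRESSION, LZ4_FRAME, ZSTD }

  /** Stand-in for a codec; a real codec would compress and decompress buffers. */
  interface Codec {
    CodecType type();
  }

  /** Stand-in for CompressionCodec.Factory. */
  interface CodecFactory {
    /** Throws IllegalArgumentException if the type is not supported. */
    Codec createCodec(CodecType type);
  }

  /** Mirrors today's default: handles uncompressed data only. */
  static final CodecFactory NO_COMPRESSION_ONLY = type -> {
    if (type != CodecType.NO_COMPRESSION) {
      throw new IllegalArgumentException("No codec available for " + type);
    }
    return () -> CodecType.NO_COMPRESSION;
  };

  /** Hypothetical provider playing the role of arrow-compression: compressed types only. */
  static final CodecFactory COMPRESSED_ONLY = type -> {
    if (type == CodecType.NO_COMPRESSION) {
      throw new IllegalArgumentException("Uncompressed data is not supported");
    }
    return () -> type;
  };

  /** Picks, per codec type, the first factory able to create a codec for it. */
  static CodecFactory combine(Iterable<CodecFactory> providers) {
    List<CodecFactory> candidates = new ArrayList<>();
    providers.forEach(candidates::add);
    candidates.add(NO_COMPRESSION_ONLY); // the fallback is always tried last

    Map<CodecType, CodecFactory> byType = new EnumMap<>(CodecType.class);
    for (CodecType type : CodecType.values()) {
      for (CodecFactory candidate : candidates) {
        if (supports(candidate, type)) {
          byType.put(type, candidate);
          break;
        }
      }
    }
    return type -> {
      CodecFactory factory = byType.get(type);
      if (factory == null) {
        throw new IllegalArgumentException("No codec available for " + type);
      }
      return factory.createCodec(type);
    };
  }

  /** Assumption: a factory signals "unsupported" with IllegalArgumentException. */
  private static boolean supports(CodecFactory factory, CodecType type) {
    try {
      factory.createCodec(type);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  /** Default factory: whatever ServiceLoader discovers, plus the fallback. */
  static CodecFactory loadDefault() {
    return combine(ServiceLoader.load(CodecFactory.class));
  }

  public static void main(String[] args) {
    // Simulate a provider being present by passing it in directly.
    CodecFactory combined = combine(List.of(COMPRESSED_ONLY));
    for (CodecType type : CodecType.values()) {
      System.out.println(type + " -> " + combined.createCodec(type).type());
    }
  }
}
```

With no provider registered, loadDefault() in this sketch behaves like today's default and rejects LZ4_FRAME. Once a provider is discoverable, the same call covers the compressed types too, without any change in the calling code.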

The arrow-compression module would then act as a service provider, so that whenever it's present on the module path (or class path), it transparently fills in the gaps of the default factory. As a side note, this is the literal meaning of the above error message ("Please add arrow-compression module to use CommonsCompressionFactory"), which suggests this may have been the original intention.

Component(s)

FlightRPC, Java

danepitkin commented 3 months ago

Issue resolved by pull request 43471 https://github.com/apache/arrow/pull/43471

ccciudatu commented 3 months ago

Issue resolved by pull request 43471 #43471

@danepitkin this also applies to https://github.com/apache/arrow/issues/41457