apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
13.88k stars 3.38k forks source link

[C++] C++ IPC reading looks like it doesn't support uncompressed buffer convention for compressed buffers #28016

Open asfimport opened 3 years ago

asfimport commented 3 years ago

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.cc#L411 does seems to check for the case (I'm not sure if this is the right code though): uncompressed length may be set to -1 to indicate that the data that follows is not compressed, which can be useful for cases where compression does not yield appreciable savings.

https://github.com/apache/arrow/blob/5cabd31c90dbb32d87074928f68bf5d6e97e37c6/format/Message.fbs#L59

Reporter: Micah Kornfield / @emkornfield

Note: This issue was originally created as ARROW-12196. Please see the migration documentation for further details.

asfimport commented 3 years ago

Weston Pace / @westonpace: I know @lidavidm has been doing testing here recently, both compressed and uncompressed, so I feel this isn't quite right.  Although It's possible I am misunderstanding the implication.  I'm guessing David can provide a quick answer.

asfimport commented 3 years ago

David Li / @lidavidm: I think Micah is right - from the spec, there should be a check for uncompressed_length == -1 in which case we should assume the data is already uncompressed. That's distinct from specifying no compression in the first place. I don't think anything implements this optimization right now but you could imagine it being useful. (For instance: if you're transcoding Parquet to Arrow, you could use the Parquet metadata to infer when compression likely isn't worth it. Or you could compress the first few messages of a stream and toggle off compression for columns for which it doesn't seem to help.)

asfimport commented 3 years ago

Micah Kornfield / @emkornfield: Java can emit these values now. The check there is simpler. Try to compress, and if doesn't add anything retain the original buffer. For the write side in C++ since none of this originally was implemented I think we need to control any new functionality that makes use of this via a Write parameter, so we don't break existing clients for a while.