Open asfimport opened 3 years ago
Weston Pace / @westonpace: I know @lidavidm has been doing testing here recently, both compressed and uncompressed, so I feel this isn't quite right. Although It's possible I am misunderstanding the implication. I'm guessing David can provide a quick answer.
David Li / @lidavidm:
I think Micah is right - from the spec, there should be a check for uncompressed_length == -1
in which case we should assume the data is already uncompressed. That's distinct from specifying no compression in the first place. I don't think anything implements this optimization right now but you could imagine it being useful. (For instance: if you're transcoding Parquet to Arrow, you could use the Parquet metadata to infer when compression likely isn't worth it. Or you could compress the first few messages of a stream and toggle off compression for columns for which it doesn't seem to help.)
Micah Kornfield / @emkornfield: Java can emit these values now. The check there is simpler. Try to compress, and if doesn't add anything retain the original buffer. For the write side in C++ since none of this originally was implemented I think we need to control any new functionality that makes use of this via a Write parameter, so we don't break existing clients for a while.
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.cc#L411 does seems to check for the case (I'm not sure if this is the right code though): uncompressed length may be set to -1 to indicate that the data that follows is not compressed, which can be useful for cases where compression does not yield appreciable savings.
https://github.com/apache/arrow/blob/5cabd31c90dbb32d87074928f68bf5d6e97e37c6/format/Message.fbs#L59
Reporter: Micah Kornfield / @emkornfield
Note: This issue was originally created as ARROW-12196. Please see the migration documentation for further details.