apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Selective compression on the wire #24984

Open asfimport opened 4 years ago

asfimport commented 4 years ago

Dask appears to compress selectively, only when compression is found to be useful. It takes a sample of roughly 10 kB up front to estimate the compression ratio, and compresses the whole batch only if the result is good. This seems to save decompression effort on the receiver side. Please take a look at https://blog.dask.org/2016/04/14/dask-distributed-optimizing-protocol#problem-3-unwanted-compression. I thought this could be relevant to Arrow batch transfers as well.
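For illustration, the sampling idea described above might look roughly like the sketch below. This is not Dask's or Arrow's actual code: `maybe_compress`, `SAMPLE_SIZE` and `MAX_RATIO` are hypothetical names, `zlib` stands in for whatever codec the transport really uses, and only the first bytes of the payload are sampled.

```python
import zlib
from typing import Tuple

SAMPLE_SIZE = 10_000  # roughly the 10 kB sample mentioned above
MAX_RATIO = 0.9       # assumed threshold: the sample must shrink by at least 10%


def maybe_compress(payload: bytes) -> Tuple[bool, bytes]:
    """Return (is_compressed, data) for one payload."""
    sample = payload[:SAMPLE_SIZE]
    if not sample:
        return False, payload
    # Cheap trial: compress only the sample and look at the ratio.
    if len(zlib.compress(sample)) > MAX_RATIO * len(sample):
        return False, payload  # poor ratio, send the bytes as they are
    compressed = zlib.compress(payload)
    # The prefix can be misleading, so re-check on the full payload.
    if len(compressed) >= len(payload):
        return False, payload
    return True, compressed
```

The receiver only needs the boolean flag to know whether to decompress, so incompressible payloads cost nothing on the receiving side.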

Reporter: Amol Umbarkar / @mindhash

Note: This issue was originally created as ARROW-8845. Please see the migration documentation for further details.

asfimport commented 4 years ago

Amol Umbarkar / @mindhash: Response from Wes: thanks for pointing that out. Such a heuristic (observing compression ratios of stream messages) could be implemented at some point so that compression could be toggled off mid-stream if it doesn't seem to be helping. Feel free to open a JIRA issue about this

I just opened https://issues.apache.org/jira/browse/ARROW-8823 since we don't track "what the uncompressed size would have been" without compression turned on.
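A minimal sketch of the mid-stream heuristic Wes describes, assuming the writer can see both the compressed and uncompressed size of each message while compression is on. `AdaptiveStreamCompressor`, `window` and `max_ratio` are made-up names for illustration, and `zlib` is only a stand-in codec; nothing here reflects an existing Arrow API.

```python
import zlib
from collections import deque
from typing import Tuple


class AdaptiveStreamCompressor:
    """Turns compression off once recent messages stop compressing well."""

    def __init__(self, window: int = 8, max_ratio: float = 0.9) -> None:
        self.ratios = deque(maxlen=window)  # recent compressed/uncompressed ratios
        self.max_ratio = max_ratio          # assumed cut-off for "not helping"
        self.enabled = True

    def encode(self, message: bytes) -> Tuple[bool, bytes]:
        """Return (is_compressed, data) for one stream message."""
        if not self.enabled or not message:
            return False, message
        compressed = zlib.compress(message)
        self.ratios.append(len(compressed) / len(message))
        # Decide only once a full window of observations is available.
        window_full = len(self.ratios) == self.ratios.maxlen
        if window_full and sum(self.ratios) / len(self.ratios) > self.max_ratio:
            self.enabled = False  # applies from the next message onwards
        if len(compressed) >= len(message):
            return False, message
        return True, compressed
```

In this simple form the switch is one-way: once compression is off, the ratio is no longer observed, so re-enabling it would need occasional trial compressions.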

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: One limitation is that compression is enabled for entire record batches, but it's quite conceivable that some fields or even individual buffers would compress very well, but others not.
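To make the per-buffer idea concrete, here is a rough sketch that decides independently for each buffer of a batch. It treats a record batch body as a plain list of byte buffers, which is a simplification; `compress_buffers`, `decompress_buffers` and `max_ratio` are hypothetical, and `zlib` is a stand-in for the real codec.

```python
import zlib
from typing import List, Tuple


def compress_buffers(
    buffers: List[bytes], max_ratio: float = 0.9
) -> List[Tuple[bool, bytes]]:
    """Compress each buffer only if that particular buffer shrinks enough."""
    encoded = []
    for buf in buffers:
        if buf:
            compressed = zlib.compress(buf)
            if len(compressed) <= max_ratio * len(buf):
                encoded.append((True, compressed))
                continue
        # Empty or poorly compressing buffer: keep the raw bytes.
        encoded.append((False, buf))
    return encoded


def decompress_buffers(encoded: List[Tuple[bool, bytes]]) -> List[bytes]:
    """Inverse of compress_buffers, driven by the per-buffer flag."""
    return [zlib.decompress(data) if flag else data for flag, data in encoded]
```

With this layout a mostly-zero validity bitmap could be compressed while a high-entropy data buffer in the same batch is passed through untouched.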

cc @emkornfield   @lidavidm