apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Move decompression off background reader thread into thread pool #27163

Open asfimport opened 3 years ago

asfimport commented 3 years ago

When reading a compressed stream, a fairly decent amount of CPU time is spent decompressing that stream.  While that decompression is happening we could be fetching the next block.  However, the current implementation does both the reading and the decompressing on the same background reader thread, so the next block will not be fetched until the prior block has finished decompressing.

There is still "some" ordering here: it isn't a fan-out, and decompression of the blocks has to happen in sequence, but there is still some gain to be had.
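To make the overlap concrete, here is a minimal sketch of the idea (not the actual Arrow implementation; it uses plain `std::async` rather than Arrow's thread pool and futures, and `read_block` / `decompress_block` are hypothetical stand-ins for the real I/O and codec calls). The read of block N+1 is kicked off before block N is decompressed, while decompressed blocks are still produced strictly in order:

```cpp
#include <functional>
#include <future>
#include <optional>
#include <string>
#include <vector>

// Sketch: overlap fetching block N+1 with decompressing block N.
// read_block returns the next raw (compressed) block, or nullopt at EOF.
std::vector<std::string> ReadAllDecompressed(
    std::function<std::optional<std::string>()> read_block,
    std::function<std::string(const std::string&)> decompress_block) {
  std::vector<std::string> out;
  // Kick off the first read on a worker thread.
  std::future<std::optional<std::string>> next_read =
      std::async(std::launch::async, read_block);
  while (true) {
    std::optional<std::string> raw = next_read.get();
    if (!raw) break;
    // Start fetching the following block before decompressing this one,
    // so I/O and decompression overlap instead of running serially.
    next_read = std::async(std::launch::async, read_block);
    // Decompression still happens strictly in block order.
    out.push_back(decompress_block(*raw));
  }
  return out;
}
```

In the actual proposal the decompression work would presumably be submitted to the CPU thread pool and chained with futures, but the overlap between I/O and decompression is the same.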

I created a simple example with gzip here (https://github.com/westonpace/arrow/tree/feature/async-compressed-csv), and you can test it with the attached example program.

On my system, when reading a 250MB gzipped CSV file, there is roughly a 5% speedup if the file is cached in the OS (6.3s -> 6.0s) and a 10% to 15% speedup if the file is not cached in the OS (~6.8s -> 6.0s).

The example requires changing the table reader implementation to receive an async generator.  I think, in practice, we will want to change it to take an async input stream instead.  So this may need to wait until/if we decide to expand the async paradigm into the I/O interfaces.
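As a rough illustration of that interface change (the names `Buffer`, `BlockGenerator`, and `CsvTableReader` below are illustrative, not the actual Arrow C++ API), an "async generator" is just a callable that yields a future for the next decompressed block:

```cpp
#include <cstdint>
#include <functional>
#include <future>
#include <optional>
#include <utility>
#include <vector>

using Buffer = std::vector<uint8_t>;

// Each invocation returns a future for the next decompressed block,
// or std::nullopt once the stream is exhausted.
using BlockGenerator = std::function<std::future<std::optional<Buffer>>()>;

class CsvTableReader {
 public:
  // Accepting a generator rather than a blocking input stream lets the
  // decompression step run upstream on a thread pool, while the reader
  // simply awaits already-decompressed blocks.
  explicit CsvTableReader(BlockGenerator blocks) : blocks_(std::move(blocks)) {}

 private:
  BlockGenerator blocks_;
};
```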

Reporter: Weston Pace / @westonpace

Original Issue Attachments:

Note: This issue was originally created as ARROW-11262. Please see the migration documentation for further details.

asfimport commented 3 years ago

Wes McKinney / @wesm: This same phenomenon is found in many other places in the codebase (notably in IPC write-with-compression and read-with-compression). Rearchitecting everything around async where possible seems like the right path (I think there are various Jira issues citing specific cases like these).