apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.35k stars 3.49k forks source link

[C++][Compute] Pre-solve chunked indices before merging chunks in Sort kernels #44084

Open pitrou opened 2 weeks ago

pitrou commented 2 weeks ago

Describe the enhancement requested

In the chunked sort kernels (for ChunkedArray and Table), the most expensive step can be the recursive merge of sorted chunks after each individual chunk was sorted.

Currently, this merge step resolves chunked indices every time an access is made to read a value. This means chunked resolution is computed O(n*log2(k)) times (where n is the input length and k is the number of chunks).

However, we could instead compute chunked indices after sorting the individual chunks. Then there would be no chunk resolution when merging, just direct accesses through ResolvedChunks.

Component(s)

C++

pitrou commented 2 weeks ago

cc @felipecrv

pitrou commented 2 weeks ago

Two potential downsides to this approach: 1) the size taken by those temporary ResolvedChunks is twice the size of indices, hence a bigger CPU cache footprint 2) there has to be a final "reverse resolution" step where we convert back the sorted ResolvedChunks into absolute indices... or we maintain those absolute indices along the ResolvedChunk, which implies an ever bigger cache footprint

Experimenting will tell whether this can be beneficial.

pitrou commented 1 week ago

I made some initial experiments on this and I came to the following conclusion:

  1. The performance is a mixed bag, with some non-negligible speedups on small input sizes (32k rows in the sort benchmarks) but also apparent slowdowns on larger inputs (8M rows). This is probably a combination of 1) allocation cost, since 16 bytes per input row are allocated for a int64_t pair 2) increased memory footprint and decreased cached efficiency, both because of enlarged indices and the temporary memory area
  2. Therefore, further exploration should go towards 1) compressing resolved indices to make them fit in 64 bits (e.g. 20 bits of chunk_index, 44 bits of index_in_chunk) 2) transforming the logical indices to physical in place before merging the chunks, and transforming them back to physical in place after merging

I might dedicate some time to this.