Open pitrou opened 2 weeks ago
cc @felipecrv
Two potential downsides to this approach:
1) the size taken by those temporary `ResolvedChunk`s is twice the size of indices, hence a bigger CPU cache footprint
2) there has to be a final "reverse resolution" step where we convert the sorted `ResolvedChunk`s back into absolute indices... or we maintain those absolute indices alongside the `ResolvedChunk`s, which implies an even bigger cache footprint
Experimenting will tell whether this can be beneficial.
I made some initial experiments on this and I came to the following conclusion: replacing the `int64_t` logical indices with `(chunk_index, index_in_chunk)` pairs leads to an increased memory footprint and decreased cache efficiency, both because of the enlarged indices and the temporary memory area.

Two possible mitigations:
1) packing the pair into a single 64-bit value (e.g. 20 bits of `chunk_index`, 44 bits of `index_in_chunk`)
2) transforming the logical indices to physical in place before merging the chunks, and transforming them back to logical in place after merging

I might dedicate some time to this.
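The first mitigation, packing the pair into a single 64-bit word, could look like the sketch below. The split (20 bits of chunk index, 44 bits of index in chunk) follows the comment above; the helper names are hypothetical, not an existing Arrow API.

```cpp
#include <cstdint>

// Hypothetical packing of a (chunk_index, index_in_chunk) pair into one
// 64-bit value: 20 high bits for the chunk, 44 low bits for the offset.
// This caps inputs at ~1M chunks and ~17.6 trillion rows per chunk.
constexpr int kIndexBits = 44;
constexpr uint64_t kIndexMask = (uint64_t{1} << kIndexBits) - 1;

constexpr uint64_t Pack(uint64_t chunk_index, uint64_t index_in_chunk) {
  return (chunk_index << kIndexBits) | (index_in_chunk & kIndexMask);
}
constexpr uint64_t ChunkIndex(uint64_t packed) { return packed >> kIndexBits; }
constexpr uint64_t IndexInChunk(uint64_t packed) { return packed & kIndexMask; }
```

This keeps the per-element size identical to the current `int64_t` indices, so the cache footprint does not grow, at the cost of a shift and mask on every access.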
Describe the enhancement requested
In the chunked sort kernels (for ChunkedArray and Table), the most expensive step can be the recursive merge of sorted chunks after each individual chunk was sorted.
Currently, this merge step resolves chunked indices every time an access is made to read a value. This means chunked resolution is computed O(n * log2(k)) times (where `n` is the input length and `k` is the number of chunks).

However, we could instead compute chunked indices after sorting the individual chunks. Then there would be no chunk resolution when merging, just direct accesses through `ResolvedChunk`s.

Component(s)
C++
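For context, the per-access resolution that the merge step currently pays can be sketched as a binary search over cumulative chunk offsets. This is a simplified stand-in, not Arrow's actual `ChunkResolver`; the `Resolved` struct and `Resolve` helper are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified result of resolving a logical index into a chunked array.
struct Resolved {
  int64_t chunk;
  int64_t index;
};

// Map a logical index to a (chunk, offset-in-chunk) pair by binary search
// over cumulative chunk offsets. chunk_offsets holds the start of each
// chunk plus the total length as a sentinel, e.g. {0, 4, 10} for two
// chunks of lengths 4 and 6. The caller must pass logical < total length.
Resolved Resolve(const std::vector<int64_t>& chunk_offsets, int64_t logical) {
  auto it = std::upper_bound(chunk_offsets.begin(), chunk_offsets.end(), logical);
  int64_t chunk = (it - chunk_offsets.begin()) - 1;
  return {chunk, logical - chunk_offsets[chunk]};
}
```

Each call is O(log k), and the merge performs O(n * log2(k)) value reads, which is what motivates resolving all indices once, up front, instead.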