4ertus2 / ArrowHouse

Some ClickHouse code over Apache Arrow primitives
Apache License 2.0
2 stars 1 forks source link

Improve MergeSorting / MergingSorted streams #1

Open 4ertus2 opened 7 months ago

4ertus2 commented 7 months ago

MergeSorting's input blocks could be already sorted and splitted into not overlapped subsets. MergeSorting could find them by first and last values (min-max) and split into concatinated stream that contains either OneBlock stream either MeringSorted one.

Spitting could be done by DSU structure. We could get all blocks and separete them into subsets by their min-max elements. Updating and keeping subset's min-maxes. In worst case we would have only one subset. It's the case how MergeSorting stream works now: combine all blocks into one MergeSorting stream.

4ertus2 commented 7 months ago

The same DSU idea could be also integrated into MergeSorting algo. Extracting new block from streams we could place them into sets by min-max values and merge overlapping blocks only.

It probably should be a separate DSUMergingSorted stream that groupping blocks before merging them. If we implement it we probable do not need MergeSorting adoption. But just use DSUMergingSortedStream instead of MergingSorted inside.