apache / arrow


[C++] Investigate radix sort for integer arrays #26832

Open asfimport opened 3 years ago

asfimport commented 3 years ago

For integer arrays with a non-tiny range of values, we currently use a stable sort. It may be faster to use a radix sort instead.

Reporter: Antoine Pitrou / @pitrou

Note: This issue was originally created as ARROW-10899. Please see the migration documentation for further details.

asfimport commented 3 years ago

Kirill Lykov / @KirillLykov: Sounds interesting to me, I would like to have a look. Could you give some more details on where to start?

asfimport commented 3 years ago

Yibo Cai / @cyb70289: @pitrou, please correct me if I missed something.

Code: vector_sort.cc
Unit test: vector_sort_test.cc
Benchmark: vector_sort_benchmark.cc

Please note the current code supports sorting multiple columns, so it looks a bit complex. I think we can start with a single-array sorter. Specifically, the code line below calls std::stable_sort: https://github.com/apache/arrow/blob/a321cdedb75b27b669995065ffe5596b3cfb3ae4/cpp/src/arrow/compute/kernels/vector_sort.cc#L378
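
For orientation, a minimal sketch of the pattern at that line (names like `SortToIndices` are made up here, not Arrow's actual API): the kernel sorts an array of indices (an argsort) with std::stable_sort, comparing the values the indices point to, so equal values keep their original order.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Illustrative single-array argsort: sort indices, not values.
// Stability means ties keep their original relative order.
std::vector<uint64_t> SortToIndices(const std::vector<int32_t>& values) {
  std::vector<uint64_t> indices(values.size());
  std::iota(indices.begin(), indices.end(), uint64_t{0});
  std::stable_sort(indices.begin(), indices.end(),
                   [&](uint64_t l, uint64_t r) { return values[l] < values[r]; });
  return indices;
}
```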

asfimport commented 3 years ago

Kirill Lykov / @KirillLykov: Thanks for the clarification, I will check it out next week.

asfimport commented 3 years ago

Kirill Lykov / @KirillLykov:

  1. From the paper "Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort" by Satish et al.: they compare, among other things, the performance of radix sort and merge sort using intrinsics. Both implementations are highly optimized for SIMD architectures, and the radix sort has many non-obvious improvements to make it more cache-friendly. Radix sort performance depends on the key size, because the longer the key, the more passes we need (e.g. with 8-bit digits, a 32-bit key takes 4 counting passes while a 64-bit key takes 8). Radix becomes bandwidth-bound starting at 48-bit keys. Further, Satish et al. report that merge sort is 1.7X slower than radix sort on 32-bit keys, but becomes comparable in performance to radix on 64-bit keys and 1.3X better on 128-bit keys. Their conclusion is that "bandwidth-oblivious SIMD friendly merge sort, should, therefore, be the sorting method of choice for future databases".

  2. I haven't found a good open-source radix sort for CPUs. There is one in boost called spreadsort, which looks promising. Alternatively, one could implement the Satish paper, but that code would be difficult to support.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: @KirillLykov Feel free to experiment, but note that adding SIMD into the mix will add a layer of complication. For a non-SIMD implementation, my intuition is that a radix sort could be significantly faster on large data than the current merge sort (a.k.a. std::stable_sort).

asfimport commented 3 years ago

Kirill Lykov / @KirillLykov: Your intuition is correct, at least for int32_t keys. Separately from Arrow, I benchmarked std::sort, std::stable_sort, boost::spreadsort, and a naive radix sort of my own. Surprisingly to me, even the naive radix sort is faster than the std sorts, but boost's spreadsort is better still. The Y-axis is milliseconds, the X-axis is the number of elements in the array. This suggests using boost::spreadsort.
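
The setup was roughly the following kind of harness (a sketch, not the exact benchmark code; it assumes Boost.Sort is installed, and the array size is illustrative):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>
#include <boost/sort/spreadsort/spreadsort.hpp>

int main() {
  std::mt19937_64 rng(42);
  std::vector<int32_t> base(1 << 22);  // ~4M random int32 keys
  for (auto& x : base) x = static_cast<int32_t>(rng());

  // Time one sorting function on a fresh copy of the input.
  auto time = [&](auto sort_fn, const char* name) {
    auto v = base;
    auto t0 = std::chrono::steady_clock::now();
    sort_fn(v);
    auto ms = std::chrono::duration<double, std::milli>(
                  std::chrono::steady_clock::now() - t0).count();
    std::cout << name << ": " << ms << " ms\n";
  };
  time([](auto& v) { std::sort(v.begin(), v.end()); }, "std::sort");
  time([](auto& v) { std::stable_sort(v.begin(), v.end()); }, "std::stable_sort");
  time([](auto& v) { boost::sort::spreadsort::spreadsort(v.begin(), v.end()); },
       "boost::spreadsort");
  return 0;
}
```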

Screen Shot 2021-02-09 at 17.48.13.png

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Note that we need a stable sort, which spreadsort apparently isn't.

asfimport commented 3 years ago

Kirill Lykov / @KirillLykov: Right, I will also check spinsort from boost. I would add why spreadsort is not stable: if the array is small or the integer doesn't fit into boost::uintmax_t, it falls back to pdqsort, which is unstable. Otherwise it uses radix sort, which is what we want. One solution is to implement a stable spreadsort variant that uses something else instead of pdqsort.

Below is the plot of these sorting algorithms for int64_t (the one above is for int32_t).

Screen Shot 2021-02-10 at 10.58.23.png

Looks like stable spinsort is not better than std::stable_sort, and this is expected. I believe that if we primarily care about keys of 64 bits or shorter, it makes sense to look into a spreadsort-like implementation. For that, it is possible to reuse some code from boost::sort::spreadsort::details, but this might be a bad idea since it is not part of the public interface.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Even a radix sort is not necessarily stable. You have to be careful when implementing it: see https://en.wikipedia.org/wiki/Radix_sort

Also, I assume your benchmarks are on random data? Data is very often not randomly distributed. On real-world data, an int32 or int64 column doesn't necessarily span the whole int32 or int64 range, for example.

Perhaps heuristics would need to be devised: based on the input range and the input size, you would either use a radix sort (which is O(n)) or a comparison-based sort such as spinsort (which is O(n log n)).
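
A minimal sketch of such a dispatch (the function name, the cutoff `range < 8 * n`, and the use of a counting sort as the O(n) path are all invented for illustration; real thresholds would come from benchmarks):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

void StableSortInt32(std::vector<int32_t>& v) {
  if (v.size() < 2) return;
  auto [mn, mx] = std::minmax_element(v.begin(), v.end());
  const int64_t lo = *mn;
  const int64_t range = int64_t(*mx) - lo;
  if (range < int64_t(v.size()) * 8) {
    // Narrow range: O(n + range) counting sort (trivially stable for plain
    // values; sorting indices instead would need a prefix-sum scatter).
    std::vector<uint64_t> counts(size_t(range) + 1, 0);
    for (int32_t x : v) ++counts[size_t(x - lo)];
    size_t i = 0;
    for (size_t d = 0; d < counts.size(); ++d) {
      for (uint64_t c = 0; c < counts[d]; ++c) v[i++] = int32_t(lo + int64_t(d));
    }
  } else {
    // Wide range: stay with the O(n log n) stable comparison sort.
    std::stable_sort(v.begin(), v.end());
  }
}
```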

asfimport commented 3 years ago

Kirill Lykov / @KirillLykov: On a higher level, I honestly thought that I would easily find a suitable open-source sort implementation, so I did only basic benchmarking. I think I will create a separate repo just for benchmarking existing sorting algorithms on their own, together with candidate implementations.

asfimport commented 3 years ago

Michal Nowakiewicz / @michalursa: I don't have any good advice on open source radix sort, but let me share some general thoughts about radix sort implementation:

  1. Regarding SIMD. It seems to me that an unstable radix sort should be relatively simple to implement using the AVX-512 instruction set (AVX2 is a bit more challenging, since it is missing, for instance, a scatter instruction for memory stores). A stable radix sort is a bit harder with SIMD but still doable. Is it worth it? I believe that the majority of the cost in radix sort is in the data movement (memory stores to non-consecutive addresses), especially if the data does not fit into the CPU cache. SIMD will not help with that, but it will cut the cost of everything else. I would expect a visible improvement but nothing close to a 2x speedup. It's probably something better left for later.
  2. Regarding papers. http://www.cs.columbia.edu/~orestis/sigmod14I.pdf - This paper may be a bit too much - it focuses on SIMD, parallel processing, optimizing for multiple NUMA nodes and it goes into a lot of depth and touches some other topics as well - but at the time it was published it looked to me like a pretty good implementation. There may be something interesting in their results and conclusions, even though they tested their solution on much bigger data sets.
  3. Regarding MSB (most significant bits first) vs LSB (least significant bits first) radix sort. MSB should be faster on larger data (compared to the size of CPU cache), more cache-friendly, and can be implemented as a stable sort.
  4. Regarding hybrid sorting. I imagine that a good solution would be to run MSB radix sort first, producing a set of data partitions based on the highest bits of the sort key. Once the partitions get small enough (e.g. they fit in L1 cache), I would guess that radix sort is no longer the best solution for sorting the data (it should get more of an advantage over comparison-based sorting when the input set is large). At that point a different sorting algorithm could be better suited, so any sort_of_choice can be called on the data within a radix partition (see the sketch after this list). If radix sort is faster on small arrays (thousands of elements) - great - but if it isn't, then using another sort method for small partitions seems like a good idea.
  5. Regarding data distribution. I think that data distribution is pretty important in designing and testing radix sort. For instance, if we sort 1 million 64-bit uniformly distributed integers, even though the space of potential values is huge, there are at most 1 million unique values. If I had to sort 1M 64-bit integers I would consider doing it using range partitioning: a) use a small sample of the data (e.g. every Nth value for a double-digit N) to compute a histogram (sort the sample, using any kind of sort, and divide it into K equal-size buckets for later range partitioning into K ranges); b) use range partitioning based on the histogram to split the data into multiple ranges; c) continue the two previous steps recursively until the partition size is small, then call a sort_of_choice function on that partition. Histogram-based range partitioning should take care of data skew and create approximately equal-sized partitions. That means that for 1M values, in order to get down to 1K-element partitions, about 1K partitions need to be generated. That can be done, for instance, as a single pass of range partitioning with 1024 partitions, or two passes with 32 partitions each. Even though a single round of range partitioning (bucket sort) is obviously more expensive than a pure radix sort pass (extra histogram work, plus finding a range for each value given the range boundaries), it seems to me like a good candidate, since it lowers the total number of partitioning steps needed (compared to a pure radix sort of 64-bit integers, which would use something like 8 passes of 8-bit radix partitioning). And SIMD can definitely help performance-wise in finding the right bucket for each value using the range boundaries.
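
For item 4, a minimal sketch of the hybrid (all names and constants are made up for illustration; the 256-element cutoff is a placeholder, and std::stable_sort stands in for the sort_of_choice):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Partition on the most significant byte, recurse on the next byte,
// and hand small partitions to a stable comparison sort.
void MsdRadixSort(uint64_t* lo, uint64_t* hi, int shift = 56) {
  const size_t n = size_t(hi - lo);
  if (n < 256 || shift < 0) {  // small partition: use the sort_of_choice
    std::stable_sort(lo, hi);
    return;
  }
  size_t counts[256] = {0};
  for (uint64_t* p = lo; p != hi; ++p) ++counts[(*p >> shift) & 0xFF];
  // Scatter into a scratch buffer through per-bucket write pointers;
  // going out-of-place keeps the pass stable.
  std::vector<uint64_t> tmp(n);
  uint64_t* dst[256];
  uint64_t* q = tmp.data();
  for (int d = 0; d < 256; ++d) { dst[d] = q; q += counts[d]; }
  for (uint64_t* p = lo; p != hi; ++p) *dst[(*p >> shift) & 0xFF]++ = *p;
  std::copy(tmp.begin(), tmp.end(), lo);
  // Recurse into each bucket on the next byte.
  uint64_t* b = lo;
  for (int d = 0; d < 256; ++d) {
    MsdRadixSort(b, b + counts[d], shift - 8);
    b += counts[d];
  }
}
```
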
asfimport commented 3 years ago

Kirill Lykov / @KirillLykov: I created a repository with the experiments I've done for the plots posted earlier: https://github.com/KirillLykov/int-sort-bmk

Unfortunately, I couldn't make fast progress on this ticket, and since it is not my main activity I decided to freeze it on my side. By fast progress I mean delivering a stable non-comparison-based sorting algorithm that is faster than std::stable_sort. The naive radix sort implemented there is much slower on int64_t, as one can see from the plots in scripts/imgs.

The last thing I was trying to do was to modify boost's integer_sort to make it stable (as the unstable version is really fast). To simplify experiments with integer_sort, I've extracted it into a separate file: https://github.com/KirillLykov/int-sort-bmk/blob/master/src/boost_spread_sort.h One can see that integer_sort sometimes falls back to pdqsort; this can be replaced with std::stable_sort. A more interesting part of the code, which I think makes integer_sort unstable, is https://github.com/KirillLykov/int-sort-bmk/blob/master/src/boost_spread_sort.h#L210 I think the in-place version there can be replaced with a non-in-place one.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Thanks for the update @KirillLykov. It will be useful if someone else takes up this issue.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Interesting read here: https://travisdowns.github.io/blog/2019/05/22/sorting.html

asfimport commented 3 years ago

Kirill Lykov / @KirillLykov: Thanks for the reference to the blog, I read all of his posts. 

I've checked Travis' final radix_sort7 version with my benchmarks; see below. It rocks!

all_random_wholeRange.png

There is no license file in his repo, so I cannot share my experiments.

There might be several ways to proceed. It looks like it would be good to ask Travis to contribute to Arrow. What do you think?

For now, I've opened an issue on his repo asking him to add a license: https://github.com/travisdowns/sort-bench/issues/1

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: I don't know if Travis would be interested, as plumbing work would probably dominate the actual optimization work (but I don't know what his interests are). That said, it probably doesn't hurt to ask him :-)

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Also, AFAIU his radix sort implementation is simply a left-to-right (most significant digit first) radix sort, so it should be stable unless there is a bug.

asfimport commented 3 years ago

Kirill Lykov / @KirillLykov: Right, I agree that it is stable. From a performance perspective, the main differences from a naive implementation of Cormen's pseudocode are: 1) skipping unnecessary passes (when all the numbers have the current digit (8 bits in Travis' code) set to the same value), which gives ~36%; 2) instead of using an offset array, using an array of pointers that already incorporates those offsets, so instead of `out[offset[(value>>shift)&bitmask]] = 123` doing `pOut[(value>>shift)&bitmask] = 123`, where `pOut` is precomputed to be `&out[0] + offset[]`. This gives ~24%.
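
Sketched in code (illustrative only, not Travis' actual implementation; `LsdPass` is a made-up name, `pOut` follows the naming above):

```cpp
#include <cstddef>
#include <cstdint>

// One 8-bit LSD counting pass with the two optimizations described above.
// Returns false when the pass can be skipped entirely (optimization 1).
bool LsdPass(const uint64_t* in, uint64_t* out, size_t n, unsigned shift) {
  if (n == 0) return false;
  size_t counts[256] = {0};
  for (size_t i = 0; i < n; ++i) ++counts[(in[i] >> shift) & 0xFF];
  // Optimization 1: all keys share the same byte at this position, so the
  // pass would not move anything -- skip it.
  if (counts[(in[0] >> shift) & 0xFF] == n) return false;
  // Optimization 2: fold offsets into write pointers up front, i.e.
  // pOut[d] = &out[0] + offset[d], so the scatter avoids one indirection.
  uint64_t* pOut[256];
  uint64_t* p = out;
  for (int d = 0; d < 256; ++d) { pOut[d] = p; p += counts[d]; }
  for (size_t i = 0; i < n; ++i) *pOut[(in[i] >> shift) & 0xFF]++ = in[i];
  return true;  // caller swaps the in/out buffers between passes
}
```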

He also uses __builtin_prefetch, a GCC builtin; I'm not sure it is portable. It might give a 12% improvement on some large datasets.

I will write to Travis directly. Perhaps he would like to contribute a version of this sort, and I will do the plumbing.

asfimport commented 3 years ago

Kirill Lykov / @KirillLykov: An update: I've contacted Travis, and he wrote that he lacks the time to work on it, though it seems interesting to him. He also mentioned that an MSD implementation might be better for bandwidth.

I'm thinking about how to proceed. One option might be:

  1. Travis will create a PR just with his radix sort implementation
  2. I will add tests and benchmarks to his PR

One thing which somewhat worries me is that sorting requires quite intense ad-hoc testing and benchmarking; are we fine with adding all of that to the Arrow repository?

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: I'm not sure we need intense benchmarking. Admittedly, it's better to test sorting on all kinds of input (almost sorted, random, etc.). However, we don't necessarily need that if we just want to find a cutoff point between radix sorting and std::stable_sort (which is probably a merge sort with fallback to a simpler sort for small sizes).

asfimport commented 3 years ago

Kirill Lykov / @KirillLykov: I'm continuing experiments with the MSD sort. I'm using a gist as a log of the optimizations: https://gist.github.com/KirillLykov/c641e63adfd68591bafbdb342f75d141

Please ignore the code style since it is for experiments. Each comment in the gist is an iteration of optimization. Please let me know if someone has ideas to try. I was planning to try to get rid of the recursion by using an explicit stack; I don't see any other big things to do with it.

all_random_wholeRange.png

The idea by @michalursa to switch to another sorting algorithm when the subarray to sort is smaller than some threshold surprisingly doesn't work well on my arrays (for now, only the uniform distribution).

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: You should probably bench against radix_sort6 because radix_sort7 merely adds an explicit prefetch which is a bit orthogonal.

asfimport commented 3 years ago

Kirill Lykov / @KirillLykov: Actually, I don't see a significant effect from the prefetch in my benchmarks, probably because I use clang; I'm now wondering how it compiles it. So my version might be considered equivalent to radix_sort6.

asfimport commented 3 years ago

Kirill Lykov / @KirillLykov: Do we have any information about the distribution of the integers? Also, are they signed or unsigned?

I've done some more benchmarking and plotting: https://github.com/KirillLykov/int-sort-bmk. In short, radix sort almost always performs better than std::stable_sort, but there are some cases where stable_sort is better: a sorted sequence, or all elements equal.

Another observation is that the LSD implementation looks faster than the MSD implementation. Yet it is possible to get almost the same performance using an MSD/LSD hybrid algorithm, if we think MSD is better for memory bandwidth.

The question is how to measure memory bandwidth consumption; I used LLC cache-miss events.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: We can't presume anything about the distribution of integers. It could be anything. Also, they may be signed or unsigned (but I don't think that makes a lot of difference, does it?).

Do note that we need a stable sort, so I don't think using LSD can work.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Thanks a lot for running these benchmarks, by the way. It seems that std::stable_sort is already quite good on several cases.

asfimport commented 3 years ago

Kirill Lykov / @KirillLykov: I believe the LSD implementation by Travis is stable by construction; I will recheck it. Regarding stable_sort: I'm trying to find a way to fall back to stable_sort instead of radix sort for particular types of distributions. These checks are done in boost::spreadsort, but I haven't worked out the math behind them yet. In particular, they detect whether the sequence is sorted and find the min/max values. From that they can conclude whether the sequence is bucket-sorted, or is not sorted but better handled by another algorithm.
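
A sketch of what such a pre-check might look like (my reading of the idea, not boost's actual code; the function name is made up): one linear pass detects sortedness and computes the min/max that a range-based heuristic would then use.

```cpp
#include <cstdint>
#include <vector>

// Single-pass pre-check: detects an already-sorted input and computes the
// min/max that a range-based algorithm choice would use.
bool IsSortedWithMinMax(const std::vector<int64_t>& v, int64_t& mn, int64_t& mx) {
  bool sorted = true;
  mn = mx = v.empty() ? 0 : v[0];
  for (size_t i = 1; i < v.size(); ++i) {
    sorted = sorted && v[i - 1] <= v[i];
    if (v[i] < mn) mn = v[i];
    if (v[i] > mx) mx = v[i];
  }
  return sorted;  // if true, sorting can be skipped entirely
}
```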

asfimport commented 3 years ago

Kirill Lykov / @KirillLykov: I have to drop this ticket, but if someone takes it up, I will be glad to assist. Below is a summary of the research I did.

First of all, I couldn't find a ready-to-use stable radix sort implementation that is fast and generic. boost::spreadsort is the best available implementation of in-place radix sort because it is designed to perform well on all inputs. By contrast, a vanilla radix sort will perform poorly on some types of distributions, as shown in the plot below. spreadsort performs better for two reasons:

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Thanks for the investigation! Were you able to check on the claim that Travis' LSD sort is stable?

asfimport commented 3 years ago

Kirill Lykov / @KirillLykov: The LSD sort is stable. Do you want to try Travis' LSD sort implementation? If yes, we can discuss what needs to be done: I can change the style of the code, make it templated, support negative values, etc.
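
For negative values, the standard trick would be to XOR the sign bit so that the unsigned byte order used by the radix passes matches the signed order (function names here are made up for illustration):

```cpp
#include <cstdint>

// Flipping the sign bit maps signed order onto unsigned order:
// INT64_MIN -> 0, -1 -> 0x7FFF..., 0 -> 0x8000..., INT64_MAX -> 0xFFFF...
inline uint64_t EncodeSigned(int64_t x) {
  return static_cast<uint64_t>(x) ^ (uint64_t{1} << 63);
}
inline int64_t DecodeSigned(uint64_t u) {
  return static_cast<int64_t>(u ^ (uint64_t{1} << 63));
}
```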

Making a perfect adaptive, stable, generic radix sort looks like too time-consuming a task for me, but adapting the LSD sort is simpler.

Yet about bandwidth: Travis wrote to me that he originally wanted to write two posts, one about LSD and another about MSD, and that he prefers MSD due to its lower bandwidth consumption. This sounds logical looking at the algorithms themselves: LSD will visit all the radixes anyway, while MSD builds a tree of recursive calls which, generally speaking, is not balanced, so it makes fewer memory operations. I tried to estimate bandwidth using LLC misses in perf; the results are in the table at https://github.com/KirillLykov/int-sort-bmk#msd-vs-lsd-radix-sort Yet MSD is more complicated to implement and less performant, so if taking an MSD-like approach, I would go with an adaptive MSD like the one from boost. What do you think about this?

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Good question about bandwidth!

A possible solution for the bandwidth issue may be to 1) run an LSD radix sort on small chunks (for example 64kB or 128kB, to fit in L2), then 2) run a merge sort on the resulting sorted chunks. I'm not sure what the performance would be.
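
A rough sketch of that idea (everything here is illustrative: std::stable_sort stands in for the per-chunk LSD radix sort, and the 16K-element chunk, i.e. 128 kB of uint64_t, is just an example size):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Phase 1 sorts cache-sized chunks (the LSD radix sort would go here);
// phase 2 merges sorted runs pairwise, doubling the run length each round.
void ChunkedStableSort(std::vector<uint64_t>& v, std::size_t chunk = 16 * 1024) {
  const auto at = [&](std::size_t i) { return v.begin() + std::ptrdiff_t(i); };
  for (std::size_t b = 0; b < v.size(); b += chunk) {
    std::stable_sort(at(b), at(std::min(b + chunk, v.size())));
  }
  for (std::size_t w = chunk; w < v.size(); w *= 2) {
    for (std::size_t b = 0; b + w < v.size(); b += 2 * w) {
      std::inplace_merge(at(b), at(b + w), at(std::min(b + 2 * w, v.size())));
    }
  }
}
```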

asfimport commented 3 years ago

Kirill Lykov / @KirillLykov: Well, it looks like this is called "wolfsort" (https://github.com/scandum/wolfsort). I compared Igor's implementation with Travis' LSD sort and std::stable_sort on uniformly distributed uint64_t from the range 0..1e9, see

uniform_1B_wolf.png

I don't know whether his implementation is good or bad. I just took it and made it a bit more strongly typed so that it compiles with a C++ compiler; see the wolfSortCheck branch (https://github.com/KirillLykov/int-sort-bmk/blob/wolfSortCheck/) if really interested.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Ha, the search space for sorting algorithms looks a bit crowded :-)

Generally speaking, I don't think there's a need to press forward on this issue. But your experiments and findings are extremely useful and will probably serve us in the future, so thanks a lot!

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Also worth taking a look at https://github.com/scandum/quadsort , even if it's not radix-based.