asfimport opened this issue 3 years ago
Kirill Lykov / @KirillLykov: Sounds interesting to me, I would like to have a look. Could you give some more details on where to start? I mean:
Yibo Cai / @cyb70289: @pitrou, please correct me if I missed something.
Code: vector_sort.cc
Unit test: vector_sort_test.cc
Benchmark: vector_sort_benchmark.cc
Please note that the current code supports sorting multiple columns, so it looks a bit complex. I think we can start with a single-array sorter. Specifically, the code line below calls std::stable_sort: https://github.com/apache/arrow/blob/a321cdedb75b27b669995065ffe5596b3cfb3ae4/cpp/src/arrow/compute/kernels/vector_sort.cc#L378
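For orientation, the idea at that call site is a stable argsort: sort a vector of indices with a comparator over the values. A minimal sketch of that pattern (illustrative names, not Arrow's actual code):

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical single-array sorter: returns the indices that would sort
// `values`, using std::stable_sort the way the linked kernel line does.
std::vector<int64_t> StableArgSort(const std::vector<int32_t>& values) {
  std::vector<int64_t> indices(values.size());
  std::iota(indices.begin(), indices.end(), 0);  // 0, 1, 2, ...
  std::stable_sort(indices.begin(), indices.end(),
                   [&](int64_t a, int64_t b) { return values[a] < values[b]; });
  return indices;
}
```

Stability matters here: for equal values, the indices come out in their original order, which is the guarantee a radix-sort replacement would also have to provide.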
Kirill Lykov / @KirillLykov: Thanks for the clarification, I will check it out next week.
Kirill Lykov / @KirillLykov:
From the paper "Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort" by Satish et al.: They compare, among other things, the performance of radix sort and merge sort implemented with intrinsics. Both implementations are highly optimized for SIMD architectures, and the radix sort has several non-obvious improvements to make it more cache-friendly. Radix sort performance depends on the key size, because the longer the key, the more passes are needed. Radix sort becomes bandwidth-bound starting at 48-bit keys. Further, Satish et al. report that merge sort is 1.7x slower than radix sort on 32-bit keys, but becomes comparable in performance to radix sort on 64-bit keys and 1.3x better on 128-bit keys. Their conclusion is that a "bandwidth-oblivious SIMD friendly merge sort, should, therefore, be the sorting method of choice for future databases".
I haven't found a good open-source radix sort for CPUs. There is one in Boost called spreadsort, which looks promising. Alternatively, one could implement the Satish paper, but that code would be difficult to maintain.
Antoine Pitrou / @pitrou:
@KirillLykov Feel free to experiment, but note that adding SIMD into the mix will add a layer of complication.
For a non-SIMD implementation, my intuition is that a radix sort could be significantly faster on large data than the current merge sort (a.k.a. std::stable_sort).
Kirill Lykov / @KirillLykov: Your intuition is correct, at least for int32_t keys. Separately from Arrow, I benchmarked std::sort, std::stable_sort, boost::spreadsort, and my naive radix sort. Surprisingly to me, even the naive radix sort is faster than the std sorts, but Boost's spreadsort is better still. The Y-axis is milliseconds, the X-axis is the number of elements in the array. This suggests using boost::spreadsort.
Antoine Pitrou / @pitrou: Note that we need a stable sort, which spreadsort apparently isn't.
Kirill Lykov / @KirillLykov: Right, I will also check spinsort from Boost. To add why spreadsort is not stable: if the array is small, or the integer doesn't fit into boost::uintmax_t, it falls back to pdqsort, which is unstable. Otherwise it uses a radix sort, which is what we want. One solution is to implement a stable spreadsort variant that uses something else instead of pdqsort.
Below is the plot for these sorting algorithms for int64_t (the one above is for int32_t).
It looks like stable spinsort is not better than std::stable_sort, which is expected. I believe that if we primarily target keys of 64 bits or fewer, it makes sense to look into a spreadsort-like implementation. For that, it would be possible to reuse some code from boost::sort::spreadsort::detail, but that might be a bad idea since it is not part of the public interface.
Antoine Pitrou / @pitrou: Even a radix sort is not necessarily stable. You have to be careful when implementing it: see https://en.wikipedia.org/wiki/Radix_sort
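The standard way to guarantee stability is to make each per-digit pass a stable counting sort into a scratch buffer, scanning the input in order. A minimal LSD sketch for uint32_t keys (assumptions: 8-bit digits, unsigned keys, out-of-place passes; not Arrow or Boost code):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stable LSD radix sort on unsigned 32-bit keys, 8 bits per pass.
// Each pass scatters elements in input order into `buf`, so equal keys
// keep their relative order -- the stability property discussed above.
void StableRadixSort(std::vector<uint32_t>& data) {
  std::vector<uint32_t> buf(data.size());
  for (int shift = 0; shift < 32; shift += 8) {
    std::array<std::size_t, 257> count{};  // count[d + 1] counts digit d
    for (uint32_t v : data) ++count[((v >> shift) & 0xFF) + 1];
    for (int d = 0; d < 256; ++d) count[d + 1] += count[d];  // prefix sums
    for (uint32_t v : data) buf[count[(v >> shift) & 0xFF]++] = v;  // stable scatter
    data.swap(buf);
  }
  // Four passes (an even number), so the result ends up back in `data`.
}
```

In-place variants (like American flag sort, used by MSD in-place implementations) swap elements around and lose this ordering guarantee, which is exactly the pitfall the Wikipedia article describes.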
Also, I assume your benchmarks are on random data? Data is very often not randomly distributed. On real-world data, an int32 or int64 column doesn't necessarily span the whole int32 or int64 range, for example.
Perhaps heuristics would need to be devised: based on the input range and the input size, you would either use a radix sort (which is O(n)) or a comparison-based sort such as spinsort (which is O(n log n)).
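A sketch of what such a dispatch could look like: one O(n) scan finds the value range, and a stable counting sort is used when the range is small relative to n. All names and thresholds here are hypothetical placeholders, not tuned or proposed values:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stable counting sort for a small known range; only valid when the
// O(n + range) cost beats O(n log n).
static void CountingSort(std::vector<int64_t>& data, int64_t lo, uint64_t range) {
  std::vector<std::size_t> count(range + 2, 0);
  for (int64_t v : data) ++count[static_cast<uint64_t>(v - lo) + 1];
  for (std::size_t d = 0; d + 1 < count.size(); ++d) count[d + 1] += count[d];
  std::vector<int64_t> out(data.size());
  for (int64_t v : data) out[count[static_cast<uint64_t>(v - lo)]++] = v;
  data.swap(out);
}

// Hypothetical dispatch following the heuristic above.
void SortWithHeuristic(std::vector<int64_t>& data) {
  if (data.size() < 64) {  // tiny input: comparison sort wins anyway
    std::stable_sort(data.begin(), data.end());
    return;
  }
  auto mm = std::minmax_element(data.begin(), data.end());
  uint64_t range =
      static_cast<uint64_t>(*mm.second) - static_cast<uint64_t>(*mm.first);
  if (range <= 8 * data.size()) {
    CountingSort(data, *mm.first, range);  // O(n + range), stable
  } else {
    std::stable_sort(data.begin(), data.end());  // O(n log n) fallback
  }
}
```

A real implementation would replace the counting-sort branch with a multi-pass radix sort whose digit width adapts to the detected range, as the later comments about boost::spreadsort discuss.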
Kirill Lykov / @KirillLykov: On a higher level: I honestly thought I would easily find a suitable open-source sort implementation, so I did only basic benchmarking. I think I will create a separate repo just for benchmarking existing sorting algorithms together with candidate implementations.
Michal Nowakiewicz / @michalursa: I don't have any good advice on open source radix sort, but let me share some general thoughts about radix sort implementation:
Kirill Lykov / @KirillLykov: I added a repository with some experiments I've done for the plots added earlier: https://github.com/KirillLykov/int-sort-bmk
Unfortunately, I couldn't make fast progress on this ticket, and since it is not my main activity I decided to freeze it on my side. By fast progress I mean delivering a stable non-comparison-based sorting algorithm that is faster than std::stable_sort. The naive radix sort implemented there is much slower on int64_t, as one can see from the plots in scripts/imgs.
The last thing I tried was to modify Boost's integer_sort to make it stable (since the unstable version is really fast). To simplify experiments with integer_sort, I extracted it into one separate file: https://github.com/KirillLykov/int-sort-bmk/blob/master/src/boost_spread_sort.h. One can see that integer_sort sometimes falls back to pdqsort; this can be replaced with stable_sort. The more interesting part of the code, which I think makes integer_sort unstable, is https://github.com/KirillLykov/int-sort-bmk/blob/master/src/boost_spread_sort.h#L210 — I think the in-place version there can be replaced with a non-in-place one.
Antoine Pitrou / @pitrou: Thanks for the update @KirillLykov. It will be useful if someone else takes up this issue.
Antoine Pitrou / @pitrou: Interesting read here: https://travisdowns.github.io/blog/2019/05/22/sorting.html
Kirill Lykov / @KirillLykov: Thanks for the reference to the blog, I read all of his posts.
I've checked Travis' final radix_sort7 version with my benchmarks; see below. It rocks!
There is no license file in his repo, so I cannot share my experiments.
There might be several ways to proceed. It looks like it would be good to ask Travis to contribute to Arrow. What do you think?
For now I've opened an issue on his repo asking him to add a license, see https://github.com/travisdowns/sort-bench/issues/1
Antoine Pitrou / @pitrou: I don't know if Travis would be interested, as plumbing work would probably dominate the actual optimization work (but I don't know what his interests are). That said, it probably doesn't hurt to ask him :-)
Antoine Pitrou / @pitrou: Also, AFAIU his radix sort implementation is simply a left-to-right (most significant digit first) radix sort, so should be stable unless there is a bug.
Kirill Lykov / @KirillLykov:
Right, I agree that it is stable.
From a performance perspective, the main differences from a naive implementation of Cormen's pseudocode are:
1) Skip unnecessary passes (when all the numbers have the current digit (8 bits in Travis' code) set to the same value). Gives ~36%.
2) Instead of indexing through an offset array, use an array of pointers that already accounts for these offsets: instead of out[offset[(value>>shift)&bitmask]] = 123, do pOut[(value>>shift)&bitmask] = 123, where pOut is precomputed as &out[0] + offset[]. Gives ~24%.
He also uses __builtin_prefetch, a GCC builtin, which I'm not sure is portable. It might give ~12% on some large datasets.
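The pointer trick from point 2 can be sketched as two versions of one scatter pass (illustrative names and signatures, not Travis' actual code; the per-bucket pointer is advanced as it writes):

```cpp
#include <cstddef>
#include <cstdint>

// One scatter pass of an 8-bit radix sort, written both ways.
// `offset` holds the starting position of each bucket (prefix sums).

// Naive version: indexes through the offset array on every store.
void ScatterWithOffsets(const uint32_t* in, std::size_t n, uint32_t* out,
                        std::size_t offset[256], int shift) {
  for (std::size_t i = 0; i < n; ++i)
    out[offset[(in[i] >> shift) & 0xFF]++] = in[i];
}

// Pointer-array version: precompute one write pointer per bucket, saving
// the extra addressing arithmetic on each store (~24% in the measurements
// reported above).
void ScatterWithPointers(const uint32_t* in, std::size_t n, uint32_t* out,
                         const std::size_t offset[256], int shift) {
  uint32_t* bucket[256];
  for (int d = 0; d < 256; ++d) bucket[d] = out + offset[d];
  for (std::size_t i = 0; i < n; ++i)
    *bucket[(in[i] >> shift) & 0xFF]++ = in[i];
}
```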
I will write to Travis directly. Perhaps he would like to contribute a version of this sort, and I can do the plumbing.
Kirill Lykov / @KirillLykov: Update: I've contacted Travis, and he wrote that he lacks the time to work on it, though it seems interesting to him. He also mentioned that an MSD implementation might be better for bandwidth.
I'm thinking about the way to proceed. One option might be
One thing that somewhat worries me is that sorting requires quite intense ad-hoc testing and benchmarking; are we fine with adding all of that to the Arrow repository?
Antoine Pitrou / @pitrou:
I'm not sure we need intense benchmarking. Admittedly, it's better to test sorting on all kinds of input (almost sorted, random, etc.). However, we don't necessarily need that if we just want to find a cutoff point between radix sorting and std::stable_sort (which is probably a merge sort with a fallback to a simpler sort for small sizes).
Kirill Lykov / @KirillLykov: I'm continuing experiments with the MSD sort. I'm using a gist as a log of the optimizations: https://gist.github.com/KirillLykov/c641e63adfd68591bafbdb342f75d141
Please ignore the code style, since it is for experiments. Each comment in this gist is an iteration of optimization. Please let me know if someone has ideas to try. I was planning to try to get rid of the recursion by using an explicit stack; I don't see any other big things to do with it.
The idea by @michalursa to switch to another sorting algorithm when the size of the subarray to sort falls below some threshold surprisingly doesn't work well on my arrays (so far only a uniform distribution).
Antoine Pitrou / @pitrou:
You should probably bench against radix_sort6, because radix_sort7 merely adds an explicit prefetch, which is a bit orthogonal.
Kirill Lykov / @KirillLykov: Actually, I don't see a significant effect from the prefetch in my benchmarks, probably because I use clang; I'm now wondering how it compiles it. So it can be considered equivalent to radix_sort6.
Kirill Lykov / @KirillLykov: Do we have any information about the distribution of the integers? And are they signed or unsigned?
I've done some more benchmarking and plots: https://github.com/KirillLykov/int-sort-bmk. In short, radix sort almost always performs better than std::stable_sort, but there are some cases where stable_sort is better: a sorted sequence, or all elements equal.
Another observation is that the LSD implementation seems faster than the MSD implementation. Yet it is possible to get almost the same performance with an MSD/LSD hybrid algorithm, if we believe MSD is better for memory bandwidth.
The question is how to measure memory-bandwidth consumption; I used LLC-cache-miss events.
Antoine Pitrou / @pitrou: We can't presume anything about the distribution of integers. It could be anything. Also, they may be signed or unsigned (but I don't think that makes a lot of difference, does it?).
Do note that we need a stable sort, so I don't think using LSD can work.
Antoine Pitrou / @pitrou:
Thanks a lot for running these benchmarks, by the way. It seems that std::stable_sort is already quite good in several cases.
Kirill Lykov / @KirillLykov: I believe the LSD implementation by Travis is stable by construction; I will recheck it. Regarding stable_sort: I'm trying to find a way to fall back to stable_sort instead of radix sort for particular types of distributions. Such checks are done in boost::spreadsort, but I haven't worked through the math behind them yet. In particular, they detect whether the sequence is sorted and find the min/max values. From that they can conclude whether the sequence is bucket-sorted, or is unsorted but better handled by another algorithm.
Kirill Lykov / @KirillLykov: I have to drop this ticket, but if someone takes it up, I will be glad to assist. Below is a summary of the research I did.
First of all, I couldn't find a ready-to-use stable radix sort implementation that is fast and generic. boost::spreadsort is the best available implementation of in-place radix sort, because it is designed to perform well on all inputs. By contrast, a vanilla radix sort will perform poorly on some types of distributions, as shown in the plot below. spreadsort performs better for two reasons:
it uses an adaptive radix size rather than a fixed one
What makes spreadsort unstable is that it sometimes falls back to pdqsort, which is unstable, and that it sorts in place. Hence, it should be possible to make it stable by replacing the calls to pdqsort with stable_sort and by using an additional buffer instead of sorting in place. Another argument for starting from the spreadsort code is that it is tested on different platforms (so less risk of surprises) and works with different data types.
Besides developing a new, efficient implementation of adaptive radix sort, there is also integration work to be done, since right now a custom comparator is used with std::stable_sort.
Taking into account the extensive testing/benchmarking (different distributions, different data types) and profiling, I would estimate this task at 80-120 hours.
Antoine Pitrou / @pitrou: Thanks for the investigation! Were you able to check on the claim that Travis' LSD sort is stable?
Kirill Lykov / @KirillLykov: The LSD sort is stable. Do you want to try Travis' LSD sort implementation? If yes, we can discuss what should be done; for instance, I can change the code style, make it templated, support negative numbers, etc.
Making a perfect adaptive, stable, generic radix sort looks like too time-consuming a task for me, but adapting the LSD sort is simpler.
Now, about bandwidth. Travis wrote to me that he originally wanted to publish two posts: one about LSD and another about MSD, and that he prefers MSD due to its lower bandwidth consumption. That sounds logical looking at the algorithms themselves: LSD will visit all the radix digits anyway, while MSD builds a tree of recursive calls which is generally not balanced, so it performs fewer memory operations. I tried to estimate bandwidth using LLC misses in perf; the results are in the table here: https://github.com/KirillLykov/int-sort-bmk#msd-vs-lsd-radix-sort. Yet MSD is more complicated to implement and less performant, so if taking an MSD-like approach I would go with an adaptive MSD like the one from Boost. What do you think about this?
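To make the MSD recursion tree concrete, here is a minimal recursive MSD byte-radix sort. It stays stable because each partition pass is a stable counting scatter into a scratch buffer (a sketch under those assumptions, not Travis' or Boost's code):

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Stable MSD radix sort sketch: partition by the most significant byte
// into `scratch`, copy back, then recurse into each bucket. Unlike LSD,
// recursion only descends into non-trivial buckets, which is why MSD
// tends to move less memory on skewed data.
void MsdSort(uint32_t* data, std::size_t n, uint32_t* scratch, int shift = 24) {
  if (n < 2 || shift < 0) return;
  std::array<std::size_t, 257> count{};
  for (std::size_t i = 0; i < n; ++i) ++count[((data[i] >> shift) & 0xFF) + 1];
  for (int d = 0; d < 256; ++d) count[d + 1] += count[d];  // prefix sums
  std::array<std::size_t, 256> start;
  for (int d = 0; d < 256; ++d) start[d] = count[d];  // bucket starts
  for (std::size_t i = 0; i < n; ++i)
    scratch[count[(data[i] >> shift) & 0xFF]++] = data[i];  // stable scatter
  std::copy(scratch, scratch + n, data);
  for (int d = 0; d < 256; ++d) {  // recurse into each non-empty bucket
    std::size_t len = count[d] - start[d];
    if (len > 1) MsdSort(data + start[d], len, scratch + start[d], shift - 8);
  }
}
```

On nearly sorted or low-entropy data most buckets come out empty or trivially small and the recursion stops early, which is the bandwidth advantage being discussed.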
Antoine Pitrou / @pitrou: Good question about bandwidth!
A possible solution for the bandwidth issue may be to 1) run an LSD radix sort on small chunks (for example 64 kB or 128 kB, to fit in L2), then 2) run a merge sort on the resulting sorted chunks. I'm not sure what the performance would be.
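The two-phase idea can be sketched as follows, with std::stable_sort standing in for the per-chunk LSD radix sort and std::inplace_merge (which is stable) for the merge phase. The chunk size is a placeholder, not a tuned value:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Phase 1: sort each cache-sized chunk (stand-in for an LSD radix sort).
// Phase 2: merge adjacent sorted runs pairwise until one run remains.
void ChunkedSort(std::vector<int64_t>& data) {
  const std::size_t kChunk = 16 * 1024;  // 16K int64_t = 128 kB, fits L2
  const std::size_t n = data.size();
  for (std::size_t i = 0; i < n; i += kChunk)
    std::stable_sort(data.begin() + i, data.begin() + std::min(i + kChunk, n));
  for (std::size_t run = kChunk; run < n; run *= 2)
    for (std::size_t i = 0; i + run < n; i += 2 * run)
      std::inplace_merge(data.begin() + i, data.begin() + i + run,
                         data.begin() + std::min(i + 2 * run, n));
}
```

Since both phases are stable, the composition stays stable; the open question is whether the extra merge passes eat the bandwidth saved by keeping the radix passes in L2.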
Kirill Lykov / @KirillLykov: Well, it looks like this is called "wolfsort" (https://github.com/scandum/wolfsort). I compared Igor's implementation with Travis' LSD sort and stable_sort on uniformly distributed uint64_t from the range 0..1e9, see below.
I don't know if his implementation is good or bad. I just took it and made it a bit more strictly typed so it compiles with a C++ compiler; see the wolfSortCheck branch of https://github.com/KirillLykov/int-sort-bmk if you are really interested.
Antoine Pitrou / @pitrou: Ha, the search space for sorting algorithms looks a bit crowded :-)
Generally speaking, I don't think there's a need to press forward on this issue. But your experiments and findings are extremely useful and will probably serve us in the future, so thanks a lot!
Antoine Pitrou / @pitrou: Also worth taking a look at https://github.com/scandum/quadsort , even if it's not radix-based.
For integer arrays with a non-tiny range of values, we currently use a stable sort. It may be faster to use a radix sort instead.
Reporter: Antoine Pitrou / @pitrou
Note: This issue was originally created as ARROW-10899. Please see the migration documentation for further details.