gitboy16 opened this issue 2 years ago
I use the default sorting algorithm for each language. Folks who use Python and want performant code typically use Numpy arrays rather than Python lists, so I included python/numpy as an additional system to benchmark.
C++/boost seems like another system that would be reasonable to include. I'd be happy to review a pull request adding it.
My benchmark machine has two physical cores and four logical cores, so it is possible multithreading could result in a speedup. For comparison, sorting in Julia 1.9 operates on a single thread.
According to a benchmark for which no source code was provided, this algorithm is 4x faster than the default sorting algorithm at 10^6 elements. That would place it between Julia-1.7 and Julia-1.9.
I don't think I understand your benchmark. It seems you are measuring a lot more than just the time it takes to sort a vector. I tested the sorting algorithm of Julia 1.9 vs Boost on large data (i.e. above 1e8; I usually work with ~1e9 values, which are floating-point outputs of models between 0 and 1) and it was behind in single-threaded mode. I use Boost in multithreaded mode because waiting 30 to 40 seconds to sort the data is too slow.
> I don't think I understand your benchmark. It seems you are measuring a lot more than just the time it takes to sort a vector.
I explain this in the README:
> I compute the difference between runtime with and without sorting to infer the sorting runtime. For high performing languages on this benchmark, the vast majority of the time is spent sorting (typically 80-90%).
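To illustrate the idea, here is a minimal sketch of the subtraction approach (not this repo's actual harness; the function name and sizes here are illustrative):

function timed_workload(n, include_sort::Bool)
    t0 = time()
    x = rand(n)                # vector generation happens in both runs
    include_sort && sort!(x)   # the only difference between the two runs
    s = sum(x)                 # consume the result so the work is not optimized away
    return time() - t0, s
end

with_sort, _    = timed_workload(10^7, true)
without_sort, _ = timed_workload(10^7, false)
println("inferred sorting time: ", with_sort - without_sort, " s")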
> it was behind in single-threaded mode
I'd be happy to add your results to the graph on the README if you make a PR to this repository.
I have tested on Julia 1.9 (commit 799a2cfab5) vs Rcpp. There is probably an overhead to executing C++ via R 3.6.1 (but it is so nice, easy, and reproducible that it is worth the cost).
The Julia code is:

using BenchmarkTools
x = rand(1_000);
@btime sort(x)
The code for Rcpp is the following:
// [[Rcpp::depends(BH)]]
#include <Rcpp.h>
#include <boost/sort/sort.hpp>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector boostsort(NumericVector x, uint32_t con_threads) {
  NumericVector y = clone(x);  // copy so the caller's vector is not mutated
  // block_indirect_sort is Boost's parallel sort; con_threads sets the thread count
  boost::sort::block_indirect_sort(y.begin(), y.end(), con_threads);
  return y;
}
Note that above I am cloning the input vector, so my timing of the Boost sort includes that additional step as well.
To run the above in R:
x = runif(1e3)
Rcpp::sourceCpp("boost_sort.cpp")
microbenchmark::microbenchmark(boostsort(x, 1L), times = 5L)
I have tested the following sizes; timings:

Size   Julia     C++ (Boost)
1e6    53.3 ms   35.5 ms
1e7    596 ms    410 ms
1e8    5.9 s     4.77 s
Personally I am more interested in multi-threaded sorting on really large data, and that is where Boost shines. FYI, I ran the above on a desktop and couldn't do 1e9 because there is not enough memory for that. I couldn't test on the server I use because I am not allowed to install Julia there. But at least now you have the code to benchmark! (The C++ version could probably be improved if done in pure C++.)
Thanks! Running C++ through R is an interesting approach to achieving reproducibility.
I got three errors in R, which I solved by installing the missing packages:

install.packages("Rcpp")
install.packages("BH")
install.packages("microbenchmark")

Then I hit a second class of error, which I got past by running this all in a directory that contains no files except for boost_sort.cpp. Now I am stuck on a third class of error:
> x = runif(1e3)
> Rcpp::sourceCpp("boost_sort.cpp")
> microbenchmark::microbenchmark(boost_sort(x,1L), times =5L)
Error in boost_sort(x, 1L) : could not find function "boost_sort"
Beyond the fact that I can't reproduce this (yet), there are two issues: this sorting does not do real work (i.e. a smart enough compiler could realize that it doesn't actually matter whether or not the sort takes place), and it repeatedly sorts the same vector. Hopefully both of these issues are minor, but in the interest of fairness, I take a slightly more robust approach for the benchmarks in this repo (see README.md for details).
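For example, here is one way to sidestep both pitfalls with BenchmarkTools (a sketch, not the exact code this repo uses):

using BenchmarkTools
# `setup` draws a fresh vector for every sample, `sum` consumes the sorted
# result so a smart compiler cannot elide the work, and `evals=1` ensures
# each freshly drawn vector is sorted exactly once.
@btime sum(sort!(x)) setup=(x = rand(10^6)) evals=1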
Sorry, please replace boost_sort with boostsort in the R part; I have edited my comment above. From the code I read in your repo, I don't think your approach is fair because you are measuring more than the time it takes to sort a vector (for example, you are measuring the time it takes to create the vector, and this can vary a lot from one language to another)...but that is just my view...
I managed to time both algorithms with 1e9 elements. Julia took 57.41 s vs Boost's 56.78 s, which is pretty impressive for a dynamic language. I noticed, however, a higher peak in memory consumption on the Julia side. Using 8 threads, Boost takes 15.30 s. For your info, we have an in-house algorithm that takes 7.21 s, and we think there is room for improvement.
As I explain in the README, I measure the time with sorting plus other things, and the time with just the other things; I subtract the two to get a figure for sorting time. This method takes into account the total impact of a sorting operation on the execution of a program. For example, a fast sorting algorithm that has a large garbage collection impact will be accurately penalized by my benchmarking strategy, whereas standard practice for microbenchmarks may mask this cost by running the garbage collector between samples.
Further, for fast languages (Julia and Java were the two I was looking at when I made this determination, but it should probably also hold for C++), sorting takes substantially more time than generating the vectors. This is because sorting is (theoretically) O(n log n) while generation is O(n). IIRC, 80-90% of the time is spent sorting, but I compensate for that 10-20% overcount fairly well by subtracting out the runtime of everything but sorting.
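You can sanity-check that split with a quick measurement like this one (a rough sketch; the exact fraction varies with hardware and n):

using BenchmarkTools
n = 10^7
t_gen  = @belapsed rand($n)         # generation only
t_both = @belapsed sort!(rand($n))  # generation plus sorting
println("≈ ", round(100 * (t_both - t_gen) / t_both, digits=1), "% of time spent sorting")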
As I note in the code, for C++ this subtraction results in a strange bug (PR welcome if you can identify how to fix it), so rather than subtracting the measured 10-20% as I do for Julia, I simply subtract a flat 15% (e.g. if the full C++ run takes 100 ms, I record 85 ms as the sorting time). This would be meaningfully unfair if C++ is substantially slower than Julia at generating random numbers. I hope this is not the case.
I have finally been able to reproduce your benchmarks! Thanks for your help. Interestingly, I get different results on my computer:
julia> for i in 1:8
x = rand(10^i)
@btime sort(x)
end
106.970 ns (1 allocation: 144 bytes)
2.549 μs (1 allocation: 896 bytes)
18.995 μs (4 allocations: 18.09 KiB)
199.390 μs (6 allocations: 164.56 KiB)
2.141 ms (6 allocations: 1.53 MiB)
28.287 ms (6 allocations: 15.27 MiB)
317.828 ms (6 allocations: 152.60 MiB)
4.349 s (6 allocations: 1.49 GiB)
> Rcpp::sourceCpp("boost_sort.cpp")
> for (i in 1:8) {
+ x = runif(10^i)
+ print(microbenchmark::microbenchmark(boostsort(x,1L), times =5L))
+ }
Unit: microseconds
expr min lq mean median uq max neval
boostsort(x, 1L) 2.581 2.856 29.0578 2.869 8.289 128.694 5
Unit: microseconds
expr min lq mean median uq max neval
boostsort(x, 1L) 4.233 4.72 10.5894 4.747 6.346 32.901 5
Unit: microseconds
expr min lq mean median uq max neval
boostsort(x, 1L) 63.525 72.491 327.7388 84.772 150.79 1267.116 5
Unit: microseconds
expr min lq mean median uq max neval
boostsort(x, 1L) 511.313 519.277 538.975 522.803 564.231 577.251 5
Unit: milliseconds
expr min lq mean median uq max neval
boostsort(x, 1L) 5.105957 5.261282 5.896635 5.611892 6.207899 7.296144 5
Unit: milliseconds
expr min lq mean median uq max neval
boostsort(x, 1L) 46.73087 48.50289 56.96083 51.99128 65.22243 72.35669 5
Unit: milliseconds
expr min lq mean median uq max neval
boostsort(x, 1L) 508.3247 538.7383 538.4224 539.5442 547.0991 558.4059 5
Unit: seconds
expr min lq mean median uq max neval
boostsort(x, 1L) 8.484083 9.013505 9.442578 9.732026 9.95213 10.03114 5
As an aside, what bit widths are you using? I can't fit 10^9 Float64s in RAM on my computer, but sorting 10^9 Float16s takes me 3.8 seconds on Julia c904ef372c81530ffbde63677010f1aa39018ed5. If you are willing to work with Float16s, they are quite easy to sort at scale. Julia uses algorithms from the 1950s that are easily parallelized, but still only runs them on a single thread. There is certainly room for improvement.
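If you want to try Float16s yourself, the setup is a one-liner (note that a 10^9-element Float16 vector still needs about 2 GB of RAM):

x = rand(Float16, 10^9)  # ~2 GB, a quarter of the Float64 footprint
@time sort!(x)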
I am using Float64. Very strange that the benchmarks are so different 🤔. I would love to see Julia running that fast vs C++ Boost on my computer!
I tried to run your benchmark again but don't see such a difference on my desktop; actually, Boost is faster on my machine for i = 6:8. Which version of R and Rtools are you using? In your benchmark above, I noticed that Julia 1.8-rc3 is slightly faster than 1.9 for i = 1:3.
It seems plausible to me that it is due to hardware differences. What type of hardware are you running?
> Which version of R and Rtools are you using?
I recently got into a spat with my AI assistant, which ran rm -rf /, so I don't know what version of R and Rtools I was previously using. But I redownloaded R and the relevant libraries today and the performance gap remains (version info is included in the provided session; it should be the latest).
I am running on Windows 10, Julia 1.9-DEV.1022 (commit 799a2cfab5). CPU: 8 x Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz. Could it be that your version of Julia includes improvements that are not included in mine? I have tried R 4.2.1 and it is actually a bit faster than R 3.6.1.
Interesting. Your clock speed is twice mine, and yet Julia runs twice as fast on my computer. It may be a performance issue in the interaction between Julia's algorithms and your hardware. Your Julia version is 6 days old and nothing that could explain this 4x discrepancy has changed in the last 6 days.
CPU info is helpful, but comprehensive system specs would be nice, e.g. what I posted above:
Model Name: MacBook Air
Model Identifier: MacBookAir8,2
Processor Name: Dual-Core Intel Core i5
Processor Speed: 1.6 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Hyper-Threading Technology: Enabled
Memory: 8 GB
System Firmware Version: 1715.81.2.0.0 (iBridge: 19.16.10744.0.0,0)
OS Loader Version: 540.80.2~11
Another option (probably better) would be to use SystemBenchmark.jl to attempt to capture general interactions between Julia and your machine. My results are here: https://github.com/IanButterworth/SystemBenchmark.jl/issues/8#issuecomment-1197475145 and instructions for generating them are at the top of that thread.
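If I recall the package's interface correctly, running it looks roughly like this (see the linked thread for the authoritative instructions):

using SystemBenchmark
res = runbenchmark()  # runs the full suite; takes a few minutes
comparetoref(res)     # compare your results against the reference machine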
Please see attached. Note I can't directly log in to GitHub with this machine, so I am uploading a photo.
And while I am at it, I attach the benchmark with R and Julia.
An initial guess I have is that your machine is about 2x faster in most ways, but mine has 2x faster memory bandwidth at the ~100 kB working-set size, which becomes a bottleneck on your computer. I'll look for a fast computer with slow 100 kB memory access to test with, but without being able to profile and fiddle, I don't think I'll be able to diagnose this performance issue remotely.
You could try fiddling around with the chunk size. For example, the default chunk-size heuristic results in a 10-bit chunk and an 8 kB counts vector at most of these sizes, which may be too large for your machine. You can override these defaults like so:
Base.Sort.radix_chunk_size_heuristic(::Integer, ::Integer, ::Unsigned) = UInt8(9)
Hi, would you consider porting Julia's sorting algorithm to a new package and adding multi-threading to it? (Not sure you can add this directly to core Julia?)
To make a good multithreaded sorting algorithm, I'd need easy access to a many-cored machine to develop and benchmark on. Unfortunately, that is not something I have right now, and I am prioritizing other improvements.
Hi, is there a way to extract the sort function (and algorithms) from Base Julia in order to improve them and build a multi-threaded version? Also, if I want to fiddle with the chunk size, it is not easy to do in Base Julia.
Julia's sorting functions and algorithms live at https://github.com/JuliaLang/julia/blob/master/base/sort.jl. You can experiment with sorting internals by redefining methods. For example, this is how I would make Base.Sort's radix sort use a chunk size of 2 bits:
Base.Sort.radix_chunk_size_heuristic(lo::Integer, hi::Integer, bits::Unsigned) = 0x02
Here is a more complex example that reveals a weakness in the existing heuristic:
julia> using BenchmarkTools
julia> @btime sort!(rand(10000));
94.583 μs (5 allocations: 164.53 KiB)
julia> const CHUNK_SIZE = Ref(0x00);
julia> Base.Sort.radix_chunk_size_heuristic(lo::Integer, hi::Integer, bits::Unsigned) = CHUNK_SIZE[]
julia> for i in 1:20
CHUNK_SIZE[] = i
print(lpad(i,2)); @btime sort!(rand(10000));
end
1 1.987 ms (5 allocations: 156.42 KiB)
2 547.792 μs (5 allocations: 156.44 KiB)
3 322.750 μs (5 allocations: 156.47 KiB)
4 219.750 μs (5 allocations: 156.53 KiB)
5 201.000 μs (5 allocations: 156.67 KiB)
6 169.333 μs (5 allocations: 156.91 KiB)
7 114.250 μs (5 allocations: 157.41 KiB)
8 101.209 μs (5 allocations: 158.53 KiB)
9 132.375 μs (5 allocations: 160.53 KiB)
10 93.875 μs (5 allocations: 164.53 KiB)
11 123.750 μs (6 allocations: 172.45 KiB)
12 91.417 μs (6 allocations: 188.45 KiB)
13 125.542 μs (6 allocations: 220.45 KiB)
14 123.167 μs (6 allocations: 284.45 KiB)
15 171.000 μs (6 allocations: 412.45 KiB)
16 244.000 μs (6 allocations: 668.45 KiB)
17 384.250 μs (6 allocations: 1.15 MiB)
18 670.958 μs (6 allocations: 2.15 MiB)
19 1.010 ms (6 allocations: 4.15 MiB)
20 2.072 ms (6 allocations: 8.15 MiB)
Apparently, a chunk size of 12 is about 3% faster than a chunk size of 10 in this case.
I have slightly changed your code to always sort the same random vector (of 10^8 elements).
# assumes CHUNK_SIZE and the radix_chunk_size_heuristic override defined above
x = rand(10^8);
for i in 1:20
    CHUNK_SIZE[] = i
    print(lpad(i, 2))
    @btime sort($x)
end
These are the results I get.
It seems that on your machine a chunk size of 10 is 1.6x faster than a chunk size of 12. Looking at these benchmark results, the default chunk size of 10 in this case seems like a great choice for both of our machines :).
Aside: please try to share text rather than physical pictures of a screen.
Sorry for that, but I don't have access to GitHub from that machine, and it is a bit of a pain to type everything by hand; I have a restricted internet connection.
Hi, is the purpose of this repo to compare the best sorting algorithm in each language? If so, would you be able to add C++ Boost's block_indirect_sort to your benchmark? It is multi-threaded as well. Thank you. Best regards