Closed Sanmayce closed 2 years ago
I may have found a way to successfully port it into our program. I do need to know, though, does this visual seem correct?
Congratulations, you appear to have reinvented ternary quicksort with Yaroslavskiy partitioning and "middle element" pivot selection. Given that, the visualisation looks correct to me.
@PCBoyGames Many thanks, love both your video and sound. Yes, it works correctly. I wish people to have practical experience, not the "university" style of not going into details, as we all know, the devil is in the detail, or rather, the detail is the devil that works, heh-heh.
@chromi Thanks, not sure, the only PDF/paper I know of is: https://web.archive.org/web/20151002230717/http://iaroslavski.narod.ru/quicksort/DualPivotQuicksort.pdf
Does he have single-pivot variant along with dual-pivot one? My Magnetica uses single-pivot.
Both ternary and dual-pivot quicksort create three partitions, just on slightly different criteria. Your method of doing so is the same, and characteristically leaves the equal-to-pivot values in the middle so they don't have to be swapped into position again.
I have a version of quicksort that selects two pivots (using a random sampling method, which avoids quadratic pathologies much better than any fixed choice), then checks whether they are equal to each other. If they are, it does a ternary partition very like yours. If not, it does a dual-pivot partition very like Yaroslavskiy's. This tends to obtain the best qualities of both.
It seems you have superior strategy, bravo. As for the random choosing, I avoid it since benchmarking becomes not "stable".
You could choose pivots deterministically as well, for example by taking the 0/7, 1/7, …, 7/7 positioned elements, sorting them (insertion sort is good enough), then choosing the 3rd and 6th ranked as two pivots (ie. tertiles of 8). Or the 0/4, ¼, ½, ¾, 4/4 positioned, then the 3rd ranked for a single pivot (ie. median of 5). It is still possible to deliberately create inputs that provoke quadratic behaviour with this (eg. ArrayV has an "AntiQSort" mode that generates them), but you should find fewer cases where your pivot is wildly out of place on average.
Or, for benchmarking, you could seed the RNG with a consistent value, but switch to a truly random seed in production. Then instead of picking samples from fixed positions, you use the RNG to choose them, then proceed as above.
Finally, you will want to switch from quicksort to a simpler algorithm for small partitions, for which the work of selecting a good pivot and setting up the partitioning algorithm becomes a large overhead. Insertion sort becomes superior below about a dozen elements. Shellsort with a good choice of gap sequence could be a win for as much as 4000 elements:
size_t a=0x400, b=0x30, gap;
do {
gap = a+b+1; a /= 4; b /= 2;
for(size_t i = gap; i < A.size(); i++)
for(size_t j = i; j >= gap && A[j] < A[j-gap]; j -= gap)
std::swap(A[j], A[j-gap]);
} while(gap > 1);
The above uses a modified Sedgewick 1982 sequence: 1073, 281, 77, 23, 8, 3, 1.
@chromi Thanks for the ideas.
you will want to switch from quicksort to a simpler algorithm for small partitions, Indeed, this is a must, however my wish was the initial Magnetica to be as simple as possible, to serve as a baseline.
Insertion sort becomes superior below about a dozen elements. As far as I see, the magic number with Magnetica (surely for scenarios with few (or not many) distinct keys) should be bigger, my intent is to fix it as 19, but first will benchmark on two CPUs using GCC 10.2.1 and Intel v15.0, that is, to see whether modern hardware and compilers "allow" 19.
Just found this gold benchmark/source mine: https://stackoverflow.com/questions/2786899/fastest-sort-of-fixed-length-6-int-array?rq=1
Love the branchless etudes, so my intent is to choose Pivot as the middle of 6 or 7 elements ... chosen somehow. LOVE the whole experimental attitude from all the coders there.
What is your favorite among these golden etudes?
Recommendation: Join our Discord, The Studio
In case you want to use the best I could write so far: https://www.qb64.org/forum/index.php?action=dlattach;topic=3518.0;attach=16024
It chooses median of 5 pivot - branchlessly! Magnetica (the C source from above link) compared with qsort from Fedora 33 and BM=quicksort_Bentley_McIlroy_3way_partitioning, gcc 10.2.1 used.
Again, it outperforms both! But, now it is more robust against quadratic behavior - surely will scream with 'Organ Pipe' and such...
On laptop with i5-7200U 3.1GHz max turbo, 36GB DDR4 2133MHz:
[kaze@kaze Quicksort_says_rev4]$ gcc -O3 QS_bench_r4.c -o QS_bench_r4.elf
[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf qsort few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Done in 360 seconds.
Checking whether sort went correct... OK. Unique keys = 10
Performance counter stats for './QS_bench_r4.elf qsort few':
374,137.53 msec task-clock:u # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
4,363,067 page-faults:u # 0.012 M/sec
1,130,498,069,102 cycles:u # 3.022 GHz (50.00%)
3,279,617,021,921 instructions:u # 2.90 insn per cycle (62.50%)
650,533,917,251 branches:u # 1738.756 M/sec (62.50%)
5,521,100,564 branch-misses:u # 0.85% of all branches (62.50%)
734,144,831,583 L1-dcache-loads:u # 1962.233 M/sec (62.50%)
6,748,748,435 L1-dcache-load-misses:u # 0.92% of all L1-dcache accesses (62.50%)
401,878,659 LLC-loads:u # 1.074 M/sec (50.00%)
106,122,616 LLC-load-misses:u # 26.41% of all LL-cache accesses (50.00%)
374.678334228 seconds time elapsed
366.210023000 seconds user
5.690335000 seconds sys
[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf Magnetica few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Done in 35 seconds.
Checking whether sort went correct... OK. Unique keys = 10
Performance counter stats for './QS_bench_r4.elf Magnetica few':
48,568.57 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
4,363,068 page-faults:u # 0.090 M/sec
131,511,912,878 cycles:u # 2.708 GHz (50.00%)
112,398,266,721 instructions:u # 0.85 insn per cycle (62.50%)
24,083,136,669 branches:u # 495.858 M/sec (62.50%)
3,277,693,748 branch-misses:u # 13.61% of all branches (62.50%)
13,332,653,082 L1-dcache-loads:u # 274.512 M/sec (62.50%)
1,657,129,638 L1-dcache-load-misses:u # 12.43% of all L1-dcache accesses (62.50%)
61,079,406 LLC-loads:u # 1.258 M/sec (50.00%)
52,834,258 LLC-load-misses:u # 86.50% of all LL-cache accesses (50.00%)
48.592132565 seconds time elapsed
42.699590000 seconds user
5.537604000 seconds sys
[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf BM few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Done in 40 seconds.
Checking whether sort went correct... OK. Unique keys = 10
Performance counter stats for './QS_bench_r4.elf BM few':
54,275.83 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
4,363,067 page-faults:u # 0.080 M/sec
148,622,728,821 cycles:u # 2.738 GHz (49.99%)
141,790,933,565 instructions:u # 0.95 insn per cycle (62.49%)
29,021,159,581 branches:u # 534.698 M/sec (62.50%)
2,945,037,342 branch-misses:u # 10.15% of all branches (62.51%)
20,760,966,368 L1-dcache-loads:u # 382.509 M/sec (62.51%)
2,254,743,473 L1-dcache-load-misses:u # 10.86% of all L1-dcache accesses (62.51%)
206,505,163 LLC-loads:u # 3.805 M/sec (50.00%)
97,481,857 LLC-load-misses:u # 47.21% of all LL-cache accesses (49.99%)
54.295299616 seconds time elapsed
48.359850000 seconds user
5.552888000 seconds sys
[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf qsort many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Done in 535 seconds.
Checking whether sort went correct... OK. Unique keys = 2847531
Performance counter stats for './QS_bench_r4.elf qsort many':
549,999.00 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
4,896,780 page-faults:u # 0.009 M/sec
1,668,181,170,985 cycles:u # 3.033 GHz (50.00%)
2,936,904,303,452 instructions:u # 1.76 insn per cycle (62.50%)
656,551,483,104 branches:u # 1193.732 M/sec (62.50%)
24,993,461,483 branch-misses:u # 3.81% of all branches (62.50%)
675,411,011,343 L1-dcache-loads:u # 1228.022 M/sec (62.50%)
8,360,278,671 L1-dcache-load-misses:u # 1.24% of all L1-dcache accesses (62.50%)
224,636,221 LLC-loads:u # 0.408 M/sec (50.00%)
128,684,015 LLC-load-misses:u # 57.29% of all LL-cache accesses (50.00%)
550.260731723 seconds time elapsed
540.337133000 seconds user
6.272608000 seconds sys
[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf Magnetica many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Done in 229 seconds.
Checking whether sort went correct... OK. Unique keys = 2847531
Performance counter stats for './QS_bench_r4.elf Magnetica many':
244,034.76 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
4,896,784 page-faults:u # 0.020 M/sec
728,706,789,305 cycles:u # 2.986 GHz (50.00%)
801,207,820,988 instructions:u # 1.10 insn per cycle (62.50%)
152,404,608,907 branches:u # 624.520 M/sec (62.50%)
18,717,429,150 branch-misses:u # 12.28% of all branches (62.50%)
105,275,157,071 L1-dcache-loads:u # 431.394 M/sec (62.50%)
9,759,976,614 L1-dcache-load-misses:u # 9.27% of all L1-dcache accesses (62.50%)
262,730,414 LLC-loads:u # 1.077 M/sec (50.00%)
220,216,469 LLC-load-misses:u # 83.82% of all LL-cache accesses (50.00%)
244.123151814 seconds time elapsed
236.191392000 seconds user
6.118416000 seconds sys
[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf BM many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Done in 230 seconds.
Checking whether sort went correct... OK. Unique keys = 2847531
Performance counter stats for './QS_bench_r4.elf BM many':
245,002.86 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
4,896,781 page-faults:u # 0.020 M/sec
731,986,011,848 cycles:u # 2.988 GHz (50.00%)
655,933,578,244 instructions:u # 0.90 insn per cycle (62.50%)
165,775,989,922 branches:u # 676.629 M/sec (62.50%)
18,932,375,613 branch-misses:u # 11.42% of all branches (62.50%)
85,130,377,588 L1-dcache-loads:u # 347.467 M/sec (62.50%)
8,621,483,859 L1-dcache-load-misses:u # 10.13% of all L1-dcache accesses (62.50%)
228,200,878 LLC-loads:u # 0.931 M/sec (50.00%)
163,363,682 LLC-load-misses:u # 71.59% of all LL-cache accesses (50.00%)
245.112888312 seconds time elapsed
237.289127000 seconds user
6.076702000 seconds sys
[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf qsort ALL
Allocating FILE-Buffer 1916MB ...
Allocating AUX-Buffer 15329MB ...
Allocating Master-Buffer 15329MB ...
Sorting in single-thread 2009333753 elements...
Done in 520 seconds.
Checking whether sort went correct... OK. Unique keys = 1912608132
Performance counter stats for './QS_bench_r4.elf qsort ALL':
543,037.81 msec task-clock:u # 0.993 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
7,849,013 page-faults:u # 0.014 M/sec
1,623,485,482,541 cycles:u # 2.990 GHz (50.00%)
2,271,412,145,371 instructions:u # 1.40 insn per cycle (62.50%)
537,049,261,949 branches:u # 988.972 M/sec (62.50%)
28,403,315,111 branch-misses:u # 5.29% of all branches (62.50%)
520,083,072,189 L1-dcache-loads:u # 957.729 M/sec (62.50%)
6,711,300,111 L1-dcache-load-misses:u # 1.29% of all L1-dcache accesses (62.50%)
179,157,605 LLC-loads:u # 0.330 M/sec (50.00%)
101,853,848 LLC-load-misses:u # 56.85% of all LL-cache accesses (50.00%)
547.023270509 seconds time elapsed
526.879143000 seconds user
12.702404000 seconds sys
[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf Magnetica ALL
Allocating FILE-Buffer 1916MB ...
Allocating AUX-Buffer 15329MB ...
Allocating Master-Buffer 15329MB ...
Sorting in single-thread 2009333753 elements...
Done in 302 seconds.
Checking whether sort went correct... OK. Unique keys = 1912608132
Performance counter stats for './QS_bench_r4.elf Magnetica ALL':
324,320.50 msec task-clock:u # 0.991 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
7,849,024 page-faults:u # 0.024 M/sec
953,068,016,151 cycles:u # 2.939 GHz (50.00%)
1,078,558,010,841 instructions:u # 1.13 insn per cycle (62.50%)
202,072,815,438 branches:u # 623.065 M/sec (62.49%)
24,712,860,248 branch-misses:u # 12.23% of all branches (62.50%)
143,013,173,669 L1-dcache-loads:u # 440.962 M/sec (62.50%)
7,341,720,240 L1-dcache-load-misses:u # 5.13% of all L1-dcache accesses (62.51%)
206,349,700 LLC-loads:u # 0.636 M/sec (50.00%)
171,267,094 LLC-load-misses:u # 83.00% of all LL-cache accesses (50.00%)
327.277372069 seconds time elapsed
309.871905000 seconds user
12.146404000 seconds sys
[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf BM ALL
Allocating FILE-Buffer 1916MB ...
Allocating AUX-Buffer 15329MB ...
Allocating Master-Buffer 15329MB ...
Sorting in single-thread 2009333753 elements...
Done in 312 seconds.
Checking whether sort went correct... OK. Unique keys = 1912608132
Performance counter stats for './QS_bench_r4.elf BM ALL':
336,008.08 msec task-clock:u # 0.994 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
7,849,047 page-faults:u # 0.023 M/sec
987,096,373,815 cycles:u # 2.938 GHz (50.00%)
896,653,900,788 instructions:u # 0.91 insn per cycle (62.50%)
226,954,097,302 branches:u # 675.442 M/sec (62.50%)
26,002,237,784 branch-misses:u # 11.46% of all branches (62.50%)
119,818,264,279 L1-dcache-loads:u # 356.593 M/sec (62.50%)
7,935,978,293 L1-dcache-load-misses:u # 6.62% of all L1-dcache accesses (62.50%)
198,601,062 LLC-loads:u # 0.591 M/sec (50.00%)
148,030,944 LLC-load-misses:u # 74.54% of all LL-cache accesses (50.00%)
338.139378929 seconds time elapsed
320.860053000 seconds user
12.847897000 seconds sys
[kaze@kaze Quicksort_says_rev4]$
@Gaming32 Don't know, it would be good to compare them, but from what I see at first glance, Magnetica is superior.
I have sent the algorithm in as part of a pull request to the ArrayV Extra Sorts Pack. Keep an eye out: Gaming32/ArrayV-Extra-Sorts#1
Hi, please add to your beautiful collection the fastest (known to me) Quicksort: https://www.qb64.org/forum/index.php?topic=3518.msg138244#msg138244