Quicksort 'Magnetica' submission

Sanmayce commented 2 years ago

Hi, please add to your beautiful collection the fastest (known to me) Quicksort: https://www.qb64.org/forum/index.php?topic=3518.msg138244#msg138244

PCBoyGames commented 2 years ago

I may have found a way to successfully port it into our program. I do need to know, though, does this visual seem correct?

chromi commented 2 years ago

Congratulations, you appear to have reinvented ternary quicksort with Yaroslavskiy partitioning and "middle element" pivot selection. Given that, the visualisation looks correct to me.

Sanmayce commented 2 years ago

@PCBoyGames Many thanks, love both your video and sound. Yes, it works correctly. I wish people to have practical experience, not the "university" style of not going into details, as we all know, the devil is in the detail, or rather, the detail is the devil that works, heh-heh.

@chromi Thanks, not sure, the only PDF/paper I know of is: https://web.archive.org/web/20151002230717/http://iaroslavski.narod.ru/quicksort/DualPivotQuicksort.pdf

Does he have single-pivot variant along with dual-pivot one? My Magnetica uses single-pivot.

chromi commented 2 years ago

Both ternary and dual-pivot quicksort create three partitions, just on slightly different criteria. Your method of doing so is the same, and characteristically leaves the equal-to-pivot values in the middle so they don't have to be swapped into position again.

I have a version of quicksort that selects two pivots (using a random sampling method, which avoids quadratic pathologies much better than any fixed choice), then checks whether they are equal to each other. If they are, it does a ternary partition very like yours. If not, it does a dual-pivot partition very like Yaroslavskiy's. This tends to obtain the best qualities of both.

Sanmayce commented 2 years ago

It seems you have superior strategy, bravo. As for the random choosing, I avoid it since benchmarking becomes not "stable".

chromi commented 2 years ago

You could choose pivots deterministically as well, for example by taking the 0/7, 1/7, …, 7/7 positioned elements, sorting them (insertion sort is good enough), then choosing the 3rd and 6th ranked as two pivots (ie. tertiles of 8). Or the 0/4, ¼, ½, ¾, 4/4 positioned, then the 3rd ranked for a single pivot (ie. median of 5). It is still possible to deliberately create inputs that provoke quadratic behaviour with this (eg. ArrayV has an "AntiQSort" mode that generates them), but you should find fewer cases where your pivot is wildly out of place on average.

Or, for benchmarking, you could seed the RNG with a consistent value, but switch to a truly random seed in production. Then instead of picking samples from fixed positions, you use the RNG to choose them, then proceed as above.

Finally, you will want to switch from quicksort to a simpler algorithm for small partitions, for which the work of selecting a good pivot and setting up the partitioning algorithm becomes a large overhead. Insertion sort becomes superior below about a dozen elements. Shellsort with a good choice of gap sequence could be a win for as much as 4000 elements:

size_t a=0x400, b=0x30, gap;
do {
    gap = a+b+1; a /= 4; b /= 2;
    for(size_t i = gap; i < A.size(); i++)
        for(size_t j = i; j >= gap && A[j] < A[j-gap]; j -= gap)
            std::swap(A[j], A[j-gap]);
} while(gap > 1);

The above uses a modified Sedgewick 1982 sequence: 1073, 281, 77, 23, 8, 3, 1.

Sanmayce commented 2 years ago

@chromi Thanks for the ideas.

you will want to switch from quicksort to a simpler algorithm for small partitions, Indeed, this is a must, however my wish was the initial Magnetica to be as simple as possible, to serve as a baseline.

Insertion sort becomes superior below about a dozen elements. As far as I see, the magic number with Magnetica (surely for scenarios with few (or not many) distinct keys) should be bigger, my intent is to fix it as 19, but first will benchmark on two CPUs using GCC 10.2.1 and Intel v15.0, that is, to see whether modern hardware and compilers "allow" 19.

Just found this gold benchmark/source mine: https://stackoverflow.com/questions/2786899/fastest-sort-of-fixed-length-6-int-array?rq=1

Love the branchless etudes, so my intent is to choose Pivot as the middle of 6 or 7 elements ... chosen somehow. LOVE the whole experimental attitude from all the coders there.

What is your favorite among these golden etudes?

Gaming32 commented 2 years ago

Recommendation: Join our Discord, The Studio

Sanmayce commented 2 years ago

In case you want to use the best I could write so far: https://www.qb64.org/forum/index.php?action=dlattach;topic=3518.0;attach=16024

It chooses median of 5 pivot - branchlessly! Magnetica (the C source from above link) compared with qsort from Fedora 33 and BM=quicksort_Bentley_McIlroy_3way_partitioning, gcc 10.2.1 used.

Again, it outperforms both! But, now it is more robust against quadratic behavior - surely will scream with 'Organ Pipe' and such...

On laptop with i5-7200U 3.1GHz max turbo, 36GB DDR4 2133MHz:

[kaze@kaze Quicksort_says_rev4]$ gcc -O3 QS_bench_r4.c -o QS_bench_r4.elf
[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf qsort few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Done in 360 seconds.
Checking whether sort went correct... OK. Unique keys = 10

 Performance counter stats for './QS_bench_r4.elf qsort few':

        374,137.53 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,363,067      page-faults:u             #    0.012 M/sec                  
 1,130,498,069,102      cycles:u                  #    3.022 GHz                      (50.00%)
 3,279,617,021,921      instructions:u            #    2.90  insn per cycle           (62.50%)
   650,533,917,251      branches:u                # 1738.756 M/sec                    (62.50%)
     5,521,100,564      branch-misses:u           #    0.85% of all branches          (62.50%)
   734,144,831,583      L1-dcache-loads:u         # 1962.233 M/sec                    (62.50%)
     6,748,748,435      L1-dcache-load-misses:u   #    0.92% of all L1-dcache accesses  (62.50%)
       401,878,659      LLC-loads:u               #    1.074 M/sec                    (50.00%)
       106,122,616      LLC-load-misses:u         #   26.41% of all LL-cache accesses  (50.00%)

     374.678334228 seconds time elapsed

     366.210023000 seconds user
       5.690335000 seconds sys

[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf Magnetica few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Done in 35 seconds.
Checking whether sort went correct... OK. Unique keys = 10

 Performance counter stats for './QS_bench_r4.elf Magnetica few':

         48,568.57 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,363,068      page-faults:u             #    0.090 M/sec                  
   131,511,912,878      cycles:u                  #    2.708 GHz                      (50.00%)
   112,398,266,721      instructions:u            #    0.85  insn per cycle           (62.50%)
    24,083,136,669      branches:u                #  495.858 M/sec                    (62.50%)
     3,277,693,748      branch-misses:u           #   13.61% of all branches          (62.50%)
    13,332,653,082      L1-dcache-loads:u         #  274.512 M/sec                    (62.50%)
     1,657,129,638      L1-dcache-load-misses:u   #   12.43% of all L1-dcache accesses  (62.50%)
        61,079,406      LLC-loads:u               #    1.258 M/sec                    (50.00%)
        52,834,258      LLC-load-misses:u         #   86.50% of all LL-cache accesses  (50.00%)

      48.592132565 seconds time elapsed

      42.699590000 seconds user
       5.537604000 seconds sys

[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf BM few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Done in 40 seconds.
Checking whether sort went correct... OK. Unique keys = 10

 Performance counter stats for './QS_bench_r4.elf BM few':

         54,275.83 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,363,067      page-faults:u             #    0.080 M/sec                  
   148,622,728,821      cycles:u                  #    2.738 GHz                      (49.99%)
   141,790,933,565      instructions:u            #    0.95  insn per cycle           (62.49%)
    29,021,159,581      branches:u                #  534.698 M/sec                    (62.50%)
     2,945,037,342      branch-misses:u           #   10.15% of all branches          (62.51%)
    20,760,966,368      L1-dcache-loads:u         #  382.509 M/sec                    (62.51%)
     2,254,743,473      L1-dcache-load-misses:u   #   10.86% of all L1-dcache accesses  (62.51%)
       206,505,163      LLC-loads:u               #    3.805 M/sec                    (50.00%)
        97,481,857      LLC-load-misses:u         #   47.21% of all LL-cache accesses  (49.99%)

      54.295299616 seconds time elapsed

      48.359850000 seconds user
       5.552888000 seconds sys

[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf qsort many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Done in 535 seconds.
Checking whether sort went correct... OK. Unique keys = 2847531

 Performance counter stats for './QS_bench_r4.elf qsort many':

        549,999.00 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,896,780      page-faults:u             #    0.009 M/sec                  
 1,668,181,170,985      cycles:u                  #    3.033 GHz                      (50.00%)
 2,936,904,303,452      instructions:u            #    1.76  insn per cycle           (62.50%)
   656,551,483,104      branches:u                # 1193.732 M/sec                    (62.50%)
    24,993,461,483      branch-misses:u           #    3.81% of all branches          (62.50%)
   675,411,011,343      L1-dcache-loads:u         # 1228.022 M/sec                    (62.50%)
     8,360,278,671      L1-dcache-load-misses:u   #    1.24% of all L1-dcache accesses  (62.50%)
       224,636,221      LLC-loads:u               #    0.408 M/sec                    (50.00%)
       128,684,015      LLC-load-misses:u         #   57.29% of all LL-cache accesses  (50.00%)

     550.260731723 seconds time elapsed

     540.337133000 seconds user
       6.272608000 seconds sys

[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf Magnetica many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Done in 229 seconds.
Checking whether sort went correct... OK. Unique keys = 2847531

 Performance counter stats for './QS_bench_r4.elf Magnetica many':

        244,034.76 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,896,784      page-faults:u             #    0.020 M/sec                  
   728,706,789,305      cycles:u                  #    2.986 GHz                      (50.00%)
   801,207,820,988      instructions:u            #    1.10  insn per cycle           (62.50%)
   152,404,608,907      branches:u                #  624.520 M/sec                    (62.50%)
    18,717,429,150      branch-misses:u           #   12.28% of all branches          (62.50%)
   105,275,157,071      L1-dcache-loads:u         #  431.394 M/sec                    (62.50%)
     9,759,976,614      L1-dcache-load-misses:u   #    9.27% of all L1-dcache accesses  (62.50%)
       262,730,414      LLC-loads:u               #    1.077 M/sec                    (50.00%)
       220,216,469      LLC-load-misses:u         #   83.82% of all LL-cache accesses  (50.00%)

     244.123151814 seconds time elapsed

     236.191392000 seconds user
       6.118416000 seconds sys

[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf BM many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Done in 230 seconds.
Checking whether sort went correct... OK. Unique keys = 2847531

 Performance counter stats for './QS_bench_r4.elf BM many':

        245,002.86 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,896,781      page-faults:u             #    0.020 M/sec                  
   731,986,011,848      cycles:u                  #    2.988 GHz                      (50.00%)
   655,933,578,244      instructions:u            #    0.90  insn per cycle           (62.50%)
   165,775,989,922      branches:u                #  676.629 M/sec                    (62.50%)
    18,932,375,613      branch-misses:u           #   11.42% of all branches          (62.50%)
    85,130,377,588      L1-dcache-loads:u         #  347.467 M/sec                    (62.50%)
     8,621,483,859      L1-dcache-load-misses:u   #   10.13% of all L1-dcache accesses  (62.50%)
       228,200,878      LLC-loads:u               #    0.931 M/sec                    (50.00%)
       163,363,682      LLC-load-misses:u         #   71.59% of all LL-cache accesses  (50.00%)

     245.112888312 seconds time elapsed

     237.289127000 seconds user
       6.076702000 seconds sys

[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf qsort ALL
Allocating FILE-Buffer 1916MB ...
Allocating AUX-Buffer 15329MB ...
Allocating Master-Buffer 15329MB ...
Sorting in single-thread 2009333753 elements...
Done in 520 seconds.
Checking whether sort went correct... OK. Unique keys = 1912608132

 Performance counter stats for './QS_bench_r4.elf qsort ALL':

        543,037.81 msec task-clock:u              #    0.993 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         7,849,013      page-faults:u             #    0.014 M/sec                  
 1,623,485,482,541      cycles:u                  #    2.990 GHz                      (50.00%)
 2,271,412,145,371      instructions:u            #    1.40  insn per cycle           (62.50%)
   537,049,261,949      branches:u                #  988.972 M/sec                    (62.50%)
    28,403,315,111      branch-misses:u           #    5.29% of all branches          (62.50%)
   520,083,072,189      L1-dcache-loads:u         #  957.729 M/sec                    (62.50%)
     6,711,300,111      L1-dcache-load-misses:u   #    1.29% of all L1-dcache accesses  (62.50%)
       179,157,605      LLC-loads:u               #    0.330 M/sec                    (50.00%)
       101,853,848      LLC-load-misses:u         #   56.85% of all LL-cache accesses  (50.00%)

     547.023270509 seconds time elapsed

     526.879143000 seconds user
      12.702404000 seconds sys

[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf Magnetica ALL
Allocating FILE-Buffer 1916MB ...
Allocating AUX-Buffer 15329MB ...
Allocating Master-Buffer 15329MB ...
Sorting in single-thread 2009333753 elements...
Done in 302 seconds.
Checking whether sort went correct... OK. Unique keys = 1912608132

 Performance counter stats for './QS_bench_r4.elf Magnetica ALL':

        324,320.50 msec task-clock:u              #    0.991 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         7,849,024      page-faults:u             #    0.024 M/sec                  
   953,068,016,151      cycles:u                  #    2.939 GHz                      (50.00%)
 1,078,558,010,841      instructions:u            #    1.13  insn per cycle           (62.50%)
   202,072,815,438      branches:u                #  623.065 M/sec                    (62.49%)
    24,712,860,248      branch-misses:u           #   12.23% of all branches          (62.50%)
   143,013,173,669      L1-dcache-loads:u         #  440.962 M/sec                    (62.50%)
     7,341,720,240      L1-dcache-load-misses:u   #    5.13% of all L1-dcache accesses  (62.51%)
       206,349,700      LLC-loads:u               #    0.636 M/sec                    (50.00%)
       171,267,094      LLC-load-misses:u         #   83.00% of all LL-cache accesses  (50.00%)

     327.277372069 seconds time elapsed

     309.871905000 seconds user
      12.146404000 seconds sys

[kaze@kaze Quicksort_says_rev4]$ perf stat -d ./QS_bench_r4.elf BM ALL
Allocating FILE-Buffer 1916MB ...
Allocating AUX-Buffer 15329MB ...
Allocating Master-Buffer 15329MB ...
Sorting in single-thread 2009333753 elements...
Done in 312 seconds.
Checking whether sort went correct... OK. Unique keys = 1912608132

 Performance counter stats for './QS_bench_r4.elf BM ALL':

        336,008.08 msec task-clock:u              #    0.994 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         7,849,047      page-faults:u             #    0.023 M/sec                  
   987,096,373,815      cycles:u                  #    2.938 GHz                      (50.00%)
   896,653,900,788      instructions:u            #    0.91  insn per cycle           (62.50%)
   226,954,097,302      branches:u                #  675.442 M/sec                    (62.50%)
    26,002,237,784      branch-misses:u           #   11.46% of all branches          (62.50%)
   119,818,264,279      L1-dcache-loads:u         #  356.593 M/sec                    (62.50%)
     7,935,978,293      L1-dcache-load-misses:u   #    6.62% of all L1-dcache accesses  (62.50%)
       198,601,062      LLC-loads:u               #    0.591 M/sec                    (50.00%)
       148,030,944      LLC-load-misses:u         #   74.54% of all LL-cache accesses  (50.00%)

     338.139378929 seconds time elapsed

     320.860053000 seconds user
      12.847897000 seconds sys

[kaze@kaze Quicksort_says_rev4]$

Gaming32 commented 2 years ago

How does it compare against PDQSort?

Sanmayce commented 2 years ago

@Gaming32 Don't know, it would be good to compare them, but from what I see at first glance, Magnetica is superior.

PCBoyGames commented 2 years ago

I have sent the algorithm in as part of a pull request to the ArrayV Extra Sorts Pack. Keep an eye out: Gaming32/ArrayV-Extra-Sorts#1

Gaming32 / ArrayV

Quicksort 'Magnetica' submission #73