Sanmayce / Schmekeriada

Superfast Linux/Windows console tool to sort lines, internally

It isn't faster than ClickHouse #2

Open alexey-milovidov opened 9 months ago

alexey-milovidov commented 9 months ago

I created a file, URLs.tsv, and sorted it with Schmekeriada (the TetraThread build) and with ClickHouse limited to 4 threads, for comparison. The results were as follows:

milovidov@milovidov-desktop:~/work$ git clone git@github.com:Sanmayce/Schmekeriada.git
Cloning into 'Schmekeriada'...
remote: Enumerating objects: 18, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 18 (delta 3), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (18/18), 4.69 MiB | 5.50 MiB/s, done.
Resolving deltas: 100% (3/3), done.
milovidov@milovidov-desktop:~/work$ 
milovidov@milovidov-desktop:~/work/Schmekeriada$ tar xvf Schmekerezada_no-corpora.tar.gz 
Schmekerezada_no-corpora/README_Schmekerezada.TXT
Schmekerezada_no-corpora/crumsort.c
Schmekerezada_no-corpora/crumsort.h
Schmekerezada_no-corpora/Magnetica_v18.h
Schmekerezada_no-corpora/MAKE_CLANG_Schmekeriada.bat
Schmekerezada_no-corpora/MAKE_ICL.bat
Schmekerezada_no-corpora/quadsort.c
Schmekerezada_no-corpora/quadsort.h
Schmekerezada_no-corpora/Quicksort_Magnetica_COVERS.pdf
Schmekerezada_no-corpora/timer64.exe
Schmekerezada_no-corpora/make_elf_exe_GCC_Schmekeriada.sh
Schmekerezada_no-corpora/make_elf_CLANG_Schmekeriada.sh
Schmekerezada_no-corpora/Schmekerezada_GCC_13.0.1_SSE4.2_TetraThread.elf.asm
Schmekerezada_no-corpora/Schmekeriada.c
Schmekerezada_no-corpora/Akkodah_v3.h
Schmekerezada_no-corpora/Schmekerezada_CLANG_16.0.1_SSE4.2_TetraThread.elf
Schmekerezada_no-corpora/Schmekerezada_GCC_13.0.1_SSE4.2_TetraThread.elf
Schmekerezada_no-corpora/Schmekerezada_GCC_13.0.1_SSE4.2_MonoThread.elf
Schmekerezada_no-corpora/Schmekerezada_GCC_13.0.1_SSE4.2_TetraThread.exe
Schmekerezada_no-corpora/Schmekerezada_GCC_13.0.1_SSE4.2_MonoThread.exe
Schmekerezada_no-corpora/Schmekerezada_CLANG_16.0.1_SSE4.2_TetraThread.elf.asm
Schmekerezada_no-corpora/sort_vs_Schmekerezada.sh
Schmekerezada_no-corpora/GENERATE_Xmillion_Knight-Tours.bat
Schmekerezada_no-corpora/GENERATE_Xmillion_Knight-Tours.sh
Schmekerezada_no-corpora/Knight-Tour_FNV1A_YoshimitsuTRIADii_vs_CRC32_TRISMUS.elf
Schmekerezada_no-corpora/Knight-Tour_FNV1A_YoshimitsuTRIADii_vs_CRC32_TRISMUS.c
Schmekerezada_no-corpora/Knight-Tour_FNV1A_YoshimitsuTRIADii_vs_CRC32_TRISMUS.exe
Schmekerezada_no-corpora/bench_PARAMETER.sh
Schmekerezada_no-corpora/log_su_Intel_Kaby-Lake_i5-7200U_Cores-2.txt
Schmekerezada_no-corpora/log_su_Intel_Celeron_N4100_Cores-4.txt
Schmekerezada_no-corpora/

milovidov@milovidov-desktop:~/work/Schmekeriada/Schmekerezada_no-corpora$ ./make_elf_CLANG_Schmekeriada.sh

milovidov@milovidov-desktop:~/work/Schmekeriada/Schmekerezada_no-corpora$ 
milovidov@milovidov-desktop:~/work/Schmekeriada/Schmekerezada_no-corpora$ #./Schmekerezada_CLANG_16.0.1_SSE4.2_TetraThread.elf 

milovidov@milovidov-desktop:~/work/Schmekeriada/Schmekerezada_no-corpora$ ./Schmekerezada_CLANG_16.0.1_SSE4.2_TetraThread.elf 
   _________        .__                       __                                               .___        
  /   _____/  ____  |  |__    _____    ____  |  | __  ____ _______   ____  _____________     __| _/_____   
  \_____  \ _/ ___\ |  |  \  /     \ _/ __ \ |  |/ /_/ __ \\_  __ \_/ __ \ \___   /\__  \   / __ | \__  \  
  /        \\  \___ |   Y  \|  Y Y  \\  ___/ |    < \  ___/ |  | \/\  ___/  /    /  / __ \_/ /_/ |  / __ \_
 /_______  / \___  >|___|  /|__|_|  / \___  >|__|_ \ \___  >|__|    \___  >/_____ \(____  /\____ | (____  /
         \/      \/      \/       \/      \/      \/     \/             \/       \/     \/      \/      \/ 
This build (2023-Jul-10) features Quicksort-Magnetica, buffered dump of sorted data;
bugfix: forgot to mask the dummy threads; forgotten renaming of old function; a branch debranchified.
This tool is 100% FREE and open-source, for improvements: sanmayce@sanmayce.com, enfun!
Current priority is 0.
Usage1: Schmekeriada filename
Usage2: Schmekeriada sortertag corpustag
Usage3: Schmekeriada filename startingEXPonent endingEXPonent
Example1: Schmekeriada README.TXT - sorts README.TXT to 'Schmekeriada.txt'
Example2: Schmekeriada Akkoda manyC - sorts 'linux-5.15.25.tar' with Akkoda
Example3: Schmekeriada enwik9 0 10 - reports UBBs i.e. Unique-Building-Blocks for (powers of 2 only) orders 2^0 to 2^10
milovidov@milovidov-desktop:~/work/Schmekeriada/Schmekerezada_no-corpora$ clickhouse-client --query "SELECT URL FROM hits INTO OUTFILE 'URLs.tsv'" --progress
milovidov@milovidov-desktop:~/work/Schmekeriada/Schmekerezada_no-corpora$ time ./Schmekerezada_CLANG_16.0.1_SSE4.2_TetraThread.elf URLs.tsv
   _________        .__                       __                                               .___        
  /   _____/  ____  |  |__    _____    ____  |  | __  ____ _______   ____  _____________     __| _/_____   
  \_____  \ _/ ___\ |  |  \  /     \ _/ __ \ |  |/ /_/ __ \\_  __ \_/ __ \ \___   /\__  \   / __ | \__  \  
  /        \\  \___ |   Y  \|  Y Y  \\  ___/ |    < \  ___/ |  | \/\  ___/  /    /  / __ \_/ /_/ |  / __ \_
 /_______  / \___  >|___|  /|__|_|  / \___  >|__|_ \ \___  >|__|    \___  >/_____ \(____  /\____ | (____  /
         \/      \/      \/       \/      \/      \/     \/             \/       \/     \/      \/      \/ 
This build (2023-Jul-10) features Quicksort-Magnetica, buffered dump of sorted data;
bugfix: forgot to mask the dummy threads; forgotten renaming of old function; a branch debranchified.
This tool is 100% FREE and open-source, for improvements: sanmayce@sanmayce.com, enfun!
Current priority is 0.
Size of input file: 9,138,893,323
Allocating FILE-Buffer 8715MB ...
Counting lines ... Done in 814,091 clocks, 0.81 seconds.
Number of LF-ending lines: 99,997,497
Allocating Master-Buffer (Offsets+Lengths) NumberOfLFs*8*2 = 1525MB ... Aligned to 16 bytes boundary.
Assigning pairs (of pointers and lengths) to lines ... Done in 3,451,968 clocks, 3.45 seconds.
ShortestLine = 0
LongestLine = 7,770
Sorting pointers to lines with 'Strongfool' a.k.a. 'Quicksort_Magnetica_v19_BalxchonkaForte_indirect_ZERO' ...
Pools [a-kl-gh-qr-z] = 0-19820929 19820929-30104393 30104393-67390930 67390930-99997496
Thread #1 of 4 sorting partition size=19820930
Thread #2 of 4 sorting partition size=10283465
Thread #3 of 4 sorting partition size=37286538
Thread #4 of 4 sorting partition size=32606567
Done (just sorting) in 21 seconds.
Writing sorted lines to 'Schmekeriada.txt' ... Allocating DUMP-Buffer (for 'fwrite()') 1024MB ...
Done (just writing) in 8 seconds.
Total LPS performance: 2,702,635 Lines-Per-Second
Total BPS performance: 246,997,116 Bytes-Per-Second

real    0m36,008s
user    1m30,164s
sys     0m10,749s
milovidov@milovidov-desktop:~/work/Schmekeriada/Schmekerezada_no-corpora$ less Schmekeriada.txt
milovidov@milovidov-desktop:~/work/Schmekeriada/Schmekerezada_no-corpora$ time clickhouse-local --max_threads 4 --query "SELECT c1 FROM 'URLs.tsv' ORDER BY c1 INTO OUTFILE 'URLs_sorted.tsv'" --progress

real    0m12,056s
user    0m48,826s
sys     0m11,020s
milovidov@milovidov-desktop:~/work/Schmekeriada/Schmekerezada_no-corpora$ wc -c Schmekeriada.txt URLs_sorted.tsv
 9138893323 Schmekeriada.txt
 9138893323 URLs_sorted.tsv
18277786646 total

Schmekeriada:

real    0m36,008s
user    1m30,164s
sys     0m10,749s

ClickHouse:

real    0m12,056s
user    0m48,826s
sys     0m11,020s
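
As a rough cross-check of the summary above, the ratios can be computed directly from the `time` output. This is a minimal sketch, not part of either tool; `parse_time` and `ratio` are hypothetical helpers, and the parser assumes the `XmY,ZZZs` format shown in the paste (comma as the decimal separator).

```python
import re


def parse_time(value: str) -> float:
    """Convert a bash `time` value such as '0m36,008s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)[.,](\d+)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    minutes, seconds, fraction = match.groups()
    return int(minutes) * 60 + float(f"{seconds}.{fraction}")


# Figures copied from the two runs above.
schmekeriada = {"real": "0m36,008s", "user": "1m30,164s"}
clickhouse = {"real": "0m12,056s", "user": "0m48,826s"}


def ratio(kind: str) -> float:
    """How many times longer Schmekeriada took than ClickHouse for `kind`."""
    return parse_time(schmekeriada[kind]) / parse_time(clickhouse[kind])


print(f"wall-clock ratio: {ratio('real'):.2f}x")
print(f"user CPU ratio:   {ratio('user'):.2f}x")
```

On these figures the gap is about 3.0x in wall-clock time but only about 1.85x in user CPU time.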

CPU: AMD Ryzen Threadripper PRO 3995WX; I've recompiled the tool with clang-17.

How to obtain the dataset: https://github.com/ClickHouse/ClickBench (create the hits table, then generate the URLs.tsv file as shown in the paste above).
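
The buffer sizes that Schmekeriada prints in the log above appear to follow directly from the file size and the line count. Below is a small sketch of that arithmetic. It assumes the tool reports whole mebibytes, rounded down, which is an inference from the printed numbers and not something confirmed from the source code; the function names are hypothetical.

```python
MIB = 1 << 20  # bytes per mebibyte (assumed unit behind the log's "MB")


def file_buffer_mib(file_size_bytes: int) -> int:
    """Whole mebibytes needed to hold the input file in memory."""
    return file_size_bytes // MIB


def master_buffer_mib(number_of_lfs: int) -> int:
    """Whole mebibytes for one 8-byte offset plus one 8-byte length per line,
    following the log's formula NumberOfLFs*8*2."""
    return number_of_lfs * 8 * 2 // MIB


# Figures from the URLs.tsv run above.
file_size = 9_138_893_323
line_count = 99_997_497

print("FILE-Buffer  :", file_buffer_mib(file_size), "MB")
print("Master-Buffer:", master_buffer_mib(line_count), "MB")
```

With the URLs.tsv figures this reproduces the 8715MB and 1525MB reported in the log, so the index of offsets and lengths adds roughly one sixth on top of the file itself for this dataset.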

alexey-milovidov commented 9 months ago

Note: you can also use LineAsString format instead of TSV.

Sanmayce commented 9 months ago

Thank you! Didn't know how slow my toy was, until now, grmbl.

Being 3x slower is bothersome. Could you check whether the data was cached by the OS, i.e. whether the test file was sorted twice in a row?! Also, is this ClickHouse sorter capable of sorting e.g. the latest Linux kernel or Clang source (the .tar file): https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.6.2.tar.xz https://github.com/llvm/llvm-project/archive/refs/tags/llvmorg-17.0.5.tar.gz

Wonder how Schmekerezada fares with them...

alexey-milovidov commented 9 months ago

About the page cache: in both experiments the file was cached, so no disk IO happens. We can also look at the "user" time, which counts only CPU time and excludes time spent waiting on disk IO.
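
The cached-file condition can be set up by reading the file once before timing, which is the same idea as the `cp file /dev/null` used later in this thread. Here is a minimal sketch that warms the cache and then reports wall-clock and CPU time separately for a second pass. `read_all` and `timed_read` are hypothetical helper names, and the temporary file is only a stand-in for the real input.

```python
import os
import tempfile
import time


def read_all(path: str, chunk_size: int = 1 << 20) -> int:
    """Read the whole file once and return the number of bytes read."""
    total = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            total += len(chunk)
    return total


def timed_read(path: str) -> tuple[int, float, float]:
    """Return (bytes read, wall-clock seconds, CPU seconds) for one pass.

    The CPU figure is user + system time of this process; time spent
    blocked waiting for the disk is not included in it.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    size = read_all(path)
    return size, time.perf_counter() - wall_start, time.process_time() - cpu_start


if __name__ == "__main__":
    # Stand-in file; replace with the real input, e.g. URLs.tsv.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * (4 << 20))
    try:
        read_all(tmp.name)  # first pass pulls the file into the page cache
        size, wall, cpu = timed_read(tmp.name)  # second pass should hit the cache
        print(f"{size} bytes, wall {wall:.4f}s, cpu {cpu:.4f}s")
    finally:
        os.unlink(tmp.name)
```

When the file fits in memory, wall-clock and CPU time for the second pass should be close to each other; a large gap would suggest the process was still waiting on the disk.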

About sorting the tar file - how should I extract strings from it? Should I extract it first and concatenate the source files, or should I interpret the binary (.tar.xz) file as a set of strings delimited by \n characters?

Sanmayce commented 9 months ago

Right. For such .tar tests, use the second option: just treat the file as it is, passing the .tar file as the argument.
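
Treating a binary file "as it is" amounts to splitting the raw bytes on LF and ordering the pieces. The sketch below assumes plain byte-wise comparison; the thread does not state which comparison either tool uses, although the matching checksums later on show that both produce the same order. Appending a final LF when the input lacks one mirrors the tool's log message about postfixing the last "line". `sort_lines_bytewise` is a hypothetical name.

```python
def sort_lines_bytewise(data: bytes) -> bytes:
    """Sort LF-delimited byte strings and return them LF-terminated."""
    if not data:
        return b""
    if not data.endswith(b"\n"):
        data += b"\n"  # terminate the last "line" so every line ends with LF
    lines = data.split(b"\n")[:-1]  # drop the empty piece after the final LF
    lines.sort()  # Python compares bytes objects byte by byte
    return b"\n".join(lines) + b"\n"


# Binary content is fine: no decoding happens anywhere.
sample = b"zeta\n\x7fELF\nalpha"
print(sort_lines_bytewise(sample))
```

Because nothing is decoded, NUL bytes and invalid UTF-8 are harmless here. This toy version keeps a second copy of the data, whereas the log above shows Schmekeriada sorting pointers to the lines.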

Sanmayce commented 9 months ago

My suggestion (benchmark-wise), and my own practice, is to test heavily with different files in order to spot problems such as speed brakes or hidden bugs, like the one these authors overlooked in the forking of 'sort' to Windows: https://github.com/uutils/coreutils/issues/5521

alexey-milovidov commented 9 months ago

$ time clickhouse-local --max_threads 4 --query "SELECT line FROM file('linux-6.6.2.tar.xz', LineAsString, auto, none) ORDER BY line FORMAT LineAsString" --progress > sorted1

real    0m0,395s
user    0m0,378s
sys     0m0,366s
$ time clickhouse-local --max_threads 4 --query "SELECT line FROM file('llvmorg-17.0.5.tar.gz', LineAsString, auto, none) ORDER BY line FORMAT LineAsString" --progress > sorted2

real    0m0,350s
user    0m0,461s
sys     0m0,493s
milovidov@milovidov-desktop:~/work/Schmekeriada$ time Schmekerezada_no-corpora/Schmekerezada_CLANG_16.0.1_SSE4.2_TetraThread.elf linux-6.6.2.tar.xz
   _________        .__                       __                                               .___        
  /   _____/  ____  |  |__    _____    ____  |  | __  ____ _______   ____  _____________     __| _/_____   
  \_____  \ _/ ___\ |  |  \  /     \ _/ __ \ |  |/ /_/ __ \\_  __ \_/ __ \ \___   /\__  \   / __ | \__  \  
  /        \\  \___ |   Y  \|  Y Y  \\  ___/ |    < \  ___/ |  | \/\  ___/  /    /  / __ \_/ /_/ |  / __ \_
 /_______  / \___  >|___|  /|__|_|  / \___  >|__|_ \ \___  >|__|    \___  >/_____ \(____  /\____ | (____  /
         \/      \/      \/       \/      \/      \/     \/             \/       \/     \/      \/      \/ 
This build (2023-Jul-10) features Quicksort-Magnetica, buffered dump of sorted data;
bugfix: forgot to mask the dummy threads; forgotten renaming of old function; a branch debranchified.
This tool is 100% FREE and open-source, for improvements: sanmayce@sanmayce.com, enfun!
Current priority is 0.
Size of input file: 140,047,368
Allocating FILE-Buffer 133MB ...
Counting lines ... Done in 12,833 clocks, 0.01 seconds.
Number of LF-ending lines: 546,976
Postfixing the last "line" with a LF.
Allocating Master-Buffer (Offsets+Lengths) NumberOfLFs*8*2 = 8MB ... Aligned to 16 bytes boundary.
Assigning pairs (of pointers and lengths) to lines ... Done in 41,364 clocks, 0.04 seconds.
ShortestLine = 0
LongestLine = 3,301
Sorting pointers to lines with 'Strongfool' a.k.a. 'Quicksort_Magnetica_v19_BalxchonkaForte_indirect_ZERO' ...
Pools [a-kl-gh-qr-z] = 0-275793 0-0 275793-546976 0-0
Thread #1 of 4 sorting partition size=275794
Thread #2 of 4 sorting partition size=1
Thread #3 of 4 sorting partition size=271184
Thread #4 of 4 sorting partition size=1
Done (just sorting) in 0 seconds.
Writing sorted lines to 'Schmekeriada.txt' ... Allocating DUMP-Buffer (for 'fwrite()') 1024MB ...
Done (just writing) in 0 seconds.
Total LPS performance: 546,977 Lines-Per-Second
Total BPS performance: 140,047,368 Bytes-Per-Second

real    0m0,543s
user    0m18,685s
sys     0m1,152s
milovidov@milovidov-desktop:~/work/Schmekeriada$ mv Schmeker
Schmekerezada_no-corpora/        Schmekerezada_no-corpora.tar.gz  Schmekeriada.txt                 
milovidov@milovidov-desktop:~/work/Schmekeriada$ mv Schmekeriada.txt Schmekeriada1.txt
milovidov@milovidov-desktop:~/work/Schmekeriada$ time Schmekerezada_no-corpora/Schmekerezada_CLANG_16.0.1_SSE4.2_TetraThread.elf llvmorg-17.0.5.tar.gz
   _________        .__                       __                                               .___        
  /   _____/  ____  |  |__    _____    ____  |  | __  ____ _______   ____  _____________     __| _/_____   
  \_____  \ _/ ___\ |  |  \  /     \ _/ __ \ |  |/ /_/ __ \\_  __ \_/ __ \ \___   /\__  \   / __ | \__  \  
  /        \\  \___ |   Y  \|  Y Y  \\  ___/ |    < \  ___/ |  | \/\  ___/  /    /  / __ \_/ /_/ |  / __ \_
 /_______  / \___  >|___|  /|__|_|  / \___  >|__|_ \ \___  >|__|    \___  >/_____ \(____  /\____ | (____  /
         \/      \/      \/       \/      \/      \/     \/             \/       \/     \/      \/      \/ 
This build (2023-Jul-10) features Quicksort-Magnetica, buffered dump of sorted data;
bugfix: forgot to mask the dummy threads; forgotten renaming of old function; a branch debranchified.
This tool is 100% FREE and open-source, for improvements: sanmayce@sanmayce.com, enfun!
Current priority is 0.
Size of input file: 195,001,707
Allocating FILE-Buffer 185MB ...
Counting lines ... Done in 17,794 clocks, 0.02 seconds.
Number of LF-ending lines: 745,756
Postfixing the last "line" with a LF.
Allocating Master-Buffer (Offsets+Lengths) NumberOfLFs*8*2 = 11MB ... Aligned to 16 bytes boundary.
Assigning pairs (of pointers and lengths) to lines ... Done in 57,542 clocks, 0.06 seconds.
ShortestLine = 0
LongestLine = 85,098
Sorting pointers to lines with 'Strongfool' a.k.a. 'Quicksort_Magnetica_v19_BalxchonkaForte_indirect_ZERO' ...
Pools [a-kl-gh-qr-z] = 0-376157 0-0 376157-745756 0-0
Thread #1 of 4 sorting partition size=376158
Thread #2 of 4 sorting partition size=1
Thread #3 of 4 sorting partition size=369600
Thread #4 of 4 sorting partition size=1
Done (just sorting) in 0 seconds.
Writing sorted lines to 'Schmekeriada.txt' ... Allocating DUMP-Buffer (for 'fwrite()') 1024MB ...
Done (just writing) in 0 seconds.
Total LPS performance: 745,757 Lines-Per-Second
Total BPS performance: 195,001,707 Bytes-Per-Second

real    0m0,641s
user    0m19,910s
sys     0m1,151s
milovidov@milovidov-desktop:~/work/Schmekeriada$ mv Schmekeriada.txt Schmekeriada2.txt

milovidov@milovidov-desktop:~/work/Schmekeriada$ md5sum sorted1 sorted2 Schmekeriada1.txt Schmekeriada2.txt
0388f4efee4c5d0c70d277c10d08f242  sorted1
08ca25385afcc235c87b7b31c747c425  sorted2
0388f4efee4c5d0c70d277c10d08f242  Schmekeriada1.txt
08ca25385afcc235c87b7b31c747c425  Schmekeriada2.txt

Sanmayce commented 9 months ago

Thank you, well done with the checksums. Excuse me, though: I meant the uncompressed .tar files, not the compressed archives (.tar.xz, .tar.gz), since that way we test many more lines. Please rerun with both files decompressed to .tar. Thanks.
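
The checksum comparison above can also be scripted. This is a minimal sketch using Python's standard `hashlib`; `file_digest` and `same_output` are hypothetical helper names, and the file names in the trailing comment are the ones from this thread.

```python
import hashlib


def file_digest(path: str, algorithm: str = "md5", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large outputs need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_output(path_a: str, path_b: str, algorithm: str = "md5") -> bool:
    """True when both files hash to the same digest."""
    return file_digest(path_a, algorithm) == file_digest(path_b, algorithm)


# Example call with the file names used in this thread:
# same_output("sorted1", "Schmekeriada1.txt")
```

Equal digests show that the two sorted files are byte-for-byte identical, which covers both the order of the lines and the handling of the final line terminator.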

alexey-milovidov commented 9 months ago

milovidov@milovidov-desktop:~/work/Schmekeriada$ time clickhouse-local --max_threads 4 --query "SELECT line FROM file('linux-6.6.2.tar', LineAsString) ORDER BY line FORMAT LineAsString" --progress > sorted1

real    0m3,604s
user    0m10,499s
sys     0m2,074s

milovidov@milovidov-desktop:~/work/Schmekeriada$ time clickhouse-local --max_threads 4 --query "SELECT line FROM file('llvmorg-17.0.5.tar', LineAsString) ORDER BY line FORMAT LineAsString" --progress > sorted2

real    0m3,549s
user    0m8,699s
sys     0m2,297s

milovidov@milovidov-desktop:~/work/Schmekeriada$ time Schmekerezada_no-corpora/Schmekerezada_CLANG_16.0.1_SSE4.2_TetraThread.elf linux-6.6.2.tar
   _________        .__                       __                                               .___        
  /   _____/  ____  |  |__    _____    ____  |  | __  ____ _______   ____  _____________     __| _/_____   
  \_____  \ _/ ___\ |  |  \  /     \ _/ __ \ |  |/ /_/ __ \\_  __ \_/ __ \ \___   /\__  \   / __ | \__  \  
  /        \\  \___ |   Y  \|  Y Y  \\  ___/ |    < \  ___/ |  | \/\  ___/  /    /  / __ \_/ /_/ |  / __ \_
 /_______  / \___  >|___|  /|__|_|  / \___  >|__|_ \ \___  >|__|    \___  >/_____ \(____  /\____ | (____  /
         \/      \/      \/       \/      \/      \/     \/             \/       \/     \/      \/      \/ 
This build (2023-Jul-10) features Quicksort-Magnetica, buffered dump of sorted data;
bugfix: forgot to mask the dummy threads; forgotten renaming of old function; a branch debranchified.
This tool is 100% FREE and open-source, for improvements: sanmayce@sanmayce.com, enfun!
Current priority is 0.
Size of input file: 1,419,171,840
Allocating FILE-Buffer 1353MB ...
Counting lines ... Done in 128,360 clocks, 0.13 seconds.
Number of LF-ending lines: 37,052,214
Postfixing the last "line" with a LF.
Allocating Master-Buffer (Offsets+Lengths) NumberOfLFs*8*2 = 565MB ... Aligned to 16 bytes boundary.
Assigning pairs (of pointers and lengths) to lines ... Done in 763,257 clocks, 0.76 seconds.
ShortestLine = 0
LongestLine = 50,950
Sorting pointers to lines with 'Strongfool' a.k.a. 'Quicksort_Magnetica_v19_BalxchonkaForte_indirect_ZERO' ...
Pools [a-kl-gh-qr-z] = 0-8332234 8332234-15997967 15997967-28239162 28239162-37052214
Thread #1 of 4 sorting partition size=8332235
Thread #2 of 4 sorting partition size=7665734
Thread #3 of 4 sorting partition size=12241196
Thread #4 of 4 sorting partition size=8813053
Done (just sorting) in 6 seconds.
Writing sorted lines to 'Schmekeriada.txt' ... Allocating DUMP-Buffer (for 'fwrite()') 1024MB ...
Done (just writing) in 2 seconds.
Total LPS performance: 3,705,221 Lines-Per-Second
Total BPS performance: 141,917,184 Bytes-Per-Second

real    0m8,770s
user    0m55,919s
sys     0m3,454s

milovidov@milovidov-desktop:~/work/Schmekeriada$ mv Schmekeriada.txt Schmekeriada1.txt

milovidov@milovidov-desktop:~/work/Schmekeriada$ time Schmekerezada_no-corpora/Schmekerezada_CLANG_16.0.1_SSE4.2_TetraThread.elf llvmorg-17.0.5.tar
   _________        .__                       __                                               .___        
  /   _____/  ____  |  |__    _____    ____  |  | __  ____ _______   ____  _____________     __| _/_____   
  \_____  \ _/ ___\ |  |  \  /     \ _/ __ \ |  |/ /_/ __ \\_  __ \_/ __ \ \___   /\__  \   / __ | \__  \  
  /        \\  \___ |   Y  \|  Y Y  \\  ___/ |    < \  ___/ |  | \/\  ___/  /    /  / __ \_/ /_/ |  / __ \_
 /_______  / \___  >|___|  /|__|_|  / \___  >|__|_ \ \___  >|__|    \___  >/_____ \(____  /\____ | (____  /
         \/      \/      \/       \/      \/      \/     \/             \/       \/     \/      \/      \/ 
This build (2023-Jul-10) features Quicksort-Magnetica, buffered dump of sorted data;
bugfix: forgot to mask the dummy threads; forgotten renaming of old function; a branch debranchified.
This tool is 100% FREE and open-source, for improvements: sanmayce@sanmayce.com, enfun!
Current priority is 0.
Size of input file: 1,591,029,760
Allocating FILE-Buffer 1517MB ...
Counting lines ... Done in 125,972 clocks, 0.13 seconds.
Number of LF-ending lines: 30,313,469
Postfixing the last "line" with a LF.
Allocating Master-Buffer (Offsets+Lengths) NumberOfLFs*8*2 = 462MB ... Aligned to 16 bytes boundary.
Assigning pairs (of pointers and lengths) to lines ... Done in 724,240 clocks, 0.72 seconds.
ShortestLine = 0
LongestLine = 2,099,162
Sorting pointers to lines with 'Strongfool' a.k.a. 'Quicksort_Magnetica_v19_BalxchonkaForte_indirect_ZERO' ...
Pools [a-kl-gh-qr-z] = 0-16828749 16828749-25704205 25704205-28397206 28397206-30313469
Thread #1 of 4 sorting partition size=16828750
Thread #2 of 4 sorting partition size=8875457
Thread #3 of 4 sorting partition size=2693002
Thread #4 of 4 sorting partition size=1916264
Done (just sorting) in 6 seconds.
Writing sorted lines to 'Schmekeriada.txt' ... Allocating DUMP-Buffer (for 'fwrite()') 1024MB ...
Done (just writing) in 2 seconds.
Total LPS performance: 2,755,770 Lines-Per-Second
Total BPS performance: 144,639,069 Bytes-Per-Second

real    0m9,846s
user    0m48,752s
sys     0m3,517s

milovidov@milovidov-desktop:~/work/Schmekeriada$ mv Schmekeriada.txt Schmekeriada2.txt

milovidov@milovidov-desktop:~/work/Schmekeriada$ md5sum sorted1 sorted2 Schmekeriada1.txt Schmekeriada2.txt
784832497c242f03de5a74e71ea7e57c  sorted1
8b73edf6e7327732233dcf911093670b  sorted2
784832497c242f03de5a74e71ea7e57c  Schmekeriada1.txt
8b73edf6e7327732233dcf911093670b  Schmekeriada2.txt

alexey-milovidov commented 9 months ago

By the way, you can download ClickHouse and run it yourself:

curl https://clickhouse.com/ | sh

Sanmayce commented 9 months ago

Many thanks, Alexey. I didn't know how to run it, but managed to on my main computer 'Djudjeto', a Thinkpad 11e gen5 running Fedora 38 (NVMe working partition, ext4 filesystem, Celeron N4100 CPU):

The CLANG:

[sanmayce@djudjeto3 Schmekeriada-main]$ cp llvm-project-llvmorg-17.0.5.tar /dev/null 
[sanmayce@djudjeto3 Schmekeriada-main]$ perf stat -d ./Schmekerezada_CLANG_16.0.1_SSE4.2_TetraThread.elf llvm-project-llvmorg-17.0.5.tar
   _________        .__                       __                                               .___        
  /   _____/  ____  |  |__    _____    ____  |  | __  ____ _______   ____  _____________     __| _/_____   
  \_____  \ _/ ___\ |  |  \  /     \ _/ __ \ |  |/ /_/ __ \\_  __ \_/ __ \ \___   /\__  \   / __ | \__  \  
  /        \\  \___ |   Y  \|  Y Y  \\  ___/ |    < \  ___/ |  | \/\  ___/  /    /  / __ \_/ /_/ |  / __ \_
 /_______  / \___  >|___|  /|__|_|  / \___  >|__|_ \ \___  >|__|    \___  >/_____ \(____  /\____ | (____  /
         \/      \/      \/       \/      \/      \/     \/             \/       \/     \/      \/      \/ 
This build (2023-Jul-10) features Quicksort-Magnetica, buffered dump of sorted data;
bugfix: forgot to mask the dummy threads; forgotten renaming of old function; a branch debranchified.
This tool is 100% FREE and open-source, for improvements: sanmayce@sanmayce.com, enfun!
Current priority is 0.
Size of input file: 1,591,029,760
Allocating FILE-Buffer 1517MB ...
Counting lines ... Done in 389,133 clocks, 0.39 seconds.
Number of LF-ending lines: 30,313,469
Postfixing the last "line" with a LF.
Allocating Master-Buffer (Offsets+Lengths) NumberOfLFs*8*2 = 462MB ... Aligned to 16 bytes boundary.
Assigning pairs (of pointers and lengths) to lines ... Done in 3,106,114 clocks, 3.11 seconds.
ShortestLine = 0
LongestLine = 2,099,162
Sorting pointers to lines with 'Strongfool' a.k.a. 'Quicksort_Magnetica_v19_BalxchonkaForte_indirect_ZERO' ...
Pools [a-kl-gh-qr-z] = 0-16828749 16828749-25704205 25704205-28397206 28397206-30313469
Thread #1 of 4 sorting partition size=16828750
Thread #2 of 4 sorting partition size=8875457
Thread #3 of 4 sorting partition size=2693002
Thread #4 of 4 sorting partition size=1916264
Done (just sorting) in 21 seconds.
Writing sorted lines to 'Schmekeriada.txt' ... Allocating DUMP-Buffer (for 'fwrite()') 1024MB ...
Done (just writing) in 20 seconds.
Total LPS performance: 631,530 Lines-Per-Second
Total BPS performance: 33,146,453 Bytes-Per-Second

 Performance counter stats for './Schmekerezada_CLANG_16.0.1_SSE4.2_TetraThread.elf llvm-project-llvmorg-17.0.5.tar':

         58,124.21 msec task-clock:u                     #    1.219 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           552,771      page-faults:u                    #    9.510 K/sec                     
    99,346,530,054      cycles:u                         #    1.709 GHz                         (57.13%)
    54,607,426,998      instructions:u                   #    0.55  insn per cycle              (71.43%)
    11,949,561,573      branches:u                       #  205.587 M/sec                       (71.43%)
       365,335,187      branch-misses:u                  #    3.06% of all branches             (71.46%)
    11,735,684,799      L1-dcache-loads:u                #  201.907 M/sec                       (71.46%)
   <not supported>      L1-dcache-load-misses:u                                               
       921,863,815      LLC-loads:u                      #   15.860 M/sec                       (71.41%)
       323,576,230      LLC-load-misses:u                #   35.10% of all L1-icache accesses   (57.12%)

      47.689642745 seconds time elapsed

      43.424274000 seconds user
      13.183393000 seconds sys

[sanmayce@djudjeto3 Schmekeriada-main]$ perf stat -d ./clickhouse local --max_threads 4 --query "SELECT line FROM file('llvm-project-llvmorg-17.0.5.tar', LineAsString) ORDER BY line FORMAT LineAsString" --progress> sorted2

 Performance counter stats for './clickhouse local --max_threads 4 --query SELECT line FROM file('llvm-project-llvmorg-17.0.5.tar', LineAsString) ORDER BY line FORMAT LineAsString --progress':

         47,206.61 msec task-clock:u                     #    1.696 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           828,396      page-faults:u                    #   17.548 K/sec                     
    69,215,334,173      cycles:u                         #    1.466 GHz                         (57.17%)
    46,690,210,521      instructions:u                   #    0.67  insn per cycle              (71.44%)
     6,970,013,385      branches:u                       #  147.649 M/sec                       (71.41%)
       324,041,894      branch-misses:u                  #    4.65% of all branches             (71.41%)
    12,817,172,169      L1-dcache-loads:u                #  271.512 M/sec                       (71.41%)
   <not supported>      L1-dcache-load-misses:u                                               
       799,358,711      LLC-loads:u                      #   16.933 M/sec                       (71.43%)
       148,117,026      LLC-load-misses:u                #   18.53% of all L1-icache accesses   (57.19%)

      27.830960020 seconds time elapsed

      30.403371000 seconds user
      15.278792000 seconds sys

[sanmayce@djudjeto3 Schmekeriada-main]$ sha1sum Schmekeriada.txt 
de949032fd8cec7dcc680a6b981eef89b58a4639  Schmekeriada.txt
[sanmayce@djudjeto3 Schmekeriada-main]$ sha1sum sorted2
de949032fd8cec7dcc680a6b981eef89b58a4639  sorted2
[sanmayce@djudjeto3 Schmekeriada-main]$ 

And the kernel:

[sanmayce@djudjeto3 Schmekeriada-main]$ cp linux-6.6.2.tar /dev/null 
[sanmayce@djudjeto3 Schmekeriada-main]$ perf stat -d ./Schmekerezada_CLANG_16.0.1_SSE4.2_TetraThread.elf linux-6.6.2.tar
   _________        .__                       __                                               .___        
  /   _____/  ____  |  |__    _____    ____  |  | __  ____ _______   ____  _____________     __| _/_____   
  \_____  \ _/ ___\ |  |  \  /     \ _/ __ \ |  |/ /_/ __ \\_  __ \_/ __ \ \___   /\__  \   / __ | \__  \  
  /        \\  \___ |   Y  \|  Y Y  \\  ___/ |    < \  ___/ |  | \/\  ___/  /    /  / __ \_/ /_/ |  / __ \_
 /_______  / \___  >|___|  /|__|_|  / \___  >|__|_ \ \___  >|__|    \___  >/_____ \(____  /\____ | (____  /
         \/      \/      \/       \/      \/      \/     \/             \/       \/     \/      \/      \/ 
This build (2023-Jul-10) features Quicksort-Magnetica, buffered dump of sorted data;
bugfix: forgot to mask the dummy threads; forgotten renaming of old function; a branch debranchified.
This tool is 100% FREE and open-source, for improvements: sanmayce@sanmayce.com, enfun!
Current priority is 0.
Size of input file: 1,419,171,840
Allocating FILE-Buffer 1353MB ...
Counting lines ... Done in 347,483 clocks, 0.35 seconds.
Number of LF-ending lines: 37,052,214
Postfixing the last "line" with a LF.
Allocating Master-Buffer (Offsets+Lengths) NumberOfLFs*8*2 = 565MB ... Aligned to 16 bytes boundary.
Assigning pairs (of pointers and lengths) to lines ... Done in 3,056,087 clocks, 3.06 seconds.
ShortestLine = 0
LongestLine = 50,950
Sorting pointers to lines with 'Strongfool' a.k.a. 'Quicksort_Magnetica_v19_BalxchonkaForte_indirect_ZERO' ...
Pools [a-kl-gh-qr-z] = 0-8332234 8332234-15997967 15997967-28239162 28239162-37052214
Thread #1 of 4 sorting partition size=8332235
Thread #2 of 4 sorting partition size=7665734
Thread #3 of 4 sorting partition size=12241196
Thread #4 of 4 sorting partition size=8813053
Done (just sorting) in 20 seconds.
Writing sorted lines to 'Schmekeriada.txt' ... Allocating DUMP-Buffer (for 'fwrite()') 1024MB ...
Done (just writing) in 23 seconds.
Total LPS performance: 741,044 Lines-Per-Second
Total BPS performance: 28,383,436 Bytes-Per-Second

 Performance counter stats for './Schmekerezada_CLANG_16.0.1_SSE4.2_TetraThread.elf linux-6.6.2.tar':

         66,476.23 msec task-clock:u                     #    1.344 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           407,119      page-faults:u                    #    6.124 K/sec                     
   126,650,772,073      cycles:u                         #    1.905 GHz                         (57.12%)
    58,991,118,195      instructions:u                   #    0.47  insn per cycle              (71.41%)
    12,511,758,528      branches:u                       #  188.214 M/sec                       (71.41%)
       456,544,390      branch-misses:u                  #    3.65% of all branches             (71.43%)
    12,732,982,854      L1-dcache-loads:u                #  191.542 M/sec                       (71.44%)
   <not supported>      L1-dcache-load-misses:u                                               
     1,039,661,608      LLC-loads:u                      #   15.640 M/sec                       (71.45%)
       443,010,631      LLC-load-misses:u                #   42.61% of all L1-icache accesses   (57.15%)

      49.451142383 seconds time elapsed

      55.412086000 seconds user
       9.104981000 seconds sys

[sanmayce@djudjeto3 Schmekeriada-main]$ perf stat -d ./clickhouse local --max_threads 4 --query "SELECT line FROM file('linux-6.6.2.tar', LineAsString) ORDER BY line FORMAT LineAsString" --progress> sorted1

 Performance counter stats for './clickhouse local --max_threads 4 --query SELECT line FROM file('linux-6.6.2.tar', LineAsString) ORDER BY line FORMAT LineAsString --progress':

         52,592.04 msec task-clock:u                     #    1.862 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           768,348      page-faults:u                    #   14.610 K/sec                     
    77,819,815,410      cycles:u                         #    1.480 GHz                         (57.15%)
    57,870,865,142      instructions:u                   #    0.74  insn per cycle              (71.43%)
     8,377,356,476      branches:u                       #  159.289 M/sec                       (71.43%)
       381,391,794      branch-misses:u                  #    4.55% of all branches             (71.42%)
    16,309,417,304      L1-dcache-loads:u                #  310.112 M/sec                       (71.46%)
   <not supported>      L1-dcache-load-misses:u                                               
       982,799,428      LLC-loads:u                      #   18.687 M/sec                       (71.43%)
       123,509,169      LLC-load-misses:u                #   12.57% of all L1-icache accesses   (57.12%)

      28.247135638 seconds time elapsed

      34.108460000 seconds user
      17.015186000 seconds sys

[sanmayce@djudjeto3 Schmekeriada-main]$ sha1sum Schmekeriada.txt 
bec5b8e3d9a01462564989e7ae225156d964e1b6  Schmekeriada.txt
[sanmayce@djudjeto3 Schmekeriada-main]$ sha1sum sorted1 
bec5b8e3d9a01462564989e7ae225156d964e1b6  sorted1
[sanmayce@djudjeto3 Schmekeriada-main]$ 

In conclusion, 'clickhouse' is anywhere from about 1.7x (47.6/27.8 = 1.71x and 49.4/28.2 = 1.75x in the runs above) to 3x faster than Schmekerezada. Superwell done!