Closed Sanmayce closed 3 years ago
Let us see who is the fastest 128bit hasher (for 128 bytes long keys, in particular) ...
The package (all sources included) allowing to reproduce all the runs is freely downloadable at: www.sanmayce.com/Gumbotron_vs_XXH128.zip
The 200M keys were chosen in order to fit into 32GB machines, something like 25GB needed.
On laptop 'Compressionette' (Kaby Lake i5-7200U 2.5GHz (3.1GHz max turbo) 36GB DDR4 2133MHz, Windows 10) Gumbotron_YMM hashes 4,487/3,160= 1.41x faster than XXH128:
G:\Lookupperorama_r13\COLLISION_Hashliner>GvsXXH.bat
G:\Lookupperorama_r13\COLLISION_Hashliner>if exist 200000000.KnightTours.txt goto Skip
G:\Lookupperorama_r13\COLLISION_Hashliner>BenchHashingLines_Gumbotron.exe 200000000.KnightTours.txt
Hashing 200,000,000 lines/keys, 128 bytes each, in RAM ...
The first key has hash:
f4b027e3ab
Total time: 4,907 clocks.
Total time: 3,186 clocks.
Total time: 3,162 clocks.
Total time: 3,160 clocks.
Total time: 3,185 clocks.
Total time: 3,174 clocks.
Total time: 3,160 clocks.
Total time: 3,164 clocks.
Total time: 3,167 clocks.
Total time: 3,174 clocks.
Total time: 3,183 clocks.
Total time: 3,174 clocks.
Total time: 3,165 clocks.
Total time: 3,172 clocks.
Total time: 3,161 clocks.
Total time: 3,170 clocks.
Total time: 3,165 clocks.
Total time (BEST RUN): 3,160 clocks.
G:\Lookupperorama_r13\COLLISION_Hashliner>BenchHashingLines_XXH128.exe 200000000.KnightTours.txt
Hashing 200,000,000 lines/keys, 128 bytes each, in RAM ...
The first key has hash:
c703b0bd77
Total time: 6,021 clocks.
Total time: 4,492 clocks.
Total time: 4,532 clocks.
Total time: 4,509 clocks.
Total time: 4,492 clocks.
Total time: 4,490 clocks.
Total time: 4,495 clocks.
Total time: 4,494 clocks.
Total time: 4,492 clocks.
Total time: 4,491 clocks.
Total time: 4,496 clocks.
Total time: 4,487 clocks.
Total time: 4,491 clocks.
Total time: 4,493 clocks.
Total time: 4,496 clocks.
Total time: 4,494 clocks.
Total time: 4,512 clocks.
Total time (BEST RUN): 4,487 clocks.
G:\Lookupperorama_r13\COLLISION_Hashliner>
On laptop 'Brutalitto' (Renoir AMD 4800H max turbo 4.3GHz, 64GB DDR4 3200MHz, Windows 10) Gumbotron_YMM hashes 3,266/2,484= 1.31x faster than XXH128:
D:\Lookupperorama_r13\COLLISION_Hashliner>GvsXXH.bat
D:\Lookupperorama_r13\COLLISION_Hashliner>if exist 200000000.KnightTours.txt goto Skip
D:\Lookupperorama_r13\COLLISION_Hashliner>BenchHashingLines_Gumbotron.exe 200000000.KnightTours.txt
Hashing 200,000,000 lines/keys, 128 bytes each, in RAM ...
The first key has hash:
f4b027e3ab
Total time: 3,313 clocks.
Total time: 2,485 clocks.
Total time: 2,500 clocks.
Total time: 2,484 clocks.
Total time: 2,501 clocks.
Total time: 2,500 clocks.
Total time: 2,500 clocks.
Total time: 2,500 clocks.
Total time: 2,500 clocks.
Total time: 2,500 clocks.
Total time: 2,485 clocks.
Total time: 2,500 clocks.
Total time: 2,500 clocks.
Total time: 2,516 clocks.
Total time: 2,500 clocks.
Total time: 2,485 clocks.
Total time: 2,500 clocks.
Total time (BEST RUN): 2,484 clocks.
D:\Lookupperorama_r13\COLLISION_Hashliner>BenchHashingLines_XXH128.exe 200000000.KnightTours.txt
Hashing 200,000,000 lines/keys, 128 bytes each, in RAM ...
The first key has hash:
c703b0bd77
Total time: 4,204 clocks.
Total time: 3,359 clocks.
Total time: 3,360 clocks.
Total time: 3,360 clocks.
Total time: 3,359 clocks.
Total time: 3,360 clocks.
Total time: 3,360 clocks.
Total time: 3,359 clocks.
Total time: 3,375 clocks.
Total time: 3,266 clocks.
Total time: 3,349 clocks.
Total time: 3,360 clocks.
Total time: 3,375 clocks.
Total time: 3,360 clocks.
Total time: 3,390 clocks.
Total time: 3,376 clocks.
Total time: 3,375 clocks.
Total time (BEST RUN): 3,266 clocks.
D:\Lookupperorama_r13\COLLISION_Hashliner>
The actual (used in the benchmark) AVX code is given below (the main loop i.e. the handler of multiples of 64bytes is 1ab-096+6=283 bytes long, in 45 instructions):
; mark_description "Intel(R) C++ Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140726";
; mark_description "-O3 -arch:avx -FeBenchHashingLines_Gumbotron.exe -FAcs -D_WIN32_ENVIRONMENT_ -D_N_DDAES -D_5";
DoubleDeuceAES_Gumbotron_YMM PROC
...
.B7.2::
00096 c4 41 7a 6f 40
20 vmovdqu xmm8, XMMWORD PTR [32+r8]
0009c c4 41 7a 6f 38 vmovdqu xmm15, XMMWORD PTR [r8]
000a1 49 89 c1 mov r9, rax
000a4 48 83 c2 c0 add rdx, -64
000a8 48 ff c8 dec rax
000ab c4 43 3d 18 68
30 01 vinsertf128 ymm13, ymm8, XMMWORD PTR [48+r8], 1
000b2 c4 e2 15 00 fb vpshufb ymm7, ymm13, ymm3
000b7 c4 41 7e 7f 6d
20 vmovdqu YMMWORD PTR [32+r13], ymm13
000bd c4 63 fd 00 c7
4e vpermq ymm8, ymm7, 78
000c3 c4 41 7e 7f 45
40 vmovdqu YMMWORD PTR [64+r13], ymm8
000c9 c4 c2 71 dc 4d
40 vaesenc xmm1, xmm1, XMMWORD PTR [64+r13]
000cf c4 c2 71 dc 4d
50 vaesenc xmm1, xmm1, XMMWORD PTR [80+r13]
000d5 c4 43 05 18 70
10 01 vinsertf128 ymm14, ymm15, XMMWORD PTR [16+r8], 1
000dc 49 83 c0 40 add r8, 64
000e0 c4 e2 0d 00 fb vpshufb ymm7, ymm14, ymm3
000e5 c4 41 0d 60 fd vpunpcklbw ymm15, ymm14, ymm13
000ea c4 41 0d 68 ed vpunpckhbw ymm13, ymm14, ymm13
000ef c4 41 7e 7f 75
00 vmovdqu YMMWORD PTR [r13], ymm14
000f5 c4 41 7e 7f bd
80 00 00 00 vmovdqu YMMWORD PTR [128+r13], ymm15
000fe c4 41 7e 7f ad
a0 00 00 00 vmovdqu YMMWORD PTR [160+r13], ymm13
00107 c4 e3 fd 00 ff
4e vpermq ymm7, ymm7, 78
0010d c5 3d 60 f7 vpunpcklbw ymm14, ymm8, ymm7
00111 c4 c1 7e 7f 7d
60 vmovdqu YMMWORD PTR [96+r13], ymm7
00117 c5 bd 68 ff vpunpckhbw ymm7, ymm8, ymm7
0011b c4 41 7e 7f b5
c0 00 00 00 vmovdqu YMMWORD PTR [192+r13], ymm14
00124 c4 c1 7e 7f bd
e0 00 00 00 vmovdqu YMMWORD PTR [224+r13], ymm7
0012d c4 c2 51 dc 6d
00 vaesenc xmm5, xmm5, XMMWORD PTR [r13]
00133 c4 c2 59 dc a5
80 00 00 00 vaesenc xmm4, xmm4, XMMWORD PTR [128+r13]
0013c c4 c2 51 dc 6d
10 vaesenc xmm5, xmm5, XMMWORD PTR [16+r13]
00142 c4 c2 69 dc 95
c0 00 00 00 vaesenc xmm2, xmm2, XMMWORD PTR [192+r13]
0014b c4 c2 59 dc a5
a0 00 00 00 vaesenc xmm4, xmm4, XMMWORD PTR [160+r13]
00154 c4 c2 51 dc 7d
20 vaesenc xmm7, xmm5, XMMWORD PTR [32+r13]
0015a c4 42 71 dc 45
60 vaesenc xmm8, xmm1, XMMWORD PTR [96+r13]
00160 c4 c2 69 dc 95
e0 00 00 00 vaesenc xmm2, xmm2, XMMWORD PTR [224+r13]
00169 c4 42 59 dc ad
90 00 00 00 vaesenc xmm13, xmm4, XMMWORD PTR [144+r13]
00172 c4 42 41 dc 7d
30 vaesenc xmm15, xmm7, XMMWORD PTR [48+r13]
00178 c4 c2 39 dc 4d
70 vaesenc xmm1, xmm8, XMMWORD PTR [112+r13]
0017e c4 42 69 dc b5
d0 00 00 00 vaesenc xmm14, xmm2, XMMWORD PTR [208+r13]
00187 c4 c2 11 dc a5
b0 00 00 00 vaesenc xmm4, xmm13, XMMWORD PTR [176+r13]
00190 c4 e2 01 dc e9 vaesenc xmm5, xmm15, xmm1
00195 c4 c2 09 dc 95
f0 00 00 00 vaesenc xmm2, xmm14, XMMWORD PTR [240+r13]
0019e c4 e2 51 dc fc vaesenc xmm7, xmm5, xmm4
001a3 c4 e2 41 dc ea vaesenc xmm5, xmm7, xmm2
001a8 4d 85 c9 test r9, r9
001ab 0f 85 e5 fe ff
ff jne .B7.2
...
DoubleDeuceAES_Gumbotron_YMM ENDP
To me, above fragment is the fastest when comes to short keys (maybe up to 512 bytes) and multiples of 64 bytes, isn't it?
Okay, the AVX2 mainloop (the handler of multiples of 64 bytes) code is only 40 instructions (icc 19.0.0 -O3 -mavx2), should be the fastest: https://godbolt.org/z/oaW3zGcv5
vmovdqu ymm10, YMMWORD PTR [rdi] #94.43
dec rax #89.9
vmovdqu ymm11, YMMWORD PTR [32+rdi] #95.43
vpshufb ymm8, ymm10, ymm0 #101.12
vpshufb ymm7, ymm11, ymm0 #100.12
vpunpcklbw ymm9, ymm10, ymm11 #191.12
vpunpckhbw ymm12, ymm10, ymm11 #192.12
vmovdqu YMMWORD PTR [64+rsp], ymm9 #191.4
vmovdqu YMMWORD PTR [96+rsp], ymm12 #192.4
vpermq ymm14, ymm7, 78 #104.9
add rsi, -64 #494.22
vpermq ymm15, ymm8, 78 #105.9
vpunpcklbw ymm13, ymm14, ymm15 #253.12
vpunpckhbw ymm7, ymm14, ymm15 #254.12
vmovdqu YMMWORD PTR [rsp], ymm14 #104.1
vmovdqu YMMWORD PTR [32+rsp], ymm15 #105.1
vmovdqu YMMWORD PTR [128+rsp], ymm13 #253.4
vmovdqu YMMWORD PTR [160+rsp], ymm7 #254.4
vaesenc xmm1, xmm1, XMMWORD PTR [rdi] #452.12
vaesenc xmm3, xmm3, XMMWORD PTR [rsp] #454.13
vaesenc xmm4, xmm4, XMMWORD PTR [64+rsp] #457.12
vaesenc xmm3, xmm3, XMMWORD PTR [16+rsp] #464.13
vaesenc xmm1, xmm1, XMMWORD PTR [16+rdi] #462.12
vaesenc xmm7, xmm1, XMMWORD PTR [32+rdi] #472.12
vaesenc xmm6, xmm6, XMMWORD PTR [128+rsp] #459.12
vaesenc xmm4, xmm4, XMMWORD PTR [96+rsp] #467.12
vaesenc xmm8, xmm3, XMMWORD PTR [32+rsp] #474.13
vaesenc xmm6, xmm6, XMMWORD PTR [160+rsp] #469.12
vaesenc xmm9, xmm4, XMMWORD PTR [80+rsp] #477.12
vaesenc xmm3, xmm8, XMMWORD PTR [48+rsp] #484.13
vaesenc xmm11, xmm7, XMMWORD PTR [48+rdi] #482.12
add rdi, 64 #89.19
vaesenc xmm10, xmm6, XMMWORD PTR [144+rsp] #479.12
vaesenc xmm4, xmm9, XMMWORD PTR [112+rsp] #487.12
vaesenc xmm12, xmm11, xmm3 #491.12
vaesenc xmm6, xmm10, XMMWORD PTR [176+rsp] #489.12
vaesenc xmm13, xmm12, xmm4 #492.12
vaesenc xmm1, xmm13, xmm6 #493.12
cmp rax, -1 #89.9
jne ..B2.4 # Prob 82% #89.9
Didn't have time to benchmark the AVX2, currently don't see how to speed it up more...
Enter Gumbotron (a.k.a. DoubleDeuceAES_128bits) ...
My not-so-thorough runs below show that the fastest 128bit hasher is XXH3_128bits, known to me, yet: In my view, my Gumbotron is faster, when:
Imagine the usecase where 1 terakeys are enforced (trillion, yes). And those keys are 64 bytes or 128 bytes or 256 bytes long, and they have to be put into leaves of Bayer-Trees, the size becomes nasty unless they are "compressed", for instance 1Terakeys x 64 bytes = 64TB, but compressed only 16TB. Therefore, I wrote Gumbotron.
The need for speed and "lossy compression" (i.e. shrinking keys) led me to putting a lookupper and a shrinker under one hood, namely the 128bit hasher DoubleDeuceAES_128bits. There is no principal distinction between a lookupper and a shrinker, they both are hashers, but the latter serves as a checksum whereas the former as a hashtable.
The benchmark package (allowing to reproduce all the stuff) is freely downloadable with all the sources and binaries: www.sanmayce.com/Lookupperorama_r11.zip www.sanmayce.com/Gumbotron_logo.pdf
The benchmark is of two parts:
As a quick test I chose randomly two testfiles - Cihai and Judaica.
Testfile: KAZE_(Dictionary_SpecificationLanguage(ABBYY_Software_House))_Hanyu_Cihai_newSea-of-Words(Zho-Zho).dsl (42,920,232 bytes) Testmachine: Testmachine: laptop 'Brutalitto' AMD 4800H max turbo 4.3GHz, 64GB DDR4 3200MHz, Windows 10 Hashtable: 26bit, i.e. 67,108,864 slots, greater than (42,920,232 bytes), since in case of perfect hasher - slots should be more than the keys (could be all unique) at each position
Note1: The second column houses the cumulative value for all collisions, the collisions for all orders 4..64 were summed, that is. Note2: Folding of those 128bits should lessen the collisions.
Testfile: TERAPIG_EncyclopaediaJudaica(in_22_volumes)_TXT.tar (107,784,192 bytes) Testmachine: Testmachine: laptop 'Brutalitto' AMD 4800H max turbo 4.3GHz, 64GB DDR4 3200MHz, Windows 10 Hashtable: 27bit, i.e. 134,217,728 slots, greater than (107,784,192 bytes), since in case of perfect hasher - slots should be more than the keys (could be all unique) at each position
Another twist, in order to test collisions, here comes my 1 trillion 128bytes long keys testbed, since no enough memory is available, it was run as 1 billion.
Testset: "A billion Knight-Tours variants (each KT with 256 variants, the KT itself omitted) - each 128 bytes long" Testfile: 1000000000.KnightTours.txt (130,000,000,000 bytes)
The name of the game - hashing all lines and taking either 5 bytes or 6,7,8 bytes from the hash.
This is how the console looks like:
Note: The first column houses how many occurrences of the following hash are there.
TO-DO: Wish running it with 1,000,000,000,000 (did it already, but in other game) keys - could help evaluating how badly a 128bit hash results in collision(s)...
And finally the function itself: https://godbolt.org/z/1o38b4W1K
// gcc 10.1 -O3 -mavx -maes
// icc 19.0.0 -O3 -mavx
Hope, someone will improve on it and share.