Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/

Feedback: Speed/Collision Benchmark - Gumbotron vs XXH3 #568

Closed Sanmayce closed 3 years ago

Sanmayce commented 3 years ago

Enter Gumbotron (a.k.a. DoubleDeuceAES_128bits) ...

My not-so-thorough runs below show that XXH3_128bits is the fastest 128-bit hasher known to me. Yet, in my view, my Gumbotron is faster when:

Imagine a use case with 1 terakeys (a trillion, yes). The keys are 64, 128 or 256 bytes long and have to be put into the leaves of B-trees (Bayer trees). The size becomes nasty unless the keys are "compressed": for instance 1 terakeys x 64 bytes = 64 TB, but only 16 TB once each key is shrunk to a 16-byte (128-bit) hash. Therefore, I wrote Gumbotron.

The need for speed and for "lossy compression" (i.e. shrinking keys) led me to put a lookupper and a shrinker under one hood, namely the 128-bit hasher DoubleDeuceAES_128bits. There is no distinction in principle between a lookupper and a shrinker: both are hashers, but the former feeds a hashtable whereas the latter serves as a checksum.
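The storage arithmetic behind the "compression" can be checked with a few lines of C. This is only an illustrative sketch: `storage_bytes` is a hypothetical helper, not part of the benchmark package, and TB is taken as 10^12 bytes.

```c
#include <stdint.h>

/* Hypothetical helper: total bytes needed to store key_count keys
   of bytes_per_key bytes each (no tree overhead counted). */
static uint64_t storage_bytes(uint64_t key_count, uint64_t bytes_per_key)
{
    /* 10^12 keys * 256 bytes is still far below 2^64, so no overflow here. */
    return key_count * bytes_per_key;
}
```

With 10^12 keys, `storage_bytes` gives 64 * 10^12 bytes for raw 64-byte keys and 16 * 10^12 bytes for 16-byte digests, i.e. the 64 TB versus 16 TB quoted above.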

The benchmark package (which allows everything to be reproduced) is freely downloadable with all the sources and binaries:
www.sanmayce.com/Lookupperorama_r11.zip
www.sanmayce.com/Gumbotron_logo.pdf

The benchmark has two parts:

As a quick test I randomly chose two testfiles, Cihai and Judaica.

Testfile: KAZE_(Dictionary_SpecificationLanguage(ABBYY_Software_House))_Hanyu_Cihai_newSea-of-Words(Zho-Zho).dsl (42,920,232 bytes)
Testmachine: laptop 'Brutalitto', AMD 4800H, max turbo 4.3GHz, 64GB DDR4 3200MHz, Windows 10
Hashtable: 26bit, i.e. 67,108,864 slots, greater than the file size (42,920,232 bytes), since with a perfect hasher the slots should outnumber the keys (the keys taken at each position could all be unique)

+--------------------------+-----------------------------+----------------------------------+---------------------------------+
| Hasher,                  | Number Of Hash Collisions = | RAW Hashing Speed (in one pass,  | Linear Hashing Speed,           |
| GCC-10.1 compiler        | Distinct Keys -             | at each position) for keys       | the whole file as one key       |
| -O3 -mavx                | Number Of Trees             | 4,6,8,10,12,14,16,18,36,64 bytes |                                 |
+--------------------------+-----------------------------+----------------------------------+---------------------------------+
| XXH3_64bits v0.8.0       |                  41,108,202 |      295,187,276 KEYS-PER-SECOND | 21,786,919,796 BYTES-PER-SECOND |
| CRC32C (_mm_crc32_u32)   |                  41,109,478 |      274,426,023 KEYS-PER-SECOND |  5,241,205,519 BYTES-PER-SECOND |
| XXH3_128bits v0.8.0      |                  41,111,196 |      214,493,903 KEYS-PER-SECOND | 20,331,706,300 BYTES-PER-SECOND |
| SHA3-224                 |                  41,111,291 |          153,854 KEYS-PER-SECOND |     22,319,413 BYTES-PER-SECOND |
| wyhash final             |                  41,112,870 |      449,897,589 KEYS-PER-SECOND | 15,086,197,539 BYTES-PER-SECOND |
| DoubleDeuceAES_Gumbotron |                  41,117,352 |      204,869,832 KEYS-PER-SECOND |  8,690,065,195 BYTES-PER-SECOND |
| FNV1A_Pippip             |                  41,488,327 |      449,897,589 KEYS-PER-SECOND |  8,101,214,043 BYTES-PER-SECOND |
+--------------------------+-----------------------------+----------------------------------+---------------------------------+

Note1: The second column holds the cumulative value, that is, the collisions for all orders 4..64 were summed.

Note2: Folding those 128 bits should lessen the collisions.
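The table sizing and the folding mentioned in Note2 could look roughly like the sketch below. It is an assumption, not the code from the benchmark package: `min_table_bits` and `fold_to_slot` are hypothetical names, and XOR-folding is just one common way to squeeze a 128-bit digest into a slot index.

```c
#include <stdint.h>
#include <string.h>

/* Smallest number of bits b such that 2^b slots exceed key_count.
   Assumes key_count < 2^63. */
static unsigned min_table_bits(uint64_t key_count)
{
    unsigned bits = 0;
    while ((UINT64_C(1) << bits) <= key_count)
        bits++;
    return bits;
}

/* XOR-folds a 16-byte digest down to a slot index of table_bits bits.
   Assumes table_bits is in the range 1..31. */
static uint32_t fold_to_slot(const uint8_t digest[16], unsigned table_bits)
{
    uint64_t lo, hi, x;
    uint32_t y, slot = 0;
    const uint32_t mask = (UINT32_C(1) << table_bits) - 1;

    memcpy(&lo, digest, 8);          /* first 64 bits of the digest  */
    memcpy(&hi, digest + 8, 8);      /* second 64 bits of the digest */
    x = lo ^ hi;                     /* fold 128 -> 64 bits          */
    y = (uint32_t)(x ^ (x >> 32));   /* fold 64 -> 32 bits           */
    while (y != 0) {                 /* fold 32 -> table_bits bits   */
        slot ^= y & mask;
        y >>= table_bits;
    }
    return slot;
}
```

For the two testfiles, `min_table_bits(42920232)` and `min_table_bits(107784192)` give the 26-bit and 27-bit tables used above.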

Testfile: TERAPIG_EncyclopaediaJudaica(in_22_volumes)_TXT.tar (107,784,192 bytes)
Testmachine: laptop 'Brutalitto', AMD 4800H, max turbo 4.3GHz, 64GB DDR4 3200MHz, Windows 10
Hashtable: 27bit, i.e. 134,217,728 slots, greater than the file size (107,784,192 bytes), since with a perfect hasher the slots should outnumber the keys (the keys taken at each position could all be unique)

+--------------------------+-----------------------------+----------------------------------+---------------------------------+
| Hasher,                  | Number Of Hash Collisions = | RAW Hashing Speed (in one pass,  | Linear Hashing Speed,           |
| GCC-10.1 compiler        | Distinct Keys -             | at each position) for keys       | the whole file as one key       |
| -O3 -mavx                | Number Of Trees             | 4,6,8,10,12,14,16,18,36,64 bytes |                                 |
+--------------------------+-----------------------------+----------------------------------+---------------------------------+
| DoubleDeuceAES_Gumbotron |                 135,752,271 |      204,640,573 KEYS-PER-SECOND |  8,742,330,440 BYTES-PER-SECOND |
| XXH3_128bits v0.8.0      |                 135,756,978 |      212,843,977 KEYS-PER-SECOND | 22,539,563,362 BYTES-PER-SECOND |
| wyhash final             |                 135,762,454 |      442,100,861 KEYS-PER-SECOND | 14,959,638,029 BYTES-PER-SECOND |
| XXH3_64bits v0.8.0       |                 135,763,366 |      290,994,033 KEYS-PER-SECOND | 22,464,400,166 BYTES-PER-SECOND |
| CRC32C (_mm_crc32_u32)   |                 135,764,628 |      252,599,460 KEYS-PER-SECOND |  5,241,402,061 BYTES-PER-SECOND |
| FNV1A_Pippip             |                 135,768,302 |      450,602,801 KEYS-PER-SECOND |  8,048,401,433 BYTES-PER-SECOND |
| SHA3-224                 |                 135,771,905 |          153,841 KEYS-PER-SECOND |     22,246,479 BYTES-PER-SECOND |
+--------------------------+-----------------------------+----------------------------------+---------------------------------+

Another twist: in order to test collisions, here comes my testbed of 1 trillion keys, each 128 bytes long. Since not enough memory is available, it was run with 1 billion keys.

Testset: "A billion Knight-Tours variants (each KT with 256 variants, the KT itself omitted) - each 128 bytes long"
Testfile: 1000000000.KnightTours.txt (130,000,000,000 bytes)

The name of the game: hashing every line and taking the first 5, 6, 7 or 8 bytes of the hash.
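Taking a prefix of the hash and dumping it as text can be sketched as below. `hex_prefix` is a hypothetical helper, and which end of the digest counts as "first" in the actual Hashliner tools is an assumption here (byte 0 of the 16-byte result).

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the first nbytes of digest as lowercase hex into out.
   out must have room for 2*nbytes + 1 characters. */
static void hex_prefix(const uint8_t *digest, size_t nbytes, char *out)
{
    static const char hexdigits[] = "0123456789abcdef";
    for (size_t i = 0; i < nbytes; i++) {
        out[2 * i]     = hexdigits[digest[i] >> 4];    /* high nibble */
        out[2 * i + 1] = hexdigits[digest[i] & 0x0F];  /* low nibble  */
    }
    out[2 * nbytes] = '\0';
}
```

A 7-byte prefix gives 14 hex characters per line, which matches the shape of the hash lists in the console dump further down.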

+------------------------+----------------------------------------------------------+
| Hasher                 |                          Collisions within first 5 bytes |
+------------------------+----------------------------------------------------------+
| XXH3_64bits v0.8.0     | 1,000,000,000 - 999,545,727 distinct lines =     454,273 |
| DoubleDeuceAES_128bits | 1,000,000,000 - 999,545,796 distinct lines =     454,204 |  
+------------------------+----------------------------------------------------------+

+------------------------+----------------------------------------------------------+
| Hasher                 |                          Collisions within first 6 bytes |
+------------------------+----------------------------------------------------------+
| XXH3_64bits v0.8.0     | 1,000,000,000 - 999,998,214 distinct lines =       1,786 |
| DoubleDeuceAES_128bits | 1,000,000,000 - 999,998,213 distinct lines =       1,787 |
+------------------------+----------------------------------------------------------+

+------------------------+----------------------------------------------------------+
| Hasher                 |                          Collisions within first 7 bytes |
+------------------------+----------------------------------------------------------+
| XXH3_64bits v0.8.0     | 1,000,000,000 - 999,999,989 distinct lines =          11 |
| DoubleDeuceAES_128bits | 1,000,000,000 - 999,999,994 distinct lines =           6 | 
+------------------------+----------------------------------------------------------+

+------------------------+----------------------------------------------------------+
| Hasher                 |                          Collisions within first 8 bytes |
+------------------------+----------------------------------------------------------+
| XXH3_64bits v0.8.0     | 1,000,000,000 - 1,000,000,000 distinct lines =         0 |
| DoubleDeuceAES_128bits | 1,000,000,000 - 1,000,000,000 distinct lines =         0 | 
+------------------------+----------------------------------------------------------+
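As a sanity check on the four tables above, the collision counts can be compared with what an ideal hash would give. The sketch below uses the usual birthday approximation n(n-1)/(2*2^bits); `expected_collisions` is a hypothetical helper, and the approximation counts colliding pairs, which is close to "keys minus distinct hashes" as long as collisions are rare.

```c
/* Expected number of collisions among `keys` ideal hashes truncated to
   hash_bits bits, by the birthday approximation n*(n-1) / (2 * 2^bits). */
static double expected_collisions(double keys, unsigned hash_bits)
{
    double slots = 1.0;
    for (unsigned i = 0; i < hash_bits; i++)
        slots *= 2.0;                /* slots = 2^hash_bits, exact in double */
    return keys * (keys - 1.0) / (2.0 * slots);
}
```

For 1,000,000,000 keys this gives roughly 454,700 collisions at 5 bytes (40 bits), 1,776 at 6 bytes, 6.9 at 7 bytes and 0.03 at 8 bytes. The measured 454,273 / 454,204, 1,786 / 1,787, 11 / 6 and 0 / 0 are all close to these figures, so both hashers behave about as an ideal hash would on this testset.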

This is what the console looks like:

C:\test\COLLISION_Hashliner>GENERATE_Xmillion_Knight-Tours.bat 1000000000
Generating 1000000000 Knight-Tours and dumping them into file ...

C:\test\COLLISION_Hashliner>Knight-Tour_FNV1A_YoshimitsuTRIADii_vs_CRC32_TRISMUS.exe a8 1000000000  1>1000000000.KnightTours.txt

C:\test\COLLISION_Hashliner>bench7.bat 1000000000.KnightTours.txt

C:\test\COLLISION_Hashliner>Hashliner_XXH3_dump7byteshash.exe 1000000000.KnightTours.txt  1>1000000000.KnightTours.txt.xxh3.txt

C:\test\COLLISION_Hashliner>Hashliner_DDAES_dump7byteshash.exe 1000000000.KnightTours.txt  1>1000000000.KnightTours.txt.DDAES.txt

C:\test\COLLISION_Hashliner>Sandokan_QuickSortExternal_Deduplicated_4+GB_64bit_Intel.exe 1000000000.KnightTours.txt.xxh3.txt /fast /descend 3000
Sandokan_QuickSortExternal_4+GB r.3+, written by Kaze, using Bill Durango's Quicksort source.
Size of input file: 16,000,000,000
Counting lines ...
Lines encountered: 1,000,000,000
Longest line (including CR if present): 15
Allocated memory for pointers-to-lines in MB: 7629
Assigning pointers ...
sizeof(int), sizeof(void*): 4, 8
Trying to allocate memory for the file itself in MB: 15258 ... OK! Get on with fast internal accesses.
Uploading ...
Sorting 1,000,000,000 Pointers ...
Quicksort (Insertionsort for small blocks) commenced ...
/ RightEnd: 000,328,304,267; NumberOfSplittings: 0,114,284,204; Done: 100% ...
NumberOfComparisons: 34,310,536,510
The time to sort 1,000,000,000 items via Quicksort+Insertionsort was 2,848,402 clocks.
Performance: 12,045,534 Comparisons_128B_long-Per-Second i.e 24,091,068 RandomReads_128B_long-Per-Second.
Dumping the sorted data (Regime=2)...
\ Done 100% ...
Dumped 1,000,000,000 lines.
OK! Incoming and resultant file's sizes match.
Dumping the sorted data [deduplicated] ...
Dumped 999,999,989 distinct lines.
Dump time: 460,940 clocks.
Total time: 3,347,265 clocks.
Performance: 4,780 bytes/clock.
Done successfully.

C:\test\COLLISION_Hashliner>sort /R QuickSortExternal_4+GB.distinct.txt  1>1000000000.KnightTours.txt.xxh3.7bytes.2orABOVE.txt

C:\test\COLLISION_Hashliner>Sandokan_QuickSortExternal_Deduplicated_4+GB_64bit_Intel.exe 1000000000.KnightTours.txt.DDAES.txt /fast /descend 3000
Sandokan_QuickSortExternal_4+GB r.3+, written by Kaze, using Bill Durango's Quicksort source.
Size of input file: 16,000,000,000
Counting lines ...
Lines encountered: 1,000,000,000
Longest line (including CR if present): 15
Allocated memory for pointers-to-lines in MB: 7629
Assigning pointers ...
sizeof(int), sizeof(void*): 4, 8
Trying to allocate memory for the file itself in MB: 15258 ... OK! Get on with fast internal accesses.
Uploading ...
Sorting 1,000,000,000 Pointers ...
Quicksort (Insertionsort for small blocks) commenced ...
- RightEnd: 000,759,555,061; NumberOfSplittings: 0,114,282,509; Done: 100% ...
NumberOfComparisons: 34,551,039,764
The time to sort 1,000,000,000 items via Quicksort+Insertionsort was 2,896,271 clocks.
Performance: 11,929,487 Comparisons_128B_long-Per-Second i.e 23,858,974 RandomReads_128B_long-Per-Second.
Dumping the sorted data (Regime=2)...
\ Done 100% ...
Dumped 1,000,000,000 lines.
OK! Incoming and resultant file's sizes match.
Dumping the sorted data [deduplicated] ...
Dumped 999,999,994 distinct lines.
Dump time: 458,406 clocks.
Total time: 3,393,196 clocks.
Performance: 4,715 bytes/clock.
Done successfully.

C:\test\COLLISION_Hashliner>sort /R QuickSortExternal_4+GB.distinct.txt  1>1000000000.KnightTours.txt.DDAES.7bytes.2orABOVE.txt

C:\test\COLLISION_Hashliner>dir *7b*

15-Aug-21  12:28               156 1000000000.KnightTours.txt.DDAES.7bytes.2orABOVE.txt
15-Aug-21  11:32               286 1000000000.KnightTours.txt.xxh3.7bytes.2orABOVE.txt

C:\test\COLLISION_Hashliner>type 1000000000.KnightTours.txt.xxh3.7bytes.2orABOVE.txt
0,000,002       f84627e722e85e
0,000,002       f0039d0c4e4fce
0,000,002       c87c64d97df0e7
0,000,002       bb4344a5546572
0,000,002       af2f628f4b3ffb
0,000,002       a8cb8675c94610
0,000,002       a742cf83948622
0,000,002       657cb9dff2d962
0,000,002       436aef7ab54ce7
0,000,002       270fcde0563670
0,000,002       0b533d70915c51

C:\test\COLLISION_Hashliner>

Note: The first column shows how many times the hash that follows occurs.
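The counting done by the sort-and-deduplicate step can be sketched in a few lines. This is not Sandokan's code: `count_distinct_sorted` is a hypothetical helper, and it assumes the hash prefixes are already sorted (in either direction) and fit into 64-bit integers.

```c
#include <stddef.h>
#include <stdint.h>

/* Counts the distinct values in a sorted array; equal values are adjacent,
   so a new value starts wherever an element differs from its predecessor. */
static size_t count_distinct_sorted(const uint64_t *sorted, size_t n)
{
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++)
        if (i == 0 || sorted[i] != sorted[i - 1])
            distinct++;
    return distinct;
}
```

The number of collisions is then `n - count_distinct_sorted(sorted, n)`, which is how 1,000,000,000 lines and 999,999,989 distinct lines give the 11 collisions listed for XXH3 at 7 bytes.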

TO-DO: I wish to run it with 1,000,000,000,000 keys (did it already, but in another game); it could help evaluate how badly a 128bit hash results in collision(s)...

And finally the function itself: https://godbolt.org/z/1o38b4W1K

// https://godbolt.org/ [
/*
#include <stdlib.h>
#include <stdint.h> //uint64_t needed
#include <string.h> 
#include <smmintrin.h> //SSE4.1 intrinsics
#include <wmmintrin.h>
void SlowCopy128bit (const char *SOURCE, char *TARGET) { _mm_storeu_si128((__m128i *)(TARGET), _mm_loadu_si128((const __m128i *)(SOURCE))); }
unsigned char DDAES[16];
//static const uint8_t VectorsNeedNonVAriable1[256] __attribute__((aligned(16))) =
static const uint8_t VectorsNeedNonVAriable1[256] =
{
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00
};
static const __m128i *Mumbotron = (const __m128i *) VectorsNeedNonVAriable1;
//static const uint8_t VectorsNeedNonVAriable2[256] __attribute__((aligned(16))) =
static const uint8_t VectorsNeedNonVAriable2[256] =
{
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF
}; 
static const __m128i *Jumbotron = (const __m128i *) VectorsNeedNonVAriable2;
// Written by Sanmayce, inspired by J. Andrew Rogers's https://github.com/jandrewrogers/AquaHash/blob/master/aquahash.h
// This hash function serves two ... functions: it is useful for table lookups and for shrinking keys (usually 64...256 bytes in length) down to 16 bytes.
// When the key length is not a multiple of 16 (so padding is needed), it uses Pippip's approach of reading past the end (the "dirty" sentinel-like style, or more like padding):
void DoubleDeuceAES_Gumbotron(const uint8_t *buffer, size_t length) {
    size_t i, Cycles;
    __m128i hashA = _mm_set_epi64x(0x6c62272e07bb0142, 0x62b821756295c58d); // 0x6c62272e07bb014262b821756295c58d // _mm_setzero_si128();
    __m128i hashB = _mm_set_epi64x(0xdd268dbcaac55036, 0x2d98c384c4e576cc); // 0xdd268dbcaac550362d98c384c4e576ccc8b1536847b6bbb31023b4c8caee0535 // FNV offset basis
    __m128i hashC = _mm_set_epi64x(0xc8b1536847b6bbb3, 0x1023b4c8caee0535); // 0xdd268dbcaac550362d98c384c4e576ccc8b1536847b6bbb31023b4c8caee0535 // FNV offset basis
    __m128i hashD = _mm_setzero_si128();
    __m128i a0,a1,a2,a3; // Instead of this chunkenization, ZMM houses the 4 XMMs, if there is shuffle across all the 512bits, use it. There is, but __m256i _mm256_shuffle_epi8(__m256i a, __m256i b) is more handy.
    __m128i b0,b1,b2,b3;
    __m128i c0,c1,c2,c3;
    __m128i d0,d1,d2,d3;
    __m128i tmp0,tmp1,tmp2,tmp3;
    __m128i ReverseMask =   _mm_set_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m128i PartialInterleavingMask1 = _mm_set_epi8(0x80,7,0x80,6,0x80,5,0x80,4,0x80,3,0x80,2,0x80,1,0x80,0);
    __m128i PartialInterleavingMask2 = _mm_set_epi8(0x80,0xf,0x80,0xe,0x80,0xd,0x80,0xc,0x80,0xb,0x80,0xa,0x80,9,0x80,8); 
    __m128i PartialInterleavingMask3 = _mm_set_epi8(7,0x80,6,0x80,5,0x80,4,0x80,3,0x80,2,0x80,1,0x80,0,0x80);
    __m128i PartialInterleavingMask4 = _mm_set_epi8(0xf,0x80,0xe,0x80,0xd,0x80,0xc,0x80,0xb,0x80,0xa,0x80,9,0x80,8,0x80);
    const __m128i *ptr128a, *ptr128b, *ptr128c, *ptr128d;

    __m128i AgainstRules, GumbotronREVER, GumbotronINTER, Gumbotron, GumbotronREVERINTER;
    const __m128i *ptr128; 
    __m128i InterleaveMask =   _mm_set_epi8(15,7,14,6,13,5,12,4,11,3,10,2,9,1,8,0);

    if (length >= 64) {
        Cycles = length/64;
        for(; Cycles--; buffer += 64) {
            a0 = _mm_loadu_si128((__m128i *)(buffer+0*16)); 
            a1 = _mm_loadu_si128((__m128i *)(buffer+1*16)); 
            a2 = _mm_loadu_si128((__m128i *)(buffer+2*16)); 
            a3 = _mm_loadu_si128((__m128i *)(buffer+3*16)); 
            b0 = _mm_shuffle_epi8 (a3, ReverseMask); 
            b1 = _mm_shuffle_epi8 (a2, ReverseMask); 
            b2 = _mm_shuffle_epi8 (a1, ReverseMask); 
            b3 = _mm_shuffle_epi8 (a0, ReverseMask); 
            tmp0 = _mm_shuffle_epi8 (a0, PartialInterleavingMask1);
            tmp1 = _mm_shuffle_epi8 (a0, PartialInterleavingMask2);
            tmp2 = _mm_shuffle_epi8 (a2, PartialInterleavingMask3);
            tmp3 = _mm_shuffle_epi8 (a2, PartialInterleavingMask4);
            c0 = _mm_or_si128 (tmp0, tmp2);
            c1 = _mm_or_si128 (tmp1, tmp3);
            tmp0 = _mm_shuffle_epi8 (a1, PartialInterleavingMask1);
            tmp1 = _mm_shuffle_epi8 (a1, PartialInterleavingMask2);
            tmp2 = _mm_shuffle_epi8 (a3, PartialInterleavingMask3);
            tmp3 = _mm_shuffle_epi8 (a3, PartialInterleavingMask4);
            c2 = _mm_or_si128 (tmp0, tmp2);
            c3 = _mm_or_si128 (tmp1, tmp3);
            tmp0 = _mm_shuffle_epi8 (b0, PartialInterleavingMask1);
            tmp1 = _mm_shuffle_epi8 (b0, PartialInterleavingMask2);
            tmp2 = _mm_shuffle_epi8 (b2, PartialInterleavingMask3);
            tmp3 = _mm_shuffle_epi8 (b2, PartialInterleavingMask4);
            d0 = _mm_or_si128 (tmp0, tmp2);
            d1 = _mm_or_si128 (tmp1, tmp3);
            tmp0 = _mm_shuffle_epi8 (b1, PartialInterleavingMask1);
            tmp1 = _mm_shuffle_epi8 (b1, PartialInterleavingMask2);
            tmp2 = _mm_shuffle_epi8 (b3, PartialInterleavingMask3);
            tmp3 = _mm_shuffle_epi8 (b3, PartialInterleavingMask4);
            d2 = _mm_or_si128 (tmp0, tmp2);
            d3 = _mm_or_si128 (tmp1, tmp3);

            hashA = _mm_aesenc_si128(hashA, a0);
            hashB = _mm_aesenc_si128(hashB, b0);
            hashC = _mm_aesenc_si128(hashC, c0);
            hashD = _mm_aesenc_si128(hashD, d0);

            hashA = _mm_aesenc_si128(hashA, a1);
            hashB = _mm_aesenc_si128(hashB, b1);
            hashC = _mm_aesenc_si128(hashC, c1);
            hashD = _mm_aesenc_si128(hashD, d1);

            hashA = _mm_aesenc_si128(hashA, a2);
            hashB = _mm_aesenc_si128(hashB, b2);
            hashC = _mm_aesenc_si128(hashC, c2);
            hashD = _mm_aesenc_si128(hashD, d2);

            hashA = _mm_aesenc_si128(hashA, a3);
            hashB = _mm_aesenc_si128(hashB, b3);
            hashC = _mm_aesenc_si128(hashC, c3);
            hashD = _mm_aesenc_si128(hashD, d3);

            hashA = _mm_aesenc_si128(hashA, hashB);
            hashA = _mm_aesenc_si128(hashA, hashC);
            hashA = _mm_aesenc_si128(hashA, hashD);
            length = length - 64;
        }
    }

    ptr128 = (__m128i *)buffer;
    if (length >=16) {
        Cycles = length/16;
        for(; Cycles--; buffer += 16) {
            AgainstRules = _mm_loadu_si128(ptr128++);
            GumbotronREVER = _mm_shuffle_epi8 (AgainstRules, ReverseMask);
            GumbotronINTER = _mm_shuffle_epi8 (AgainstRules, InterleaveMask);
            GumbotronREVERINTER = _mm_shuffle_epi8 (GumbotronREVER, InterleaveMask);
            hashA = _mm_aesenc_si128(hashA, AgainstRules);
            hashB = _mm_aesenc_si128(hashB, GumbotronREVER);
            hashC = _mm_aesenc_si128(hashC, GumbotronINTER);
            hashD = _mm_aesenc_si128(hashD, GumbotronREVERINTER);
            hashA = _mm_aesenc_si128(hashA, hashB);
            hashA = _mm_aesenc_si128(hashA, hashC);
            hashA = _mm_aesenc_si128(hashA, hashD);
            length = length - 16;
        }
    } 
    // Here, using Pippip's approach of reading past the end (the "dirty" sentinel-like style, or more like padding):
    if (length&(16-1)) {
        AgainstRules = _mm_loadu_si128(ptr128);     
        //AgainstRules = _mm_srli_si128 (AgainstRules, 16-length); // catastrophic error: Intrinsic parameter must be an immediate value
        AgainstRules = _mm_and_si128 (AgainstRules, Mumbotron[length]);
        //Gumbotron = _mm_slli_si128 (Gumbotron, 16-length); // catastrophic error: Intrinsic parameter must be an immediate value
        Gumbotron = _mm_and_si128 (hashB, Jumbotron[length]);
        AgainstRules = _mm_or_si128 (AgainstRules, Gumbotron);

        GumbotronREVER = _mm_shuffle_epi8 (AgainstRules, ReverseMask);
        GumbotronINTER = _mm_shuffle_epi8 (AgainstRules, InterleaveMask);
        GumbotronREVERINTER = _mm_shuffle_epi8 (GumbotronREVER, InterleaveMask);
        hashA = _mm_aesenc_si128(hashA, AgainstRules);
        hashB = _mm_aesenc_si128(hashB, GumbotronREVER);
        hashC = _mm_aesenc_si128(hashC, GumbotronINTER);
        hashD = _mm_aesenc_si128(hashD, GumbotronREVERINTER);
        hashA = _mm_aesenc_si128(hashA, hashB);
        hashA = _mm_aesenc_si128(hashA, hashC);
        hashA = _mm_aesenc_si128(hashA, hashD);
    }
    SlowCopy128bit( (const char *)(&hashA), (char *)&DDAES[0]);
}
*/
// https://godbolt.org/ ]
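For readers without the intrinsics at hand, the byte shuffles and the tail padding of the function above can be modelled in plain scalar C. This is an explanatory sketch, not a replacement for the SIMD code: `shuffle16`, `REVERSE_MASK`, `INTERLEAVE_MASK` and `blend_tail` are hypothetical names. The masks are written in memory order (byte 0 first), whereas `_mm_set_epi8` lists byte 15 first.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of _mm_shuffle_epi8: a mask byte with its top bit set
   yields 0, otherwise its low 4 bits select a source byte. */
static void shuffle16(const uint8_t src[16], const uint8_t mask[16], uint8_t dst[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* ReverseMask and InterleaveMask from the source, in memory order. */
static const uint8_t REVERSE_MASK[16] =
    {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
static const uint8_t INTERLEAVE_MASK[16] =
    {0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15};

/* Model of the Mumbotron/Jumbotron step: the first tail_len bytes (1..15)
   come from the key, the remaining bytes come from pad (hashB in the source).
   Unlike the SIMD version, this never reads past the end of the key. */
static void blend_tail(const uint8_t *key_tail, size_t tail_len,
                       const uint8_t pad[16], uint8_t block[16])
{
    for (size_t i = 0; i < 16; i++)
        block[i] = (i < tail_len) ? key_tail[i] : pad[i];
}
```

Reading the tail byte by byte like this avoids the out-of-bounds load at the cost of speed, which is the trade-off the "read past the end" comment in the source refers to.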

// gcc 10.1 -O3 -mavx -maes

SlowCopy128bit(char const*, char*):
        vmovdqu xmm0, XMMWORD PTR [rdi]
        vmovdqu XMMWORD PTR [rsi], xmm0
        ret
DoubleDeuceAES_Gumbotron(unsigned char const*, unsigned long):
        sub     rsp, 96
        cmp     rsi, 63
        jbe     .L9
        vmovdqa xmm5, XMMWORD PTR .LC0[rip]
        mov     rax, rsi
        mov     rdx, rsi
        vpxor   xmm6, xmm6, xmm6
        and     rax, -64
        vmovdqa xmm10, XMMWORD PTR .LC1[rip]
        vmovdqa xmm12, XMMWORD PTR .LC2[rip]
        shr     rdx, 6
        vmovdqa xmm13, XMMWORD PTR .LC3[rip]
        vmovdqa xmm15, XMMWORD PTR .LC6[rip]
        add     rax, rdi
        vmovdqa XMMWORD PTR [rsp-120], xmm6
        vmovdqa xmm14, XMMWORD PTR .LC7[rip]
        vmovdqa XMMWORD PTR [rsp-104], xmm5
.L5:
        vmovdqu xmm0, XMMWORD PTR [rdi]
        vmovdqu xmm11, XMMWORD PTR [rdi+32]
        add     rdi, 64
        vmovdqu xmm4, XMMWORD PTR [rdi-32]
        vmovdqu xmm7, XMMWORD PTR [rdi-16]
        vpshufb xmm2, xmm0, XMMWORD PTR .LC5[rip]
        vpshufb xmm9, xmm11, xmm14
        vpshufb xmm3, xmm0, XMMWORD PTR .LC4[rip]
        vpshufb xmm8, xmm4, xmm13
        vpshufb xmm4, xmm0, xmm13
        vmovdqa XMMWORD PTR [rsp-72], xmm2
        vpshufb xmm2, xmm11, xmm15
        vmovdqu xmm11, XMMWORD PTR [rdi-48]
        vpshufb xmm6, xmm4, xmm15
        vpshufb xmm1, xmm7, xmm13
        vmovdqa XMMWORD PTR [rsp-56], xmm9
        vmovdqa XMMWORD PTR [rsp+56], xmm6
        vmovdqu xmm9, XMMWORD PTR [rdi-16]
        vpshufb xmm6, xmm4, xmm14
        vpshufb xmm5, xmm8, XMMWORD PTR .LC4[rip]
        vmovdqa XMMWORD PTR [rsp-88], xmm3
        vmovdqu xmm3, XMMWORD PTR [rdi-48]
        vpor    xmm2, xmm2, XMMWORD PTR [rsp-88]
        vpshufb xmm11, xmm11, XMMWORD PTR .LC4[rip]
        vmovdqa XMMWORD PTR [rsp-40], xmm11
        vmovdqu xmm11, XMMWORD PTR [rdi-16]
        vpshufb xmm9, xmm9, xmm15
        vmovdqu xmm7, XMMWORD PTR [rdi-48]
        vmovdqa XMMWORD PTR [rsp+72], xmm6
        vaesenc xmm12, xmm12, xmm1
        vaesenc xmm2, xmm10, xmm2
        vmovdqa xmm6, XMMWORD PTR [rsp-104]
        vmovdqa XMMWORD PTR [rsp-8], xmm9
        vpshufb xmm11, xmm11, xmm14
        vpshufb xmm7, xmm7, xmm13
        vpshufb xmm3, xmm3, XMMWORD PTR .LC5[rip]
        vaesenc xmm0, xmm6, xmm0
        vmovdqa XMMWORD PTR [rsp-24], xmm3
        vpshufb xmm9, xmm1, XMMWORD PTR .LC5[rip]
        vmovdqa xmm6, XMMWORD PTR [rsp-120]
        vaesenc xmm0, xmm0, XMMWORD PTR [rdi-48]
        vmovdqa XMMWORD PTR [rsp+8], xmm11
        vaesenc xmm12, xmm12, xmm8
        vpshufb xmm3, xmm1, XMMWORD PTR .LC4[rip]
        vmovdqa XMMWORD PTR [rsp+24], xmm9
        vpshufb xmm11, xmm7, xmm15
        vpshufb xmm9, xmm7, xmm14
        vaesenc xmm12, xmm12, xmm7
        vmovdqa XMMWORD PTR [rsp+40], xmm5
        vpor    xmm3, xmm3, xmm11
        vaesenc xmm12, xmm12, xmm4
        vpshufb xmm5, xmm8, XMMWORD PTR .LC5[rip]
        vmovdqa xmm1, XMMWORD PTR [rsp-56]
        vpor    xmm10, xmm1, XMMWORD PTR [rsp-72]
        vaesenc xmm3, xmm6, xmm3
        vaesenc xmm0, xmm0, XMMWORD PTR [rdi-32]
        vmovdqa xmm7, XMMWORD PTR [rsp-8]
        vpor    xmm9, xmm9, XMMWORD PTR [rsp+24]
        vaesenc xmm0, xmm0, XMMWORD PTR [rdi-16]
        vaesenc xmm2, xmm2, xmm10
        vpor    xmm10, xmm7, XMMWORD PTR [rsp-40]
        vmovdqa xmm4, XMMWORD PTR [rsp+8]
        vmovdqa xmm7, XMMWORD PTR [rsp+56]
        vpor    xmm6, xmm7, XMMWORD PTR [rsp+40]
        vaesenc xmm3, xmm3, xmm9
        vaesenc xmm2, xmm2, xmm10
        vpor    xmm10, xmm4, XMMWORD PTR [rsp-24]
        vmovdqa xmm4, XMMWORD PTR [rsp+72]
        vaesenc xmm3, xmm3, xmm6
        vaesenc xmm10, xmm2, xmm10
        vpor    xmm6, xmm4, xmm5
        vaesenc xmm5, xmm0, xmm12
        vaesenc xmm1, xmm3, xmm6
        vaesenc xmm5, xmm5, xmm10
        vaesenc xmm6, xmm5, xmm1
        vmovdqa XMMWORD PTR [rsp-120], xmm1
        vmovdqa XMMWORD PTR [rsp-104], xmm6
        cmp     rdi, rax
        jne     .L5
        sal     rdx, 6
        vmovdqa xmm5, xmm6
        vmovdqa xmm6, xmm1
        sub     rsi, rdx
        cmp     rsi, 15
        ja      .L16
.L6:
        test    sil, 15
        je      .L8
        sal     rsi, 4
        vmovdqu xmm0, XMMWORD PTR [rdi]
        vmovdqa xmm2, XMMWORD PTR .LC8[rip]
        vpand   xmm1, xmm12, XMMWORD PTR VectorsNeedNonVAriable2[rsi]
        vpand   xmm0, xmm0, XMMWORD PTR VectorsNeedNonVAriable1[rsi]
        vpor    xmm0, xmm0, xmm1
        vpshufb xmm1, xmm0, XMMWORD PTR .LC3[rip]
        vpshufb xmm3, xmm0, xmm2
        vaesenc xmm0, xmm5, xmm0
        vpshufb xmm2, xmm1, xmm2
        vaesenc xmm1, xmm12, xmm1
        vaesenc xmm3, xmm10, xmm3
        vaesenc xmm1, xmm0, xmm1
        vaesenc xmm2, xmm6, xmm2
        vaesenc xmm0, xmm1, xmm3
        vaesenc xmm5, xmm0, xmm2
.L8:
        vmovdqa XMMWORD PTR DDAES[rip], xmm5
        add     rsp, 96
        ret
.L9:
        vmovdqa xmm5, XMMWORD PTR .LC0[rip]
        vmovdqa xmm10, XMMWORD PTR .LC1[rip]
        vpxor   xmm6, xmm6, xmm6
        vmovdqa xmm12, XMMWORD PTR .LC2[rip]
        cmp     rsi, 15
        jbe     .L6
.L16:
        mov     rax, rsi
        mov     rdx, rsi
        vmovdqa xmm13, XMMWORD PTR .LC3[rip]
        vmovdqa xmm2, XMMWORD PTR .LC8[rip]
        and     rax, -16
        shr     rdx, 4
        add     rax, rdi
.L7:
        vmovdqu xmm0, XMMWORD PTR [rdi]
        add     rdi, 16
        vpshufb xmm1, xmm0, xmm13
        vpshufb xmm4, xmm0, xmm2
        vaesenc xmm5, xmm5, xmm0
        vaesenc xmm12, xmm12, xmm1
        vpshufb xmm3, xmm1, xmm2
        vaesenc xmm10, xmm10, xmm4
        vaesenc xmm5, xmm5, xmm12
        vaesenc xmm6, xmm6, xmm3
        vaesenc xmm5, xmm5, xmm10
        vaesenc xmm5, xmm5, xmm6
        cmp     rax, rdi
        jne     .L7
        sal     rdx, 4
        sub     rsi, rdx
        jmp     .L6
...

// icc 19.0.0 -O3 -mavx

SlowCopy128bit(char const*, char*):
        vmovdqu   xmm0, XMMWORD PTR [rdi]                       #6.129
        vmovdqu   XMMWORD PTR [rsi], xmm0                       #6.86
        ret                                                     #6.141
DoubleDeuceAES_Gumbotron(unsigned char const*, unsigned long):
        vmovups   xmm8, XMMWORD PTR .L_2il0floatpacket.0[rip]   #52.18
        vpxor     xmm15, xmm15, xmm15                           #55.18
        vmovups   xmm4, XMMWORD PTR .L_2il0floatpacket.1[rip]   #53.18
        vmovups   xmm1, XMMWORD PTR .L_2il0floatpacket.2[rip]   #54.18
        vmovdqu   xmm0, XMMWORD PTR .L_2il0floatpacket.3[rip]   #61.26
        vmovdqu   xmm7, XMMWORD PTR .L_2il0floatpacket.4[rip]   #62.37
        vmovdqu   xmm5, XMMWORD PTR .L_2il0floatpacket.5[rip]   #63.37
        vmovdqu   xmm6, XMMWORD PTR .L_2il0floatpacket.6[rip]   #64.37
        vmovdqu   xmm3, XMMWORD PTR .L_2il0floatpacket.7[rip]   #65.37
        vmovdqu   xmm2, XMMWORD PTR .L_2il0floatpacket.8[rip]   #70.29
        cmp       rsi, 64                                       #72.16
        jb        ..B2.6        # Prob 50%                      #72.16
        mov       rax, rsi                                      #73.19
        shr       rax, 6                                        #73.19
        dec       rax                                           #74.9
        cmp       rax, -1                                       #74.9
        je        ..B2.7        # Prob 10%                      #74.9
..B2.4:                         # Preds ..B2.2 ..B2.4
        vmovdqu   xmm7, XMMWORD PTR [48+rdi]                    #78.37
        vmovdqu   xmm0, XMMWORD PTR [rdi]                       #75.37
        vmovdqu   xmm5, XMMWORD PTR [16+rdi]                    #76.37
        vmovdqu   xmm3, XMMWORD PTR [32+rdi]                    #77.37
        vmovdqu   xmm12, XMMWORD PTR .L_2il0floatpacket.3[rip]  #79.9
        dec       rax                                           #74.9
        vpshufb   xmm9, xmm7, xmm12                             #79.9
        vpshufb   xmm13, xmm3, xmm12                            #80.9
        vaesenc   xmm10, xmm8, xmm0                             #108.12
        add       rsi, -64                                      #131.22
        vaesenc   xmm8, xmm10, xmm5                             #113.12
        add       rdi, 64                                       #74.19
        vaesenc   xmm11, xmm4, xmm9                             #109.12
        vaesenc   xmm4, xmm8, xmm3                              #118.12
        vpshufb   xmm8, xmm5, xmm12                             #81.9
        vpshufb   xmm12, xmm0, xmm12                            #82.9
        vaesenc   xmm14, xmm11, xmm13                           #114.12
        vaesenc   xmm14, xmm14, xmm8                            #119.12
        vaesenc   xmm11, xmm4, xmm7                             #123.12
        vaesenc   xmm10, xmm14, xmm12                           #124.12
        vmovups   XMMWORD PTR [-24+rsp], xmm10                  #124.12[spill]
        vaesenc   xmm14, xmm11, xmm10                           #128.12
        vmovdqu   xmm10, XMMWORD PTR .L_2il0floatpacket.4[rip]  #83.11
        vmovdqu   xmm11, XMMWORD PTR .L_2il0floatpacket.5[rip]  #84.11
        vpshufb   xmm4, xmm0, xmm10                             #83.11
        vpshufb   xmm2, xmm0, xmm11                             #84.11
        vpshufb   xmm0, xmm3, xmm6                              #85.11
        vpor      xmm4, xmm4, xmm0                              #87.9
        vaesenc   xmm1, xmm1, xmm4                              #110.12
        vmovdqu   xmm4, XMMWORD PTR .L_2il0floatpacket.7[rip]   #86.11
        vpshufb   xmm3, xmm3, xmm4                              #86.11
        vpshufb   xmm0, xmm5, xmm10                             #89.11
        vpor      xmm2, xmm2, xmm3                              #88.9
        vaesenc   xmm2, xmm1, xmm2                              #115.12
        vpshufb   xmm1, xmm5, xmm11                             #90.11
        vpshufb   xmm5, xmm7, xmm6                              #91.11
        vpshufb   xmm3, xmm8, xmm6                              #97.11
        vpshufb   xmm7, xmm7, xmm4                              #92.11
        vpshufb   xmm8, xmm8, xmm4                              #98.11
        vpshufb   xmm4, xmm12, xmm4                             #104.11
        vpor      xmm0, xmm0, xmm5                              #93.9
        vpor      xmm1, xmm1, xmm7                              #94.9
        vaesenc   xmm2, xmm2, xmm0                              #120.12
        vpshufb   xmm0, xmm9, xmm10                             #95.11
        vpshufb   xmm10, xmm13, xmm10                           #101.11
        vpor      xmm5, xmm0, xmm3                              #99.9
        vaesenc   xmm0, xmm15, xmm5                             #111.12
        vpshufb   xmm15, xmm9, xmm11                            #96.11
        vpshufb   xmm11, xmm13, xmm11                           #102.11
        vaesenc   xmm1, xmm2, xmm1                              #125.12
        vpor      xmm15, xmm15, xmm8                            #100.9
        vpshufb   xmm2, xmm12, xmm6                             #103.11
        vaesenc   xmm3, xmm0, xmm15                             #116.12
        vpor      xmm5, xmm10, xmm2                             #105.9
        vaesenc   xmm7, xmm3, xmm5                              #121.12
        vpor      xmm9, xmm11, xmm4                             #106.9
        vaesenc   xmm15, xmm7, xmm9                             #126.12
        vaesenc   xmm13, xmm14, xmm1                            #129.12
        vaesenc   xmm8, xmm13, xmm15                            #130.12
        vmovups   xmm4, XMMWORD PTR [-24+rsp]                   #74.9[spill]
        cmp       rax, -1                                       #74.9
        jne       ..B2.4        # Prob 82%                      #74.9
        vmovdqu   xmm2, XMMWORD PTR .L_2il0floatpacket.8[rip]   #
        vmovdqu   xmm0, XMMWORD PTR .L_2il0floatpacket.3[rip]   #
..B2.6:                         # Preds ..B2.5 ..B2.1
        cmp       rsi, 16                                       #136.15
        jb        ..B2.11       # Prob 50%                      #136.15
..B2.7:                         # Preds ..B2.2 ..B2.6
        mov       rax, rsi                                      #137.19
        shr       rax, 4                                        #137.19
        dec       rax                                           #138.9
        cmp       rax, -1                                       #138.9
        je        ..B2.11       # Prob 10%                      #138.9
..B2.9:                         # Preds ..B2.7 ..B2.9
        vmovdqu   xmm7, XMMWORD PTR [rdi]                       #139.35
        vpshufb   xmm5, xmm7, xmm0                              #140.21
        vpshufb   xmm3, xmm7, xmm2                              #141.21
        vpshufb   xmm6, xmm5, xmm2                              #142.26
        vaesenc   xmm4, xmm4, xmm5                              #144.12
        dec       rax                                           #138.9
        vaesenc   xmm8, xmm8, xmm7                              #143.12
        add       rdi, 16                                       #139.35
        vaesenc   xmm1, xmm1, xmm3                              #145.12
        add       rsi, -16                                      #150.22
        vaesenc   xmm9, xmm8, xmm4                              #147.12
        vaesenc   xmm15, xmm15, xmm6                            #146.12
        vaesenc   xmm10, xmm9, xmm1                             #148.12
        vaesenc   xmm8, xmm10, xmm15                            #149.12
        cmp       rax, -1                                       #138.9
        jne       ..B2.9        # Prob 82%                      #138.9
..B2.11:                        # Preds ..B2.9 ..B2.7 ..B2.6
        test      rsi, 15                                       #154.13
        je        ..B2.13       # Prob 50%                      #154.13
        shl       rsi, 4                                        #157.47
        vmovdqu   xmm3, XMMWORD PTR [rdi]                       #157.18
        mov       rax, QWORD PTR Mumbotron[rip]                 #157.18
        mov       rdx, QWORD PTR Jumbotron[rip]                 #159.15
        vpand     xmm5, xmm3, XMMWORD PTR [rsi+rax]             #157.18
        vpand     xmm6, xmm4, XMMWORD PTR [rsi+rdx]             #159.15
        vpor      xmm7, xmm5, xmm6                              #160.18
        vpshufb   xmm10, xmm7, xmm0                             #162.20
        vpshufb   xmm0, xmm7, xmm2                              #163.20
        vpshufb   xmm2, xmm10, xmm2                             #164.25
        vaesenc   xmm8, xmm8, xmm7                              #165.12
        vaesenc   xmm4, xmm4, xmm10                             #166.12
        vaesenc   xmm9, xmm8, xmm4                              #169.12
        vaesenc   xmm1, xmm1, xmm0                              #167.12
        vaesenc   xmm11, xmm9, xmm1                             #170.12
        vaesenc   xmm12, xmm15, xmm2                            #168.12
        vaesenc   xmm8, xmm11, xmm12                            #171.12
..B2.13:                        # Preds ..B2.12 ..B2.11
        vmovups   XMMWORD PTR DDAES[rip], xmm8                  #6.86
        ret                                                     #174.1
__sti__$E:
        mov       QWORD PTR Mumbotron[rip], offset flat: VectorsNeedNonVAriable1 #28.47
        mov       QWORD PTR Jumbotron[rip], offset flat: VectorsNeedNonVAriable2 #49.47
        ret                                                     #28.47
Mumbotron:
Jumbotron:
DDAES:
VectorsNeedNonVAriable1:
...
VectorsNeedNonVAriable2:
...
.L_2il0floatpacket.0:
        .long   0x6295c58d,0x62b82175,0x07bb0142,0x6c62272e
.L_2il0floatpacket.1:
        .long   0xc4e576cc,0x2d98c384,0xaac55036,0xdd268dbc
.L_2il0floatpacket.2:
        .long   0xcaee0535,0x1023b4c8,0x47b6bbb3,0xc8b15368
.L_2il0floatpacket.3:
        .long   0x0c0d0e0f,0x08090a0b,0x04050607,0x00010203
.L_2il0floatpacket.4:
        .long   0x80018000,0x80038002,0x80058004,0x80078006
.L_2il0floatpacket.5:
        .long   0x80098008,0x800b800a,0x800d800c,0x800f800e
.L_2il0floatpacket.6:
        .long   0x01800080,0x03800280,0x05800480,0x07800680
.L_2il0floatpacket.7:
        .long   0x09800880,0x0b800a80,0x0d800c80,0x0f800e80
.L_2il0floatpacket.8:
        .long   0x09010800,0x0b030a02,0x0d050c04,0x0f070e06

I hope someone will improve on it and share the result.
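
For anyone reading the listing without the C source at hand, here is a small scalar sketch of what the shuffle constants appear to do. It is a portable model, not the actual Gumbotron source: `pshufb128`, `longs_to_bytes`, `reverse_bytes` and `interleave_low` are hypothetical helper names, and only the three constants are copied from `.L_2il0floatpacket.3`, `.4` and `.6` above. Under this model, packet 3 reverses the 16 bytes of a block, and the `vpshufb`/`vpshufb`/`vpor` triple with packets 4 and 6 interleaves the low 8 bytes of two blocks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of 128-bit PSHUFB: a mask byte with bit 7 set produces zero,
   otherwise its low 4 bits select a source byte. */
void pshufb128(uint8_t dst[16], const uint8_t src[16], const uint8_t mask[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    memcpy(dst, tmp, 16); /* copy last, so dst may alias src */
}

/* Expand four little-endian .long values into the 16 bytes they occupy in memory. */
void longs_to_bytes(uint8_t out[16], const uint32_t w[4])
{
    for (int i = 0; i < 4; i++)
        for (int b = 0; b < 4; b++)
            out[4 * i + b] = (uint8_t)(w[i] >> (8 * b));
}

/* Constants copied from the listing (.L_2il0floatpacket.3, .4 and .6). */
static const uint32_t PACKET3[4] = {0x0c0d0e0f, 0x08090a0b, 0x04050607, 0x00010203};
static const uint32_t PACKET4[4] = {0x80018000, 0x80038002, 0x80058004, 0x80078006};
static const uint32_t PACKET6[4] = {0x01800080, 0x03800280, 0x05800480, 0x07800680};

/* Model of: vpshufb dst, src, packet3 */
void reverse_bytes(uint8_t dst[16], const uint8_t src[16])
{
    uint8_t mask[16];
    assert(dst != NULL && src != NULL);
    longs_to_bytes(mask, PACKET3);
    pshufb128(dst, src, mask);
}

/* Model of: vpshufb t0, a, packet4 ; vpshufb t1, b, packet6 ; vpor dst, t0, t1 */
void interleave_low(uint8_t dst[16], const uint8_t a[16], const uint8_t b[16])
{
    uint8_t m4[16], m6[16], t0[16], t1[16];
    longs_to_bytes(m4, PACKET4);
    longs_to_bytes(m6, PACKET6);
    pshufb128(t0, a, m4); /* low 8 bytes of a go to even positions, odd ones are zero */
    pshufb128(t1, b, m6); /* low 8 bytes of b go to odd positions, even ones are zero */
    for (int i = 0; i < 16; i++)
        dst[i] = (uint8_t)(t0[i] | t1[i]);
}
```

If this reading is right, the packet 4 to packet 7 shuffles plus `vpor` do by hand what a single `vpunpcklbw`/`vpunpckhbw` does, which is one place where the instruction count could shrink.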

Sanmayce commented 3 years ago

Let us see which is the fastest 128-bit hasher, for 128-byte keys in particular ...

The package (all sources included) needed to reproduce all the runs is freely downloadable at: www.sanmayce.com/Gumbotron_vs_XXH128.zip

The count of 200M keys was chosen so that the test fits into 32GB machines; roughly 25GB is needed.
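
The memory figure can be checked with a few lines of C. This is only a sketch of the arithmetic: `keyset_bytes` and `bytes_to_gib` are hypothetical helper names, and the calculation counts the key payload only (no line terminators, no bookkeeping the benchmark may add).

```c
#include <assert.h>
#include <stdint.h>

/* Total payload size of key_count keys of key_len bytes each. */
uint64_t keyset_bytes(uint64_t key_count, uint64_t key_len)
{
    /* guard against 64-bit overflow of the product */
    assert(key_len == 0 || key_count <= UINT64_MAX / key_len);
    return key_count * key_len;
}

/* Convert a byte count to GiB (2^30 bytes). */
double bytes_to_gib(uint64_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0 * 1024.0);
}
```

Calling `keyset_bytes(200000000, 128)` gives 25,600,000,000 bytes, i.e. 25.6 GB in decimal units or about 23.84 GiB, which leaves some headroom on a 32GB machine.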

On laptop 'Compressionette' (Kaby Lake i5-7200U 2.5GHz (3.1GHz max turbo), 36GB DDR4 2133MHz, Windows 10) Gumbotron_YMM hashes 4,487/3,160 = 1.42x faster than XXH128:

G:\Lookupperorama_r13\COLLISION_Hashliner>GvsXXH.bat

G:\Lookupperorama_r13\COLLISION_Hashliner>if exist 200000000.KnightTours.txt goto Skip

G:\Lookupperorama_r13\COLLISION_Hashliner>BenchHashingLines_Gumbotron.exe 200000000.KnightTours.txt
Hashing 200,000,000 lines/keys, 128 bytes each, in RAM ...
The first key has hash:
f4b027e3ab
Total time: 4,907 clocks.
Total time: 3,186 clocks.
Total time: 3,162 clocks.
Total time: 3,160 clocks.
Total time: 3,185 clocks.
Total time: 3,174 clocks.
Total time: 3,160 clocks.
Total time: 3,164 clocks.
Total time: 3,167 clocks.
Total time: 3,174 clocks.
Total time: 3,183 clocks.
Total time: 3,174 clocks.
Total time: 3,165 clocks.
Total time: 3,172 clocks.
Total time: 3,161 clocks.
Total time: 3,170 clocks.
Total time: 3,165 clocks.
Total time (BEST RUN): 3,160 clocks.

G:\Lookupperorama_r13\COLLISION_Hashliner>BenchHashingLines_XXH128.exe 200000000.KnightTours.txt
Hashing 200,000,000 lines/keys, 128 bytes each, in RAM ...
The first key has hash:
c703b0bd77
Total time: 6,021 clocks.
Total time: 4,492 clocks.
Total time: 4,532 clocks.
Total time: 4,509 clocks.
Total time: 4,492 clocks.
Total time: 4,490 clocks.
Total time: 4,495 clocks.
Total time: 4,494 clocks.
Total time: 4,492 clocks.
Total time: 4,491 clocks.
Total time: 4,496 clocks.
Total time: 4,487 clocks.
Total time: 4,491 clocks.
Total time: 4,493 clocks.
Total time: 4,496 clocks.
Total time: 4,494 clocks.
Total time: 4,512 clocks.
Total time (BEST RUN): 4,487 clocks.

G:\Lookupperorama_r13\COLLISION_Hashliner>

On laptop 'Brutalitto' (Renoir AMD 4800H max turbo 4.3GHz, 64GB DDR4 3200MHz, Windows 10) Gumbotron_YMM hashes 3,266/2,484 = 1.31x faster than XXH128:

D:\Lookupperorama_r13\COLLISION_Hashliner>GvsXXH.bat

D:\Lookupperorama_r13\COLLISION_Hashliner>if exist 200000000.KnightTours.txt goto Skip

D:\Lookupperorama_r13\COLLISION_Hashliner>BenchHashingLines_Gumbotron.exe 200000000.KnightTours.txt
Hashing 200,000,000 lines/keys, 128 bytes each, in RAM ...
The first key has hash:
f4b027e3ab
Total time: 3,313 clocks.
Total time: 2,485 clocks.
Total time: 2,500 clocks.
Total time: 2,484 clocks.
Total time: 2,501 clocks.
Total time: 2,500 clocks.
Total time: 2,500 clocks.
Total time: 2,500 clocks.
Total time: 2,500 clocks.
Total time: 2,500 clocks.
Total time: 2,485 clocks.
Total time: 2,500 clocks.
Total time: 2,500 clocks.
Total time: 2,516 clocks.
Total time: 2,500 clocks.
Total time: 2,485 clocks.
Total time: 2,500 clocks.
Total time (BEST RUN): 2,484 clocks.

D:\Lookupperorama_r13\COLLISION_Hashliner>BenchHashingLines_XXH128.exe 200000000.KnightTours.txt
Hashing 200,000,000 lines/keys, 128 bytes each, in RAM ...
The first key has hash:
c703b0bd77
Total time: 4,204 clocks.
Total time: 3,359 clocks.
Total time: 3,360 clocks.
Total time: 3,360 clocks.
Total time: 3,359 clocks.
Total time: 3,360 clocks.
Total time: 3,360 clocks.
Total time: 3,359 clocks.
Total time: 3,375 clocks.
Total time: 3,266 clocks.
Total time: 3,349 clocks.
Total time: 3,360 clocks.
Total time: 3,375 clocks.
Total time: 3,360 clocks.
Total time: 3,390 clocks.
Total time: 3,376 clocks.
Total time: 3,375 clocks.
Total time (BEST RUN): 3,266 clocks.

D:\Lookupperorama_r13\COLLISION_Hashliner>
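
The speedups quoted for both laptops are plain ratios of the BEST RUN figures, and can be recomputed as sketched here. `best_run` and `speedup` are hypothetical helper names, not functions from the benchmark package; the unit of "clocks" does not matter because it cancels out in the ratio.

```c
#include <assert.h>

/* Smallest value among n run times, i.e. the "BEST RUN" line of the log. */
long best_run(const long *runs, int n)
{
    assert(runs != 0 && n > 0);
    long best = runs[0];
    for (int i = 1; i < n; i++)
        if (runs[i] < best)
            best = runs[i];
    return best;
}

/* How many times faster the candidate is than the baseline. */
double speedup(long baseline_clocks, long candidate_clocks)
{
    assert(baseline_clocks > 0 && candidate_clocks > 0);
    return (double)baseline_clocks / (double)candidate_clocks;
}
```

With the figures from the logs, `speedup(4487, 3160)` is about 1.42 (1.41 if truncated rather than rounded) on 'Compressionette' and `speedup(3266, 2484)` is about 1.31 on 'Brutalitto'.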

The actual AVX code used in the benchmark is given below. The main loop, i.e. the handler of multiples of 64 bytes, is 0x1ab-0x096+6 = 283 bytes long, in 45 instructions:

; mark_description "Intel(R) C++ Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140726";
; mark_description "-O3 -arch:avx -FeBenchHashingLines_Gumbotron.exe -FAcs -D_WIN32_ENVIRONMENT_ -D_N_DDAES -D_5";

DoubleDeuceAES_Gumbotron_YMM    PROC 
...
.B7.2::                         
  00096 c4 41 7a 6f 40 
        20               vmovdqu xmm8, XMMWORD PTR [32+r8]      
  0009c c4 41 7a 6f 38   vmovdqu xmm15, XMMWORD PTR [r8]        
  000a1 49 89 c1         mov r9, rax                            
  000a4 48 83 c2 c0      add rdx, -64                           
  000a8 48 ff c8         dec rax                                
  000ab c4 43 3d 18 68 
        30 01            vinsertf128 ymm13, ymm8, XMMWORD PTR [48+r8], 1 
  000b2 c4 e2 15 00 fb   vpshufb ymm7, ymm13, ymm3              
  000b7 c4 41 7e 7f 6d 
        20               vmovdqu YMMWORD PTR [32+r13], ymm13    
  000bd c4 63 fd 00 c7 
        4e               vpermq ymm8, ymm7, 78                  
  000c3 c4 41 7e 7f 45 
        40               vmovdqu YMMWORD PTR [64+r13], ymm8     
  000c9 c4 c2 71 dc 4d 
        40               vaesenc xmm1, xmm1, XMMWORD PTR [64+r13] 
  000cf c4 c2 71 dc 4d 
        50               vaesenc xmm1, xmm1, XMMWORD PTR [80+r13] 
  000d5 c4 43 05 18 70 
        10 01            vinsertf128 ymm14, ymm15, XMMWORD PTR [16+r8], 1 
  000dc 49 83 c0 40      add r8, 64                             
  000e0 c4 e2 0d 00 fb   vpshufb ymm7, ymm14, ymm3              
  000e5 c4 41 0d 60 fd   vpunpcklbw ymm15, ymm14, ymm13         
  000ea c4 41 0d 68 ed   vpunpckhbw ymm13, ymm14, ymm13         
  000ef c4 41 7e 7f 75 
        00               vmovdqu YMMWORD PTR [r13], ymm14       
  000f5 c4 41 7e 7f bd 
        80 00 00 00      vmovdqu YMMWORD PTR [128+r13], ymm15   
  000fe c4 41 7e 7f ad 
        a0 00 00 00      vmovdqu YMMWORD PTR [160+r13], ymm13   
  00107 c4 e3 fd 00 ff 
        4e               vpermq ymm7, ymm7, 78                  
  0010d c5 3d 60 f7      vpunpcklbw ymm14, ymm8, ymm7           
  00111 c4 c1 7e 7f 7d 
        60               vmovdqu YMMWORD PTR [96+r13], ymm7     
  00117 c5 bd 68 ff      vpunpckhbw ymm7, ymm8, ymm7            
  0011b c4 41 7e 7f b5 
        c0 00 00 00      vmovdqu YMMWORD PTR [192+r13], ymm14   
  00124 c4 c1 7e 7f bd 
        e0 00 00 00      vmovdqu YMMWORD PTR [224+r13], ymm7    
  0012d c4 c2 51 dc 6d 
        00               vaesenc xmm5, xmm5, XMMWORD PTR [r13]  
  00133 c4 c2 59 dc a5 
        80 00 00 00      vaesenc xmm4, xmm4, XMMWORD PTR [128+r13] 
  0013c c4 c2 51 dc 6d 
        10               vaesenc xmm5, xmm5, XMMWORD PTR [16+r13] 
  00142 c4 c2 69 dc 95 
        c0 00 00 00      vaesenc xmm2, xmm2, XMMWORD PTR [192+r13] 
  0014b c4 c2 59 dc a5 
        a0 00 00 00      vaesenc xmm4, xmm4, XMMWORD PTR [160+r13] 
  00154 c4 c2 51 dc 7d 
        20               vaesenc xmm7, xmm5, XMMWORD PTR [32+r13] 
  0015a c4 42 71 dc 45 
        60               vaesenc xmm8, xmm1, XMMWORD PTR [96+r13] 
  00160 c4 c2 69 dc 95 
        e0 00 00 00      vaesenc xmm2, xmm2, XMMWORD PTR [224+r13] 
  00169 c4 42 59 dc ad 
        90 00 00 00      vaesenc xmm13, xmm4, XMMWORD PTR [144+r13] 
  00172 c4 42 41 dc 7d 
        30               vaesenc xmm15, xmm7, XMMWORD PTR [48+r13] 
  00178 c4 c2 39 dc 4d 
        70               vaesenc xmm1, xmm8, XMMWORD PTR [112+r13] 
  0017e c4 42 69 dc b5 
        d0 00 00 00      vaesenc xmm14, xmm2, XMMWORD PTR [208+r13] 
  00187 c4 c2 11 dc a5 
        b0 00 00 00      vaesenc xmm4, xmm13, XMMWORD PTR [176+r13] 
  00190 c4 e2 01 dc e9   vaesenc xmm5, xmm15, xmm1              
  00195 c4 c2 09 dc 95 
        f0 00 00 00      vaesenc xmm2, xmm14, XMMWORD PTR [240+r13] 
  0019e c4 e2 51 dc fc   vaesenc xmm7, xmm5, xmm4               
  001a3 c4 e2 41 dc ea   vaesenc xmm5, xmm7, xmm2               
  001a8 4d 85 c9         test r9, r9                            
  001ab 0f 85 e5 fe ff 
        ff               jne .B7.2 
...
DoubleDeuceAES_Gumbotron_YMM ENDP

To me, the above fragment is the fastest when it comes to short keys (maybe up to 512 bytes) whose lengths are multiples of 64 bytes, isn't it?
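
A note on the `vpermq ..., 78` that follows each `vpshufb` in the loop: the immediate 78 (binary 01 00 11 10) selects the quadwords in the order 2, 3, 0, 1, so it swaps the two 128-bit halves of the register. Below is a scalar model of the instruction, with the hypothetical name `vpermq256`. If the shuffle mask held in `ymm3` is the byte-reversal mask in both lanes (an assumption, since its definition is not part of the excerpt), then the `vpshufb` + `vpermq` pair reverses all 32 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of VPERMQ ymm, ymm, imm8: output quadword i is the source
   quadword selected by bits [2i+1:2i] of imm8. */
void vpermq256(uint64_t dst[4], const uint64_t src[4], uint8_t imm8)
{
    uint64_t tmp[4];
    assert(dst != 0 && src != 0);
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(imm8 >> (2 * i)) & 3];
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i]; /* copy last, so dst may alias src */
}
```

For example, `vpermq256(out, in, 78)` with `in = {q0, q1, q2, q3}` leaves `out = {q2, q3, q0, q1}`.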

Okay, the AVX2 main loop (the handler of multiples of 64 bytes) is only 40 instructions (icc 19.0.0 -O3 -mavx2) and should be the fastest: https://godbolt.org/z/oaW3zGcv5

        vmovdqu   ymm10, YMMWORD PTR [rdi]                      #94.43
        dec       rax                                           #89.9
        vmovdqu   ymm11, YMMWORD PTR [32+rdi]                   #95.43
        vpshufb   ymm8, ymm10, ymm0                             #101.12
        vpshufb   ymm7, ymm11, ymm0                             #100.12
        vpunpcklbw ymm9, ymm10, ymm11                           #191.12
        vpunpckhbw ymm12, ymm10, ymm11                          #192.12
        vmovdqu   YMMWORD PTR [64+rsp], ymm9                    #191.4
        vmovdqu   YMMWORD PTR [96+rsp], ymm12                   #192.4
        vpermq    ymm14, ymm7, 78                               #104.9
        add       rsi, -64                                      #494.22
        vpermq    ymm15, ymm8, 78                               #105.9
        vpunpcklbw ymm13, ymm14, ymm15                          #253.12
        vpunpckhbw ymm7, ymm14, ymm15                           #254.12
        vmovdqu   YMMWORD PTR [rsp], ymm14                      #104.1
        vmovdqu   YMMWORD PTR [32+rsp], ymm15                   #105.1
        vmovdqu   YMMWORD PTR [128+rsp], ymm13                  #253.4
        vmovdqu   YMMWORD PTR [160+rsp], ymm7                   #254.4
        vaesenc   xmm1, xmm1, XMMWORD PTR [rdi]                 #452.12
        vaesenc   xmm3, xmm3, XMMWORD PTR [rsp]                 #454.13
        vaesenc   xmm4, xmm4, XMMWORD PTR [64+rsp]              #457.12
        vaesenc   xmm3, xmm3, XMMWORD PTR [16+rsp]              #464.13
        vaesenc   xmm1, xmm1, XMMWORD PTR [16+rdi]              #462.12
        vaesenc   xmm7, xmm1, XMMWORD PTR [32+rdi]              #472.12
        vaesenc   xmm6, xmm6, XMMWORD PTR [128+rsp]             #459.12
        vaesenc   xmm4, xmm4, XMMWORD PTR [96+rsp]              #467.12
        vaesenc   xmm8, xmm3, XMMWORD PTR [32+rsp]              #474.13
        vaesenc   xmm6, xmm6, XMMWORD PTR [160+rsp]             #469.12
        vaesenc   xmm9, xmm4, XMMWORD PTR [80+rsp]              #477.12
        vaesenc   xmm3, xmm8, XMMWORD PTR [48+rsp]              #484.13
        vaesenc   xmm11, xmm7, XMMWORD PTR [48+rdi]             #482.12
        add       rdi, 64                                       #89.19
        vaesenc   xmm10, xmm6, XMMWORD PTR [144+rsp]            #479.12
        vaesenc   xmm4, xmm9, XMMWORD PTR [112+rsp]             #487.12
        vaesenc   xmm12, xmm11, xmm3                            #491.12
        vaesenc   xmm6, xmm10, XMMWORD PTR [176+rsp]            #489.12
        vaesenc   xmm13, xmm12, xmm4                            #492.12
        vaesenc   xmm1, xmm13, xmm6                             #493.12
        cmp       rax, -1                                       #89.9
        jne       ..B2.4        # Prob 82%                      #89.9

I didn't have time to benchmark the AVX2 version, and currently I don't see how to speed it up any further...