Feedback: Speed/Collision Benchmark - Gumbotron vs XXH3

Enter Gumbotron (a.k.a. DoubleDeuceAES_128bits) ...

My admittedly not-so-thorough runs below show that XXH3_128bits is the fastest 128bit hasher known to me. Yet, in my view, Gumbotron can be the faster one when:

the keys are short, say up to 512 bytes;
the key lengths are multiples of 64 bytes (or 16);
it is further optimized (see the assembly at the bottom), for instance rewritten as a ZMM etude, since it fetches 64 bytes per loop iteration and _mm512_shuffle_epi8(__m512i a, __m512i b) is waiting in the wings.

Imagine a use case with one terakey (a trillion keys, yes). The keys are 64, 128 or 256 bytes long and have to be put into the leaves of B-trees (Bayer trees), so the size becomes nasty unless they are "compressed". For instance, 1 terakey x 64 bytes = 64TB, whereas shrunk to 16 bytes per key it is only 16TB. Therefore, I wrote Gumbotron.
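The storage arithmetic above can be checked with a few lines of C. This is only a sketch: the helper name storage_bytes is made up for illustration, and TB is taken as 10^12 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes needed to store `keys` entries of `bytes_per_key` bytes each.
   Hypothetical helper, not part of the benchmark package. */
static uint64_t storage_bytes(uint64_t keys, uint64_t bytes_per_key) {
    /* Guard against 64-bit overflow of the product. */
    assert(bytes_per_key == 0 || keys <= UINT64_MAX / bytes_per_key);
    return keys * bytes_per_key;
}
```

With a trillion keys, storage_bytes(1000000000000, 64) gives 64,000,000,000,000 bytes (64TB), while the same keys shrunk to 16-byte hashes need 16,000,000,000,000 bytes (16TB). For 256-byte keys the ratio is 16:1.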

The need for speed and "lossy compression" (i.e. shrinking keys) led me to put a lookupper and a shrinker under one hood, namely the 128bit hasher DoubleDeuceAES_128bits. There is no fundamental distinction between a lookupper and a shrinker: both are hashers, but the former serves hashtable lookups whereas the latter serves as a checksum standing in for the key.
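To make the two roles concrete, here is a minimal sketch of one 128bit hash value being used both ways. The hasher toy_hash128 is a stand-in built from two FNV-1a 64bit lanes (the second seed is arbitrary) so that the example compiles without AES-NI. It is NOT Gumbotron, and the names hash128_t, slot_index and shrink_key are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } hash128_t;

/* Stand-in 128bit hasher: two FNV-1a 64bit lanes. NOT Gumbotron. */
static hash128_t toy_hash128(const uint8_t *key, size_t len) {
    hash128_t h = { 0xcbf29ce484222325ULL, 0x84222325cbf29ce4ULL };
    for (size_t i = 0; i < len; i++) {
        h.lo = (h.lo ^ key[i]) * 0x100000001b3ULL;
        h.hi = (h.hi ^ (uint8_t)(key[i] + 1)) * 0x100000001b3ULL;
    }
    return h;
}

/* "Lookupper" role: take the low `bits` bits as a slot in a 2^bits table. */
static uint32_t slot_index(hash128_t h, unsigned bits) {
    assert(bits >= 1 && bits <= 32);
    return (uint32_t)(h.lo & ((1ULL << bits) - 1));
}

/* "Shrinker" role: store the 16 hash bytes in place of the original key. */
static void shrink_key(hash128_t h, uint8_t out[16]) {
    for (int i = 0; i < 8; i++) {
        out[i]     = (uint8_t)(h.lo >> (8 * i));
        out[8 + i] = (uint8_t)(h.hi >> (8 * i));
    }
}
```

A 64-byte key hashed once yields both a slot (slot_index(h, 26) for a 26bit table) and a 16-byte replacement for the key (shrink_key), which is the "one hood" idea.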

The benchmark package, which allows reproducing everything here, is freely downloadable with all the sources and binaries: www.sanmayce.com/Lookupperorama_r11.zip www.sanmayce.com/Gumbotron_logo.pdf

The benchmark has two parts:

Lookupperorama - non-synthetic stress test for speed and collisions (indexes all positions of a file);
COLLISION_Hashliner - synthetic stress test for collisions (hashes all lines of a given file and reports duplicate hashes).

As a quick test I randomly chose two testfiles - Cihai and Judaica.

Testfile: KAZE_(Dictionary_SpecificationLanguage(ABBYY_Software_House))_Hanyu_Cihai_newSea-of-Words(Zho-Zho).dsl (42,920,232 bytes)
Testmachine: laptop 'Brutalitto' AMD 4800H max turbo 4.3GHz, 64GB DDR4 3200MHz, Windows 10
Hashtable: 26bit, i.e. 67,108,864 slots, more than the 42,920,232 positions in the file, since for a perfect hasher there should be more slots than keys (the keys at each position could all be unique)
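The table-size rule (smallest power of two exceeding the number of positions) can be sketched as follows. The function name table_bits_for is an assumption for illustration; the benchmark itself may pick the size differently.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest `bits` such that 2^bits > positions, i.e. strictly more slots
   than potential keys. Hypothetical helper, not taken from the package. */
static unsigned table_bits_for(uint64_t positions) {
    assert(positions < (1ULL << 63));
    unsigned bits = 0;
    while ((1ULL << bits) <= positions)
        bits++;
    return bits;
}
```

For the 42,920,232-byte Cihai file this gives 26 (67,108,864 slots), and for a 107,784,192-byte file it gives 27 (134,217,728 slots).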

+--------------------------+-----------------------------+----------------------------------+---------------------------------+
| Hasher,                  | Number Of Hash Collisions = | RAW Hashing Speed (in one pass,  | Linear Hashing Speed,           |
| GCC-10.1 compiler        | Distinct Keys -             | at each position) for keys       | the whole file as one key       |
| -O3 -mavx                | Number Of Trees             | 4,6,8,10,12,14,16,18,36,64 bytes |                                 |
+--------------------------+-----------------------------+----------------------------------+---------------------------------+
| XXH3_64bits v0.8.0       |                  41,108,202 |      295,187,276 KEYS-PER-SECOND | 21,786,919,796 BYTES-PER-SECOND |
| CRC32C (_mm_crc32_u32)   |                  41,109,478 |      274,426,023 KEYS-PER-SECOND |  5,241,205,519 BYTES-PER-SECOND |
| XXH3_128bits v0.8.0      |                  41,111,196 |      214,493,903 KEYS-PER-SECOND | 20,331,706,300 BYTES-PER-SECOND |
| SHA3-224                 |                  41,111,291 |          153,854 KEYS-PER-SECOND |     22,319,413 BYTES-PER-SECOND |
| wyhash final             |                  41,112,870 |      449,897,589 KEYS-PER-SECOND | 15,086,197,539 BYTES-PER-SECOND |
| DoubleDeuceAES_Gumbotron |                  41,117,352 |      204,869,832 KEYS-PER-SECOND |  8,690,065,195 BYTES-PER-SECOND |
| FNV1A_Pippip             |                  41,488,327 |      449,897,589 KEYS-PER-SECOND |  8,101,214,043 BYTES-PER-SECOND |
+--------------------------+-----------------------------+----------------------------------+---------------------------------+

Note1: The second column holds the cumulative number of collisions, i.e. the collisions for all key lengths 4..64 summed up.
Note2: Folding of those 128 bits should lessen the collisions.
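As for Note2, one common way to fold a wide hash is XOR-folding, so that every bit of the 128 contributes to the slot index instead of only the lowest ones. The sketch below is an assumption about what such folding could look like, not the code used in the benchmark; the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Fold 128 bits (given as two 64bit halves) down to 64 bits. */
static uint64_t fold128_to_64(uint64_t lo, uint64_t hi) {
    return lo ^ hi;
}

/* XOR together successive `bits`-wide chunks of h, giving a `bits`-wide value. */
static uint32_t fold64_to_bits(uint64_t h, unsigned bits) {
    assert(bits >= 1 && bits <= 32);
    const uint64_t mask = (1ULL << bits) - 1;
    uint64_t r = 0;
    while (h != 0) {
        r ^= h & mask;
        h >>= bits;
    }
    return (uint32_t)r;
}
```

A slot for the 26bit table would then be fold64_to_bits(fold128_to_64(lo, hi), 26).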

Testfile: TERAPIG_EncyclopaediaJudaica(in_22_volumes)_TXT.tar (107,784,192 bytes)
Testmachine: laptop 'Brutalitto' AMD 4800H max turbo 4.3GHz, 64GB DDR4 3200MHz, Windows 10
Hashtable: 27bit, i.e. 134,217,728 slots, more than the 107,784,192 positions in the file, since for a perfect hasher there should be more slots than keys (the keys at each position could all be unique)

+--------------------------+-----------------------------+----------------------------------+---------------------------------+
| Hasher,                  | Number Of Hash Collisions = | RAW Hashing Speed (in one pass,  | Linear Hashing Speed,           |
| GCC-10.1 compiler        | Distinct Keys -             | at each position) for keys       | the whole file as one key       |
| -O3 -mavx                | Number Of Trees             | 4,6,8,10,12,14,16,18,36,64 bytes |                                 |
+--------------------------+-----------------------------+----------------------------------+---------------------------------+
| DoubleDeuceAES_Gumbotron |                 135,752,271 |      204,640,573 KEYS-PER-SECOND |  8,742,330,440 BYTES-PER-SECOND |
| XXH3_128bits v0.8.0      |                 135,756,978 |      212,843,977 KEYS-PER-SECOND | 22,539,563,362 BYTES-PER-SECOND |
| wyhash final             |                 135,762,454 |      442,100,861 KEYS-PER-SECOND | 14,959,638,029 BYTES-PER-SECOND |
| XXH3_64bits v0.8.0       |                 135,763,366 |      290,994,033 KEYS-PER-SECOND | 22,464,400,166 BYTES-PER-SECOND |
| CRC32C (_mm_crc32_u32)   |                 135,764,628 |      252,599,460 KEYS-PER-SECOND |  5,241,402,061 BYTES-PER-SECOND |
| FNV1A_Pippip             |                 135,768,302 |      450,602,801 KEYS-PER-SECOND |  8,048,401,433 BYTES-PER-SECOND |
| SHA3-224                 |                 135,771,905 |          153,841 KEYS-PER-SECOND |     22,246,479 BYTES-PER-SECOND |
+--------------------------+-----------------------------+----------------------------------+---------------------------------+

Another twist: in order to test collisions, here comes my testbed of 128-byte-long keys. It was meant for 1 trillion keys, but since not enough memory is available, it was run with 1 billion.

Testset: "A billion Knight-Tours variants (each KT with 256 variants, the KT itself omitted) - each 128 bytes long"
Testfile: 1000000000.KnightTours.txt (130,000,000,000 bytes)

The name of the game: hash all lines and take only the first 5, 6, 7 or 8 bytes of each hash.

+------------------------+----------------------------------------------------------+
| Hasher                 |                          Collisions within first 5 bytes |
+------------------------+----------------------------------------------------------+
| XXH3_64bits v0.8.0     | 1,000,000,000 - 999,545,727 distinct lines =     454,273 |
| DoubleDeuceAES_128bits | 1,000,000,000 - 999,545,796 distinct lines =     454,204 |  
+------------------------+----------------------------------------------------------+

+------------------------+----------------------------------------------------------+
| Hasher                 |                          Collisions within first 6 bytes |
+------------------------+----------------------------------------------------------+
| XXH3_64bits v0.8.0     | 1,000,000,000 - 999,998,214 distinct lines =       1,786 |
| DoubleDeuceAES_128bits | 1,000,000,000 - 999,998,213 distinct lines =       1,787 |
+------------------------+----------------------------------------------------------+

+------------------------+----------------------------------------------------------+
| Hasher                 |                          Collisions within first 7 bytes |
+------------------------+----------------------------------------------------------+
| XXH3_64bits v0.8.0     | 1,000,000,000 - 999,999,989 distinct lines =          11 |
| DoubleDeuceAES_128bits | 1,000,000,000 - 999,999,994 distinct lines =           6 | 
+------------------------+----------------------------------------------------------+

+------------------------+----------------------------------------------------------+
| Hasher                 |                          Collisions within first 8 bytes |
+------------------------+----------------------------------------------------------+
| XXH3_64bits v0.8.0     | 1,000,000,000 - 1,000,000,000 distinct lines =         0 |
| DoubleDeuceAES_128bits | 1,000,000,000 - 1,000,000,000 distinct lines =         0 | 
+------------------------+----------------------------------------------------------+
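As a sanity check on the four tables above, the birthday approximation gives the number of collisions an ideal, uniformly distributed hash would produce: roughly n^2 / (2 * 2^(8k)) for n keys and k retained bytes. The sketch below uses that simplified formula (it ignores higher-order terms), and the function name expected_collisions is an assumption.

```c
#include <assert.h>

/* Birthday approximation: expected collisions among n keys when only
   `bytes` bytes of an ideal hash are kept, i.e. n*n / (2 * 256^bytes). */
static double expected_collisions(double n, unsigned bytes) {
    assert(bytes >= 1 && bytes <= 16);
    double m = 1.0;                 /* number of possible hash values */
    for (unsigned i = 0; i < bytes; i++)
        m *= 256.0;
    return n * n / (2.0 * m);
}
```

For n = 1,000,000,000 this gives about 454,747 for 5 bytes, about 1,776 for 6 bytes, about 6.9 for 7 bytes and about 0.03 for 8 bytes, so both hashers land close to what an ideal hash would do on this testset. Under the same idealized assumption, a trillion keys with all 16 bytes kept would give an expectation of roughly 1.5e-15 collisions.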

This is what the console looks like:

C:\test\COLLISION_Hashliner>GENERATE_Xmillion_Knight-Tours.bat 1000000000
Generating 1000000000 Knight-Tours and dumping them into file ...

C:\test\COLLISION_Hashliner>Knight-Tour_FNV1A_YoshimitsuTRIADii_vs_CRC32_TRISMUS.exe a8 1000000000  1>1000000000.KnightTours.txt

C:\test\COLLISION_Hashliner>bench7.bat 1000000000.KnightTours.txt

C:\test\COLLISION_Hashliner>Hashliner_XXH3_dump7byteshash.exe 1000000000.KnightTours.txt  1>1000000000.KnightTours.txt.xxh3.txt

C:\test\COLLISION_Hashliner>Hashliner_DDAES_dump7byteshash.exe 1000000000.KnightTours.txt  1>1000000000.KnightTours.txt.DDAES.txt

C:\test\COLLISION_Hashliner>Sandokan_QuickSortExternal_Deduplicated_4+GB_64bit_Intel.exe 1000000000.KnightTours.txt.xxh3.txt /fast /descend 3000
Sandokan_QuickSortExternal_4+GB r.3+, written by Kaze, using Bill Durango's Quicksort source.
Size of input file: 16,000,000,000
Counting lines ...
Lines encountered: 1,000,000,000
Longest line (including CR if present): 15
Allocated memory for pointers-to-lines in MB: 7629
Assigning pointers ...
sizeof(int), sizeof(void*): 4, 8
Trying to allocate memory for the file itself in MB: 15258 ... OK! Get on with fast internal accesses.
Uploading ...
Sorting 1,000,000,000 Pointers ...
Quicksort (Insertionsort for small blocks) commenced ...
/ RightEnd: 000,328,304,267; NumberOfSplittings: 0,114,284,204; Done: 100% ...
NumberOfComparisons: 34,310,536,510
The time to sort 1,000,000,000 items via Quicksort+Insertionsort was 2,848,402 clocks.
Performance: 12,045,534 Comparisons_128B_long-Per-Second i.e 24,091,068 RandomReads_128B_long-Per-Second.
Dumping the sorted data (Regime=2)...
\ Done 100% ...
Dumped 1,000,000,000 lines.
OK! Incoming and resultant file's sizes match.
Dumping the sorted data [deduplicated] ...
Dumped 999,999,989 distinct lines.
Dump time: 460,940 clocks.
Total time: 3,347,265 clocks.
Performance: 4,780 bytes/clock.
Done successfully.

C:\test\COLLISION_Hashliner>sort /R QuickSortExternal_4+GB.distinct.txt  1>1000000000.KnightTours.txt.xxh3.7bytes.2orABOVE.txt

C:\test\COLLISION_Hashliner>Sandokan_QuickSortExternal_Deduplicated_4+GB_64bit_Intel.exe 1000000000.KnightTours.txt.DDAES.txt /fast /descend 3000
Sandokan_QuickSortExternal_4+GB r.3+, written by Kaze, using Bill Durango's Quicksort source.
Size of input file: 16,000,000,000
Counting lines ...
Lines encountered: 1,000,000,000
Longest line (including CR if present): 15
Allocated memory for pointers-to-lines in MB: 7629
Assigning pointers ...
sizeof(int), sizeof(void*): 4, 8
Trying to allocate memory for the file itself in MB: 15258 ... OK! Get on with fast internal accesses.
Uploading ...
Sorting 1,000,000,000 Pointers ...
Quicksort (Insertionsort for small blocks) commenced ...
- RightEnd: 000,759,555,061; NumberOfSplittings: 0,114,282,509; Done: 100% ...
NumberOfComparisons: 34,551,039,764
The time to sort 1,000,000,000 items via Quicksort+Insertionsort was 2,896,271 clocks.
Performance: 11,929,487 Comparisons_128B_long-Per-Second i.e 23,858,974 RandomReads_128B_long-Per-Second.
Dumping the sorted data (Regime=2)...
\ Done 100% ...
Dumped 1,000,000,000 lines.
OK! Incoming and resultant file's sizes match.
Dumping the sorted data [deduplicated] ...
Dumped 999,999,994 distinct lines.
Dump time: 458,406 clocks.
Total time: 3,393,196 clocks.
Performance: 4,715 bytes/clock.
Done successfully.

C:\test\COLLISION_Hashliner>sort /R QuickSortExternal_4+GB.distinct.txt  1>1000000000.KnightTours.txt.DDAES.7bytes.2orABOVE.txt

C:\test\COLLISION_Hashliner>dir *7b*

15-Aug-21  12:28               156 1000000000.KnightTours.txt.DDAES.7bytes.2orABOVE.txt
15-Aug-21  11:32               286 1000000000.KnightTours.txt.xxh3.7bytes.2orABOVE.txt

C:\test\COLLISION_Hashliner>type 1000000000.KnightTours.txt.xxh3.7bytes.2orABOVE.txt
0,000,002       f84627e722e85e
0,000,002       f0039d0c4e4fce
0,000,002       c87c64d97df0e7
0,000,002       bb4344a5546572
0,000,002       af2f628f4b3ffb
0,000,002       a8cb8675c94610
0,000,002       a742cf83948622
0,000,002       657cb9dff2d962
0,000,002       436aef7ab54ce7
0,000,002       270fcde0563670
0,000,002       0b533d70915c51

C:\test\COLLISION_Hashliner>

Note: The first column holds the number of occurrences of the hash that follows it.
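The sort-then-deduplicate procedure in the log boils down to "collisions = lines - distinct lines", e.g. 1,000,000,000 - 999,999,989 = 11 for XXH3 at 7 bytes. Here is a small in-memory sketch of that counting, assuming the truncated hashes fit into uint64_t values; the real run used the external sorter shown above, and count_collisions is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint64_t that avoids overflow from subtraction. */
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts the truncated hashes in place and returns total - distinct. */
static size_t count_collisions(uint64_t *hashes, size_t n) {
    if (n == 0)
        return 0;
    qsort(hashes, n, sizeof *hashes, cmp_u64);
    size_t distinct = 1;
    for (size_t i = 1; i < n; i++)
        if (hashes[i] != hashes[i - 1])
            distinct++;
    assert(distinct <= n);
    return n - distinct;
}
```

A hash occurring twice contributes one collision, and one occurring three times contributes two, which matches how the distinct-lines count is subtracted in the tables.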

TO-DO: I wish to run it with 1,000,000,000,000 keys (did it already, but in another game); it could help evaluate how prone a 128bit hash is to collision(s)...

And finally the function itself: https://godbolt.org/z/1o38b4W1K

// https://godbolt.org/ [
/*
#include <stdlib.h>
#include <stdint.h> //uint64_t needed
#include <string.h> 
#include <smmintrin.h> //SSE4.1 intrinsics
#include <wmmintrin.h>
void SlowCopy128bit (const char *SOURCE, char *TARGET) { _mm_storeu_si128((__m128i *)(TARGET), _mm_loadu_si128((const __m128i *)(SOURCE))); }
unsigned char DDAES[16];
// 16-byte alignment is required because the table is read through a __m128i pointer (Mumbotron[length]):
static const uint8_t VectorsNeedNonVAriable1[256] __attribute__((aligned(16))) =
{
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x00,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00
};
static const __m128i *Mumbotron = (__m128i *) VectorsNeedNonVAriable1;
// 16-byte alignment is required because the table is read through a __m128i pointer (Jumbotron[length]):
static const uint8_t VectorsNeedNonVAriable2[256] __attribute__((aligned(16))) =
{
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF
}; 
static const __m128i *Jumbotron = (__m128i *) VectorsNeedNonVAriable2;
// Written by Sanmayce, inspired by J. Andrew Rogers's https://github.com/jandrewrogers/AquaHash/blob/master/aquahash.h
// This hash function serves two ... functions - useful for table lookups and to shrink keys (usually 64...256 bytes in length) down to 16 bytes:
// When the key length is not a multiple of 16 (so padding is needed), Pippip's approach of reading past the end is used here (the "dirty" sentinel-like style, or more like padding):
void DoubleDeuceAES_Gumbotron(const uint8_t *buffer, size_t length) {
    size_t i, Cycles;
    __m128i hashA = _mm_set_epi64x(0x6c62272e07bb0142, 0x62b821756295c58d); // 0x6c62272e07bb014262b821756295c58d // _mm_setzero_si128();
    __m128i hashB = _mm_set_epi64x(0xdd268dbcaac55036, 0x2d98c384c4e576cc); // 0xdd268dbcaac550362d98c384c4e576ccc8b1536847b6bbb31023b4c8caee0535 // FNV offset basis
    __m128i hashC = _mm_set_epi64x(0xc8b1536847b6bbb3, 0x1023b4c8caee0535); // 0xdd268dbcaac550362d98c384c4e576ccc8b1536847b6bbb31023b4c8caee0535 // FNV offset basis
    __m128i hashD = _mm_setzero_si128();
    __m128i a0,a1,a2,a3; // Instead of this chunkenization, ZMM houses the 4 XMMs, if there is shuffle across all the 512bits, use it. There is, but __m256i _mm256_shuffle_epi8(__m256i a, __m256i b) is more handy.
    __m128i b0,b1,b2,b3;
    __m128i c0,c1,c2,c3;
    __m128i d0,d1,d2,d3;
    __m128i tmp0,tmp1,tmp2,tmp3;
    __m128i ReverseMask =   _mm_set_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m128i PartialInterleavingMask1 = _mm_set_epi8(0x80,7,0x80,6,0x80,5,0x80,4,0x80,3,0x80,2,0x80,1,0x80,0);
    __m128i PartialInterleavingMask2 = _mm_set_epi8(0x80,0xf,0x80,0xe,0x80,0xd,0x80,0xc,0x80,0xb,0x80,0xa,0x80,9,0x80,8); 
    __m128i PartialInterleavingMask3 = _mm_set_epi8(7,0x80,6,0x80,5,0x80,4,0x80,3,0x80,2,0x80,1,0x80,0,0x80);
    __m128i PartialInterleavingMask4 = _mm_set_epi8(0xf,0x80,0xe,0x80,0xd,0x80,0xc,0x80,0xb,0x80,0xa,0x80,9,0x80,8,0x80);
    const __m128i *ptr128a, *ptr128b, *ptr128c, *ptr128d;

    __m128i AgainstRules, GumbotronREVER, GumbotronINTER, Gumbotron, GumbotronREVERINTER;
    const __m128i *ptr128; 
    __m128i InterleaveMask =   _mm_set_epi8(15,7,14,6,13,5,12,4,11,3,10,2,9,1,8,0);

    if (length >= 64) {
        Cycles = length/64;
        for(; Cycles--; buffer += 64) {
            a0 = _mm_loadu_si128((__m128i *)(buffer+0*16)); 
            a1 = _mm_loadu_si128((__m128i *)(buffer+1*16)); 
            a2 = _mm_loadu_si128((__m128i *)(buffer+2*16)); 
            a3 = _mm_loadu_si128((__m128i *)(buffer+3*16)); 
            b0 = _mm_shuffle_epi8 (a3, ReverseMask); 
            b1 = _mm_shuffle_epi8 (a2, ReverseMask); 
            b2 = _mm_shuffle_epi8 (a1, ReverseMask); 
            b3 = _mm_shuffle_epi8 (a0, ReverseMask); 
            tmp0 = _mm_shuffle_epi8 (a0, PartialInterleavingMask1);
            tmp1 = _mm_shuffle_epi8 (a0, PartialInterleavingMask2);
            tmp2 = _mm_shuffle_epi8 (a2, PartialInterleavingMask3);
            tmp3 = _mm_shuffle_epi8 (a2, PartialInterleavingMask4);
            c0 = _mm_or_si128 (tmp0, tmp2);
            c1 = _mm_or_si128 (tmp1, tmp3);
            tmp0 = _mm_shuffle_epi8 (a1, PartialInterleavingMask1);
            tmp1 = _mm_shuffle_epi8 (a1, PartialInterleavingMask2);
            tmp2 = _mm_shuffle_epi8 (a3, PartialInterleavingMask3);
            tmp3 = _mm_shuffle_epi8 (a3, PartialInterleavingMask4);
            c2 = _mm_or_si128 (tmp0, tmp2);
            c3 = _mm_or_si128 (tmp1, tmp3);
            tmp0 = _mm_shuffle_epi8 (b0, PartialInterleavingMask1);
            tmp1 = _mm_shuffle_epi8 (b0, PartialInterleavingMask2);
            tmp2 = _mm_shuffle_epi8 (b2, PartialInterleavingMask3);
            tmp3 = _mm_shuffle_epi8 (b2, PartialInterleavingMask4);
            d0 = _mm_or_si128 (tmp0, tmp2);
            d1 = _mm_or_si128 (tmp1, tmp3);
            tmp0 = _mm_shuffle_epi8 (b1, PartialInterleavingMask1);
            tmp1 = _mm_shuffle_epi8 (b1, PartialInterleavingMask2);
            tmp2 = _mm_shuffle_epi8 (b3, PartialInterleavingMask3);
            tmp3 = _mm_shuffle_epi8 (b3, PartialInterleavingMask4);
            d2 = _mm_or_si128 (tmp0, tmp2);
            d3 = _mm_or_si128 (tmp1, tmp3);

            hashA = _mm_aesenc_si128(hashA, a0);
            hashB = _mm_aesenc_si128(hashB, b0);
            hashC = _mm_aesenc_si128(hashC, c0);
            hashD = _mm_aesenc_si128(hashD, d0);

            hashA = _mm_aesenc_si128(hashA, a1);
            hashB = _mm_aesenc_si128(hashB, b1);
            hashC = _mm_aesenc_si128(hashC, c1);
            hashD = _mm_aesenc_si128(hashD, d1);

            hashA = _mm_aesenc_si128(hashA, a2);
            hashB = _mm_aesenc_si128(hashB, b2);
            hashC = _mm_aesenc_si128(hashC, c2);
            hashD = _mm_aesenc_si128(hashD, d2);

            hashA = _mm_aesenc_si128(hashA, a3);
            hashB = _mm_aesenc_si128(hashB, b3);
            hashC = _mm_aesenc_si128(hashC, c3);
            hashD = _mm_aesenc_si128(hashD, d3);

            hashA = _mm_aesenc_si128(hashA, hashB);
            hashA = _mm_aesenc_si128(hashA, hashC);
            hashA = _mm_aesenc_si128(hashA, hashD);
            length = length - 64;
        }
    }

    ptr128 = (__m128i *)buffer;
    if (length >=16) {
        Cycles = length/16;
        for(; Cycles--; buffer += 16) {
            AgainstRules = _mm_loadu_si128(ptr128++);
            GumbotronREVER = _mm_shuffle_epi8 (AgainstRules, ReverseMask);
            GumbotronINTER = _mm_shuffle_epi8 (AgainstRules, InterleaveMask);
            GumbotronREVERINTER = _mm_shuffle_epi8 (GumbotronREVER, InterleaveMask);
            hashA = _mm_aesenc_si128(hashA, AgainstRules);
            hashB = _mm_aesenc_si128(hashB, GumbotronREVER);
            hashC = _mm_aesenc_si128(hashC, GumbotronINTER);
            hashD = _mm_aesenc_si128(hashD, GumbotronREVERINTER);
            hashA = _mm_aesenc_si128(hashA, hashB);
            hashA = _mm_aesenc_si128(hashA, hashC);
            hashA = _mm_aesenc_si128(hashA, hashD);
            length = length - 16;
        }
    } 
    // Pippip's approach of reading past the end is used here (the "dirty" sentinel-like style, or more like padding):
    if (length&(16-1)) {
        AgainstRules = _mm_loadu_si128(ptr128);     
        //AgainstRules = _mm_srli_si128 (AgainstRules, 16-length); // catastrophic error: Intrinsic parameter must be an immediate value
        AgainstRules = _mm_and_si128 (AgainstRules, Mumbotron[length]);
        //Gumbotron = _mm_slli_si128 (Gumbotron, 16-length); // catastrophic error: Intrinsic parameter must be an immediate value
        Gumbotron = _mm_and_si128 (hashB, Jumbotron[length]);
        AgainstRules = _mm_or_si128 (AgainstRules, Gumbotron);

        GumbotronREVER = _mm_shuffle_epi8 (AgainstRules, ReverseMask);
        GumbotronINTER = _mm_shuffle_epi8 (AgainstRules, InterleaveMask);
        GumbotronREVERINTER = _mm_shuffle_epi8 (GumbotronREVER, InterleaveMask);
            hashA = _mm_aesenc_si128(hashA, AgainstRules);
            hashB = _mm_aesenc_si128(hashB, GumbotronREVER);
            hashC = _mm_aesenc_si128(hashC, GumbotronINTER);
            hashD = _mm_aesenc_si128(hashD, GumbotronREVERINTER);
            hashA = _mm_aesenc_si128(hashA, hashB);
            hashA = _mm_aesenc_si128(hashA, hashC);
            hashA = _mm_aesenc_si128(hashA, hashD);
    }
    SlowCopy128bit( (const char *)(&hashA), (char *)&DDAES[0]);
}
*/
// https://godbolt.org/ ]
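For readers without the intrinsics at their fingertips, here is a scalar model of the two byte shuffles and of the tail merge used in the function above. It is a sketch for understanding, not a replacement: the names shuffle16, REVERSE_MASK, INTERLEAVE_MASK and merge_tail are made up, and the masks are written in memory order (byte 0 first), whereas _mm_set_epi8 lists them from byte 15 down to byte 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of _mm_shuffle_epi8: a mask byte with bit 7 set yields 0,
   otherwise its low 4 bits select a source byte. */
static void shuffle16(const uint8_t src[16], const uint8_t mask[16], uint8_t out[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* ReverseMask: out[i] = src[15 - i]. */
static const uint8_t REVERSE_MASK[16] =
    { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 };

/* InterleaveMask: out[2j] = src[j], out[2j+1] = src[8 + j]. */
static const uint8_t INTERLEAVE_MASK[16] =
    { 0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15 };

/* Tail merge done by the Mumbotron/Jumbotron masks: the first `len` bytes
   come from the key tail, the remaining bytes from the pad (hashB in the
   original). Unlike the SIMD load, this reads only `len` bytes of data. */
static void merge_tail(const uint8_t *data, size_t len, const uint8_t pad[16], uint8_t out[16]) {
    assert(len >= 1 && len <= 15);
    for (size_t i = 0; i < 16; i++)
        out[i] = (i < len) ? data[i] : pad[i];
}
```

Each 16-byte chunk is thus fed to the four AES lanes in four byte orders: as is, reversed, interleaved, and reversed then interleaved.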

// gcc 10.1 -O3 -mavx -maes

SlowCopy128bit(char const*, char*):
        vmovdqu xmm0, XMMWORD PTR [rdi]
        vmovdqu XMMWORD PTR [rsi], xmm0
        ret
DoubleDeuceAES_Gumbotron(unsigned char const*, unsigned long):
        sub     rsp, 96
        cmp     rsi, 63
        jbe     .L9
        vmovdqa xmm5, XMMWORD PTR .LC0[rip]
        mov     rax, rsi
        mov     rdx, rsi
        vpxor   xmm6, xmm6, xmm6
        and     rax, -64
        vmovdqa xmm10, XMMWORD PTR .LC1[rip]
        vmovdqa xmm12, XMMWORD PTR .LC2[rip]
        shr     rdx, 6
        vmovdqa xmm13, XMMWORD PTR .LC3[rip]
        vmovdqa xmm15, XMMWORD PTR .LC6[rip]
        add     rax, rdi
        vmovdqa XMMWORD PTR [rsp-120], xmm6
        vmovdqa xmm14, XMMWORD PTR .LC7[rip]
        vmovdqa XMMWORD PTR [rsp-104], xmm5
.L5:
        vmovdqu xmm0, XMMWORD PTR [rdi]
        vmovdqu xmm11, XMMWORD PTR [rdi+32]
        add     rdi, 64
        vmovdqu xmm4, XMMWORD PTR [rdi-32]
        vmovdqu xmm7, XMMWORD PTR [rdi-16]
        vpshufb xmm2, xmm0, XMMWORD PTR .LC5[rip]
        vpshufb xmm9, xmm11, xmm14
        vpshufb xmm3, xmm0, XMMWORD PTR .LC4[rip]
        vpshufb xmm8, xmm4, xmm13
        vpshufb xmm4, xmm0, xmm13
        vmovdqa XMMWORD PTR [rsp-72], xmm2
        vpshufb xmm2, xmm11, xmm15
        vmovdqu xmm11, XMMWORD PTR [rdi-48]
        vpshufb xmm6, xmm4, xmm15
        vpshufb xmm1, xmm7, xmm13
        vmovdqa XMMWORD PTR [rsp-56], xmm9
        vmovdqa XMMWORD PTR [rsp+56], xmm6
        vmovdqu xmm9, XMMWORD PTR [rdi-16]
        vpshufb xmm6, xmm4, xmm14
        vpshufb xmm5, xmm8, XMMWORD PTR .LC4[rip]
        vmovdqa XMMWORD PTR [rsp-88], xmm3
        vmovdqu xmm3, XMMWORD PTR [rdi-48]
        vpor    xmm2, xmm2, XMMWORD PTR [rsp-88]
        vpshufb xmm11, xmm11, XMMWORD PTR .LC4[rip]
        vmovdqa XMMWORD PTR [rsp-40], xmm11
        vmovdqu xmm11, XMMWORD PTR [rdi-16]
        vpshufb xmm9, xmm9, xmm15
        vmovdqu xmm7, XMMWORD PTR [rdi-48]
        vmovdqa XMMWORD PTR [rsp+72], xmm6
        vaesenc xmm12, xmm12, xmm1
        vaesenc xmm2, xmm10, xmm2
        vmovdqa xmm6, XMMWORD PTR [rsp-104]
        vmovdqa XMMWORD PTR [rsp-8], xmm9
        vpshufb xmm11, xmm11, xmm14
        vpshufb xmm7, xmm7, xmm13
        vpshufb xmm3, xmm3, XMMWORD PTR .LC5[rip]
        vaesenc xmm0, xmm6, xmm0
        vmovdqa XMMWORD PTR [rsp-24], xmm3
        vpshufb xmm9, xmm1, XMMWORD PTR .LC5[rip]
        vmovdqa xmm6, XMMWORD PTR [rsp-120]
        vaesenc xmm0, xmm0, XMMWORD PTR [rdi-48]
        vmovdqa XMMWORD PTR [rsp+8], xmm11
        vaesenc xmm12, xmm12, xmm8
        vpshufb xmm3, xmm1, XMMWORD PTR .LC4[rip]
        vmovdqa XMMWORD PTR [rsp+24], xmm9
        vpshufb xmm11, xmm7, xmm15
        vpshufb xmm9, xmm7, xmm14
        vaesenc xmm12, xmm12, xmm7
        vmovdqa XMMWORD PTR [rsp+40], xmm5
        vpor    xmm3, xmm3, xmm11
        vaesenc xmm12, xmm12, xmm4
        vpshufb xmm5, xmm8, XMMWORD PTR .LC5[rip]
        vmovdqa xmm1, XMMWORD PTR [rsp-56]
        vpor    xmm10, xmm1, XMMWORD PTR [rsp-72]
        vaesenc xmm3, xmm6, xmm3
        vaesenc xmm0, xmm0, XMMWORD PTR [rdi-32]
        vmovdqa xmm7, XMMWORD PTR [rsp-8]
        vpor    xmm9, xmm9, XMMWORD PTR [rsp+24]
        vaesenc xmm0, xmm0, XMMWORD PTR [rdi-16]
        vaesenc xmm2, xmm2, xmm10
        vpor    xmm10, xmm7, XMMWORD PTR [rsp-40]
        vmovdqa xmm4, XMMWORD PTR [rsp+8]
        vmovdqa xmm7, XMMWORD PTR [rsp+56]
        vpor    xmm6, xmm7, XMMWORD PTR [rsp+40]
        vaesenc xmm3, xmm3, xmm9
        vaesenc xmm2, xmm2, xmm10
        vpor    xmm10, xmm4, XMMWORD PTR [rsp-24]
        vmovdqa xmm4, XMMWORD PTR [rsp+72]
        vaesenc xmm3, xmm3, xmm6
        vaesenc xmm10, xmm2, xmm10
        vpor    xmm6, xmm4, xmm5
        vaesenc xmm5, xmm0, xmm12
        vaesenc xmm1, xmm3, xmm6
        vaesenc xmm5, xmm5, xmm10
        vaesenc xmm6, xmm5, xmm1
        vmovdqa XMMWORD PTR [rsp-120], xmm1
        vmovdqa XMMWORD PTR [rsp-104], xmm6
        cmp     rdi, rax
        jne     .L5
        sal     rdx, 6
        vmovdqa xmm5, xmm6
        vmovdqa xmm6, xmm1
        sub     rsi, rdx
        cmp     rsi, 15
        ja      .L16
.L6:
        test    sil, 15
        je      .L8
        sal     rsi, 4
        vmovdqu xmm0, XMMWORD PTR [rdi]
        vmovdqa xmm2, XMMWORD PTR .LC8[rip]
        vpand   xmm1, xmm12, XMMWORD PTR VectorsNeedNonVAriable2[rsi]
        vpand   xmm0, xmm0, XMMWORD PTR VectorsNeedNonVAriable1[rsi]
        vpor    xmm0, xmm0, xmm1
        vpshufb xmm1, xmm0, XMMWORD PTR .LC3[rip]
        vpshufb xmm3, xmm0, xmm2
        vaesenc xmm0, xmm5, xmm0
        vpshufb xmm2, xmm1, xmm2
        vaesenc xmm1, xmm12, xmm1
        vaesenc xmm3, xmm10, xmm3
        vaesenc xmm1, xmm0, xmm1
        vaesenc xmm2, xmm6, xmm2
        vaesenc xmm0, xmm1, xmm3
        vaesenc xmm5, xmm0, xmm2
.L8:
        vmovdqa XMMWORD PTR DDAES[rip], xmm5
        add     rsp, 96
        ret
.L9:
        vmovdqa xmm5, XMMWORD PTR .LC0[rip]
        vmovdqa xmm10, XMMWORD PTR .LC1[rip]
        vpxor   xmm6, xmm6, xmm6
        vmovdqa xmm12, XMMWORD PTR .LC2[rip]
        cmp     rsi, 15
        jbe     .L6
.L16:
        mov     rax, rsi
        mov     rdx, rsi
        vmovdqa xmm13, XMMWORD PTR .LC3[rip]
        vmovdqa xmm2, XMMWORD PTR .LC8[rip]
        and     rax, -16
        shr     rdx, 4
        add     rax, rdi
.L7:
        vmovdqu xmm0, XMMWORD PTR [rdi]
        add     rdi, 16
        vpshufb xmm1, xmm0, xmm13
        vpshufb xmm4, xmm0, xmm2
        vaesenc xmm5, xmm5, xmm0
        vaesenc xmm12, xmm12, xmm1
        vpshufb xmm3, xmm1, xmm2
        vaesenc xmm10, xmm10, xmm4
        vaesenc xmm5, xmm5, xmm12
        vaesenc xmm6, xmm6, xmm3
        vaesenc xmm5, xmm5, xmm10
        vaesenc xmm5, xmm5, xmm6
        cmp     rax, rdi
        jne     .L7
        sal     rdx, 4
        sub     rsi, rdx
        jmp     .L6
...

// icc 19.0.0 -O3 -mavx

SlowCopy128bit(char const*, char*):
        vmovdqu   xmm0, XMMWORD PTR [rdi]                       #6.129
        vmovdqu   XMMWORD PTR [rsi], xmm0                       #6.86
        ret                                                     #6.141
DoubleDeuceAES_Gumbotron(unsigned char const*, unsigned long):
        vmovups   xmm8, XMMWORD PTR .L_2il0floatpacket.0[rip]   #52.18
        vpxor     xmm15, xmm15, xmm15                           #55.18
        vmovups   xmm4, XMMWORD PTR .L_2il0floatpacket.1[rip]   #53.18
        vmovups   xmm1, XMMWORD PTR .L_2il0floatpacket.2[rip]   #54.18
        vmovdqu   xmm0, XMMWORD PTR .L_2il0floatpacket.3[rip]   #61.26
        vmovdqu   xmm7, XMMWORD PTR .L_2il0floatpacket.4[rip]   #62.37
        vmovdqu   xmm5, XMMWORD PTR .L_2il0floatpacket.5[rip]   #63.37
        vmovdqu   xmm6, XMMWORD PTR .L_2il0floatpacket.6[rip]   #64.37
        vmovdqu   xmm3, XMMWORD PTR .L_2il0floatpacket.7[rip]   #65.37
        vmovdqu   xmm2, XMMWORD PTR .L_2il0floatpacket.8[rip]   #70.29
        cmp       rsi, 64                                       #72.16
        jb        ..B2.6        # Prob 50%                      #72.16
        mov       rax, rsi                                      #73.19
        shr       rax, 6                                        #73.19
        dec       rax                                           #74.9
        cmp       rax, -1                                       #74.9
        je        ..B2.7        # Prob 10%                      #74.9
..B2.4:                         # Preds ..B2.2 ..B2.4
        vmovdqu   xmm7, XMMWORD PTR [48+rdi]                    #78.37
        vmovdqu   xmm0, XMMWORD PTR [rdi]                       #75.37
        vmovdqu   xmm5, XMMWORD PTR [16+rdi]                    #76.37
        vmovdqu   xmm3, XMMWORD PTR [32+rdi]                    #77.37
        vmovdqu   xmm12, XMMWORD PTR .L_2il0floatpacket.3[rip]  #79.9
        dec       rax                                           #74.9
        vpshufb   xmm9, xmm7, xmm12                             #79.9
        vpshufb   xmm13, xmm3, xmm12                            #80.9
        vaesenc   xmm10, xmm8, xmm0                             #108.12
        add       rsi, -64                                      #131.22
        vaesenc   xmm8, xmm10, xmm5                             #113.12
        add       rdi, 64                                       #74.19
        vaesenc   xmm11, xmm4, xmm9                             #109.12
        vaesenc   xmm4, xmm8, xmm3                              #118.12
        vpshufb   xmm8, xmm5, xmm12                             #81.9
        vpshufb   xmm12, xmm0, xmm12                            #82.9
        vaesenc   xmm14, xmm11, xmm13                           #114.12
        vaesenc   xmm14, xmm14, xmm8                            #119.12
        vaesenc   xmm11, xmm4, xmm7                             #123.12
        vaesenc   xmm10, xmm14, xmm12                           #124.12
        vmovups   XMMWORD PTR [-24+rsp], xmm10                  #124.12[spill]
        vaesenc   xmm14, xmm11, xmm10                           #128.12
        vmovdqu   xmm10, XMMWORD PTR .L_2il0floatpacket.4[rip]  #83.11
        vmovdqu   xmm11, XMMWORD PTR .L_2il0floatpacket.5[rip]  #84.11
        vpshufb   xmm4, xmm0, xmm10                             #83.11
        vpshufb   xmm2, xmm0, xmm11                             #84.11
        vpshufb   xmm0, xmm3, xmm6                              #85.11
        vpor      xmm4, xmm4, xmm0                              #87.9
        vaesenc   xmm1, xmm1, xmm4                              #110.12
        vmovdqu   xmm4, XMMWORD PTR .L_2il0floatpacket.7[rip]   #86.11
        vpshufb   xmm3, xmm3, xmm4                              #86.11
        vpshufb   xmm0, xmm5, xmm10                             #89.11
        vpor      xmm2, xmm2, xmm3                              #88.9
        vaesenc   xmm2, xmm1, xmm2                              #115.12
        vpshufb   xmm1, xmm5, xmm11                             #90.11
        vpshufb   xmm5, xmm7, xmm6                              #91.11
        vpshufb   xmm3, xmm8, xmm6                              #97.11
        vpshufb   xmm7, xmm7, xmm4                              #92.11
        vpshufb   xmm8, xmm8, xmm4                              #98.11
        vpshufb   xmm4, xmm12, xmm4                             #104.11
        vpor      xmm0, xmm0, xmm5                              #93.9
        vpor      xmm1, xmm1, xmm7                              #94.9
        vaesenc   xmm2, xmm2, xmm0                              #120.12
        vpshufb   xmm0, xmm9, xmm10                             #95.11
        vpshufb   xmm10, xmm13, xmm10                           #101.11
        vpor      xmm5, xmm0, xmm3                              #99.9
        vaesenc   xmm0, xmm15, xmm5                             #111.12
        vpshufb   xmm15, xmm9, xmm11                            #96.11
        vpshufb   xmm11, xmm13, xmm11                           #102.11
        vaesenc   xmm1, xmm2, xmm1                              #125.12
        vpor      xmm15, xmm15, xmm8                            #100.9
        vpshufb   xmm2, xmm12, xmm6                             #103.11
        vaesenc   xmm3, xmm0, xmm15                             #116.12
        vpor      xmm5, xmm10, xmm2                             #105.9
        vaesenc   xmm7, xmm3, xmm5                              #121.12
        vpor      xmm9, xmm11, xmm4                             #106.9
        vaesenc   xmm15, xmm7, xmm9                             #126.12
        vaesenc   xmm13, xmm14, xmm1                            #129.12
        vaesenc   xmm8, xmm13, xmm15                            #130.12
        vmovups   xmm4, XMMWORD PTR [-24+rsp]                   #74.9[spill]
        cmp       rax, -1                                       #74.9
        jne       ..B2.4        # Prob 82%                      #74.9
        vmovdqu   xmm2, XMMWORD PTR .L_2il0floatpacket.8[rip]   #
        vmovdqu   xmm0, XMMWORD PTR .L_2il0floatpacket.3[rip]   #
..B2.6:                         # Preds ..B2.5 ..B2.1
        cmp       rsi, 16                                       #136.15
        jb        ..B2.11       # Prob 50%                      #136.15
..B2.7:                         # Preds ..B2.2 ..B2.6
        mov       rax, rsi                                      #137.19
        shr       rax, 4                                        #137.19
        dec       rax                                           #138.9
        cmp       rax, -1                                       #138.9
        je        ..B2.11       # Prob 10%                      #138.9
..B2.9:                         # Preds ..B2.7 ..B2.9
        vmovdqu   xmm7, XMMWORD PTR [rdi]                       #139.35
        vpshufb   xmm5, xmm7, xmm0                              #140.21
        vpshufb   xmm3, xmm7, xmm2                              #141.21
        vpshufb   xmm6, xmm5, xmm2                              #142.26
        vaesenc   xmm4, xmm4, xmm5                              #144.12
        dec       rax                                           #138.9
        vaesenc   xmm8, xmm8, xmm7                              #143.12
        add       rdi, 16                                       #139.35
        vaesenc   xmm1, xmm1, xmm3                              #145.12
        add       rsi, -16                                      #150.22
        vaesenc   xmm9, xmm8, xmm4                              #147.12
        vaesenc   xmm15, xmm15, xmm6                            #146.12
        vaesenc   xmm10, xmm9, xmm1                             #148.12
        vaesenc   xmm8, xmm10, xmm15                            #149.12
        cmp       rax, -1                                       #138.9
        jne       ..B2.9        # Prob 82%                      #138.9
..B2.11:                        # Preds ..B2.9 ..B2.7 ..B2.6
        test      rsi, 15                                       #154.13
        je        ..B2.13       # Prob 50%                      #154.13
        shl       rsi, 4                                        #157.47
        vmovdqu   xmm3, XMMWORD PTR [rdi]                       #157.18
        mov       rax, QWORD PTR Mumbotron[rip]                 #157.18
        mov       rdx, QWORD PTR Jumbotron[rip]                 #159.15
        vpand     xmm5, xmm3, XMMWORD PTR [rsi+rax]             #157.18
        vpand     xmm6, xmm4, XMMWORD PTR [rsi+rdx]             #159.15
        vpor      xmm7, xmm5, xmm6                              #160.18
        vpshufb   xmm10, xmm7, xmm0                             #162.20
        vpshufb   xmm0, xmm7, xmm2                              #163.20
        vpshufb   xmm2, xmm10, xmm2                             #164.25
        vaesenc   xmm8, xmm8, xmm7                              #165.12
        vaesenc   xmm4, xmm4, xmm10                             #166.12
        vaesenc   xmm9, xmm8, xmm4                              #169.12
        vaesenc   xmm1, xmm1, xmm0                              #167.12
        vaesenc   xmm11, xmm9, xmm1                             #170.12
        vaesenc   xmm12, xmm15, xmm2                            #168.12
        vaesenc   xmm8, xmm11, xmm12                            #171.12
..B2.13:                        # Preds ..B2.12 ..B2.11
        vmovups   XMMWORD PTR DDAES[rip], xmm8                  #6.86
        ret                                                     #174.1
__sti__$E:
        mov       QWORD PTR Mumbotron[rip], offset flat: VectorsNeedNonVAriable1 #28.47
        mov       QWORD PTR Jumbotron[rip], offset flat: VectorsNeedNonVAriable2 #49.47
        ret                                                     #28.47
Mumbotron:
Jumbotron:
DDAES:
VectorsNeedNonVAriable1:
...
VectorsNeedNonVAriable2:
...
.L_2il0floatpacket.0:
        .long   0x6295c58d,0x62b82175,0x07bb0142,0x6c62272e
.L_2il0floatpacket.1:
        .long   0xc4e576cc,0x2d98c384,0xaac55036,0xdd268dbc
.L_2il0floatpacket.2:
        .long   0xcaee0535,0x1023b4c8,0x47b6bbb3,0xc8b15368
.L_2il0floatpacket.3:
        .long   0x0c0d0e0f,0x08090a0b,0x04050607,0x00010203
.L_2il0floatpacket.4:
        .long   0x80018000,0x80038002,0x80058004,0x80078006
.L_2il0floatpacket.5:
        .long   0x80098008,0x800b800a,0x800d800c,0x800f800e
.L_2il0floatpacket.6:
        .long   0x01800080,0x03800280,0x05800480,0x07800680
.L_2il0floatpacket.7:
        .long   0x09800880,0x0b800a80,0x0d800c80,0x0f800e80
.L_2il0floatpacket.8:
        .long   0x09010800,0x0b030a02,0x0d050c04,0x0f070e06
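
A note for anyone decoding the listing: the six constants `.L_2il0floatpacket.3` through `.L_2il0floatpacket.8` are `vpshufb` control masks, and what they do can be checked without AES-NI hardware. Below is a minimal scalar sketch, assuming the documented PSHUFB semantics (a control byte with bit 7 set yields zero, otherwise its low 4 bits select a source byte) and little-endian storage of the `.long` values. The names `PACKET3` to `PACKET8`, `pshufb_model`, `show_shuffle` and `show_interleave` are hypothetical helpers for illustration, not part of the Gumbotron source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Shuffle-control constants copied from .L_2il0floatpacket.3 .. .8 above. */
static const uint32_t PACKET3[4] = {0x0c0d0e0f, 0x08090a0b, 0x04050607, 0x00010203};
static const uint32_t PACKET4[4] = {0x80018000, 0x80038002, 0x80058004, 0x80078006};
static const uint32_t PACKET5[4] = {0x80098008, 0x800b800a, 0x800d800c, 0x800f800e};
static const uint32_t PACKET6[4] = {0x01800080, 0x03800280, 0x05800480, 0x07800680};
static const uint32_t PACKET7[4] = {0x09800880, 0x0b800a80, 0x0d800c80, 0x0f800e80};
static const uint32_t PACKET8[4] = {0x09010800, 0x0b030a02, 0x0d050c04, 0x0f070e06};

/* Scalar model of PSHUFB on one 128-bit lane; dst must not alias src. */
static void pshufb_model(uint8_t dst[16], const uint8_t src[16], const uint32_t ctl[4])
{
    for (int i = 0; i < 16; i++) {
        /* Little-endian: vector byte i is byte (i % 4) of long (i / 4). */
        uint8_t c = (uint8_t)(ctl[i / 4] >> (8 * (i % 4)));
        dst[i] = (c & 0x80) ? 0 : src[c & 0x0F];
    }
}

/* Shuffle a 16-character string; zeroed bytes are rendered as '.'. */
static void show_shuffle(char out[17], const char *src16, const uint32_t ctl[4])
{
    uint8_t v[16];
    assert(strlen(src16) >= 16);
    pshufb_model(v, (const uint8_t *)src16, ctl);
    for (int i = 0; i < 16; i++)
        out[i] = v[i] ? (char)v[i] : '.';
    out[16] = '\0';
}

/* Model of a vpshufb / vpshufb / vpor triple as it appears in the 64-byte loop:
   two different blocks are shuffled with complementary masks and OR-ed. */
static void show_interleave(char out[17], const char *a16, const char *b16,
                            const uint32_t ctl_a[4], const uint32_t ctl_b[4])
{
    uint8_t x[16], y[16];
    assert(strlen(a16) >= 16 && strlen(b16) >= 16);
    pshufb_model(x, (const uint8_t *)a16, ctl_a);
    pshufb_model(y, (const uint8_t *)b16, ctl_b);
    for (int i = 0; i < 16; i++) {
        uint8_t v = (uint8_t)(x[i] | y[i]);
        out[i] = v ? (char)v : '.';
    }
    out[16] = '\0';
}
```

Applied to the string `ABCDEFGHIJKLMNOP`, packet 3 gives `PONMLKJIHGFEDCBA` (a full byte reversal), packet 4 gives `A.B.C.D.E.F.G.H.` (low half spread over the even positions), and packet 8 gives `AIBJCKDLEMFNGOHP` (low and high halves interleaved). Combining packet 4 on one block with packet 6 on another through an OR, as the `vpor` instructions in the main loop do, interleaves the low halves of two blocks: `ABCDEFGHIJKLMNOP` and `abcdefghijklmnop` become `AaBbCcDdEeFfGgHh`. Packets 5 and 7 do the same for the high halves.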

I hope someone will improve on it and share the result.
