Optimize scalar secret access

So notably XXH3 assumes you got unaligned access, as it has become kinda standard on modern processors. However, it doesn't have to be as bad as it is for strict alignment targets.

With RISC-V gaining traction and NOT having unaligned access, we have a modern 64-bit target that will be running scalar with strict alignment.

One thing that could be clearly optimized is the secret access.

When we remove SIMD and look at XXH3 on an 8 byte granularity, we can see that we have a FIFO queue. (Half scale)

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
[   stripe 1    ]
    [    stripe 2   ]
        [    stripe 3   ]
            [    stripe 4   ]
                [    stripe 5   ]

We could create a ring buffer of ready-to-use values:

XXH3_accumulate(..., xsecret)
{
    u64 secret_buffer[8];
    int secret_buffer_pos = 0;
    for (i = 0; i < 7; i++) { // one less
        secret_buffer[i] = XXH_readLE64(xsecret + 8 * i);
     }
     for each stripe {
         secret_buffer[(secret_buffer_pos + 7) % 8] = XXH_readLE64(xsecret + ...);
         XXH3_accumulate_512(...);
         secret_buffer_pos = (secret_buffer_pos + 1) % 8;
     }
}

This could be the secret (heh) to a possible massive performance increase on strict alignment scalar without an O(n) buffer, as with 8 byte granularity, the overlapping reads are entirely avoidable, and we have 1/8 the unaligned accesses at the cost of a ring buffer which could possibly be done in registers as well (but that would be detrimental to dumber compilers).

This would only make sense on strict alignment/big endian with slow swaps.

Cyan4973 / xxHash

Optimize scalar secret access #870