Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/
Other
9.04k stars 773 forks source link

Optimize scalar secret access #870

Open easyaspi314 opened 1 year ago

easyaspi314 commented 1 year ago

So notably XXH3 assumes you got unaligned access, as it has become kinda standard on modern processors. However, it doesn't have to be as bad as it is for strict alignment targets.

With RISC-V gaining traction and NOT having unaligned access, we have a modern 64-bit target that will be running scalar with strict alignment.

One thing that could be clearly optimized is the secret access.

When we remove SIMD and look at XXH3 on an 8 byte granularity, we can see that we have a FIFO queue. (Half scale)

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
[   stripe 1    ]
    [    stripe 2   ]
        [    stripe 3   ]
            [    stripe 4   ]
                [    stripe 5   ]

We could create a ring buffer of ready-to-use values:

XXH3_accumulate(..., xsecret)
{
    u64 secret_buffer[8];
    int secret_buffer_pos = 0;
    for (i = 0; i < 7; i++) { // one less
        secret_buffer[i] = XXH_readLE64(xsecret + 8 * i);
     }
     for each stripe {
         secret_buffer[(secret_buffer_pos + 7) % 8] = XXH_readLE64(xsecret + ...);
         XXH3_accumulate_512(...);
         secret_buffer_pos = (secret_buffer_pos + 1) % 8;
     }
}

This could be the secret (heh) to a possible massive performance increase on strict alignment scalar without an O(n) buffer, as with 8 byte granularity, the overlapping reads are entirely avoidable, and we have 1/8 the unaligned accesses at the cost of a ring buffer which could possibly be done in registers as well (but that would be detrimental to dumber compilers).

This would only make sense on strict alignment/big endian with slow swaps.

sh1boot commented 1 year ago

I believe RISC-V implementations are allowed unaligned access, so be careful how such code is enabled, so as to not slow down the cores that support unaligned.