Translating my AVX2 optimizations to SSE2 got it nice and performant for me. Unfortunately scalar continues to kick my ass. Think I'll call it good enough for now & wait for the updated algorithm to be a bit more finalized before evaluating & optimizing short keys.
Method | ByteLength | Mean | Throughput |
---|---|---|---|
xxHash64_Scalar | 102400 | 8.390 us | 11,639.8 MB/s |
xxHash3_AVX2 | 102400 | 2.444 us | 39,959.6 MB/s |
xxHash3_SSE2 | 102400 | 4.917 us | 19,859.1 MB/s |
xxHash3_Scalar | 102400 | 23.433 us | 4,167.5 MB/s |
( @easyaspi314 - it's effectively a completely new set of SSE2 code, so there's a fair chance whatever caused your hang has resolved itself )
Huh. Apparently, System.Runtime.Intrinsics.X86.Bmi2.X64.MultiplyNoFlags will cause a hang on the first preview. I had to find that out the hard way, because apparently the debugger hates me and won't let me pause.
I updated to preview3, and it worked fine. I honestly didn't even realize I was on the wrong preview.
// * Summary *
BenchmarkDotNet=v0.11.4, OS=macOS Mojave 10.14.2 (18C54) [Darwin 18.2.0]
Intel Core i7-2635QM CPU 2.00GHz (Sandy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview3-010431
[Host] : .NET Core 3.0.0-preview3-27503-5 (CoreCLR 4.6.27422.72, CoreFX 4.7.19.12807), 64bit RyuJIT
Job-RNHJPJ : .NET Core 3.0.0-preview3-27503-5 (CoreCLR 4.6.27422.72, CoreFX 4.7.19.12807), 64bit RyuJIT
MaxRelativeError=0.05 EnvironmentVariables=COMPlus_TieredCompilation=0
Method | ByteLength | Mean | Error | StdDev | Throughput |
---|---|---|---|---|---|
xxHash64_Scalar | 102400 | 16.766 us | 0.1894 us | 0.1772 us | 5,824.8 MB/s |
xxHash3_AVX2 | 102400 | 6.826 us | 0.0369 us | 0.0345 us | 14,306.4 MB/s |
xxHash3_SSE2 | 102400 | 6.831 us | 0.0400 us | 0.0355 us | 14,296.1 MB/s |
xxHash3_Scalar | 102400 | 34.039 us | 0.3778 us | 0.3534 us | 2,868.9 MB/s |
For reference, I made this change, which I highly suggest you do as well:
#if NETCOREAPP3_0
    if (System.Runtime.Intrinsics.X86.Avx2.IsSupported && UseAvx2)
    {
        LongSequenceHash_AVX2(ref acc2, data);
    }
    // Fall back to SSE2 if we request AVX2 on an unsupported machine.
    else if (System.Runtime.Intrinsics.X86.Sse2.IsSupported && (UseSse2 || UseAvx2))
    {
        LongSequenceHash_SSE2(ref acc2, data);
    }
    else
#endif
    {
        LongSequenceHash_Scalar(ref acc2, data);
    }
Unrolling XXH64 8 times seems to help a little bit, although it could just be fluctuations.
Here are the results compared with native.
Method | ByteLength | Mean | Error | StdDev | Throughput |
---|---|---|---|---|---|
xxHash64_Scalar | 102400 | 15.441 us | 0.1430 us | 0.1268 us | 6,324.4 MB/s |
xxHash3_SSE2 | 102400 | 6.912 us | 0.0885 us | 0.0785 us | 14,129.3 MB/s |
xxHash3_Scalar | 102400 | 33.865 us | 0.1445 us | 0.1207 us | 2,883.7 MB/s |
XXH64_Native (clang -march=sandybridge, inline asm hack) | 102400 | 12.470 us | 0.0742 us | 0.0694 us | 7,831.0 MB/s |
XXH3_Native (clang -march=sandybridge) | 102400 | 6.185 us | 0.0356 us | 0.0315 us | 15,789.8 MB/s |
XXH64_Native (clang -march=x86-64, no inline asm) | 102400 | 14.538 us | 0.1167 us | 0.1092 us | 6,717.1 MB/s |
XXH3_Native (clang -march=x86-64) | 102400 | 6.323 us | 0.0391 us | 0.0347 us | 15,444.9 MB/s |
Note: The inline assembly hack was this:
static U64 XXH64_round(U64 acc, U64 input)
{
    __asm__(
        "imulq %[PRIME64_2], %[input]\n"
        "addq  %[input], %[acc]\n"
        "shldq $31, %[acc], %[acc]\n"
        "imulq %[PRIME64_1], %[acc]"
        : [acc] "+r" (acc), [input] "+r" (input)
        : [PRIME64_1] "r" (PRIME64_1), [PRIME64_2] "r" (PRIME64_2)
    );
    return acc;
}
This is because while Clang will properly tune to use shld on Sandy Bridge (it has better throughput than rol for some reason), it improperly generates 4 excess mov instructions at the bottom of the loop for no reason.
Either way, that is very nice performance, good job!
Edit: How I interfaced with native:
[DllImport("/Users/user/xxh3-fix/libxxhash.dylib", CallingConvention = CallingConvention.Cdecl)]
internal extern static ulong XXH64(byte[] data, UIntPtr length, ulong seed);
[DllImport("/Users/user/xxh3-fix/libxxhash.dylib", CallingConvention = CallingConvention.Cdecl)]
internal extern static ulong XXH3_64bits(byte[] data, UIntPtr length);
[Benchmark]
public unsafe ulong XXH64_Native()
{
return XXH64(_bytes, (UIntPtr)_bytes.Length, 0);
}
[Benchmark]
public unsafe ulong XXH3_Native()
{
return XXH3_64bits(_bytes, (UIntPtr)_bytes.Length);
}
> internal extern
:thinking:
Scalar isn't that fast, tbh.
The reason XXH3 is so fast is because it takes advantage of SSE2 and NEON which are guaranteed to be supported in all modern desktop and mobile CPUs.
The scalar version was made to be decent, and it is pretty decent in most scenarios.
But, yeah, for perspective, this is what XXH3 looks like on the native one with no SSE2:
clang -march=x86-64 -O3 -DXXH_VECTORIZE=0 -mno-sse # had to remove for xxhsum
./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32 : 102400 -> 42755 it/s ( 4175.3 MB/s)
XXH32 unaligned : 102400 -> 42980 it/s ( 4197.3 MB/s)
XXH64 : 102400 -> 69066 it/s ( 6744.8 MB/s)
XXH64 unaligned : 102400 -> 67172 it/s ( 6559.7 MB/s)
XXH3_64bits : 102400 -> 47720 it/s ( 4660.1 MB/s)
XXH3_64b unaligned : 102400 -> 45991 it/s ( 4491.3 MB/s)
clang -m32 -O3 -DXXH_VECTORIZE=0 -mno-sse
./xxhsum 0.7.0 (32-bits i386 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32 : 102400 -> 35037 it/s ( 3421.6 MB/s)
XXH32 unaligned : 102400 -> 34626 it/s ( 3381.4 MB/s)
XXH64 : 102400 -> 13667 it/s ( 1334.6 MB/s)
XXH64 unaligned : 102400 -> 13618 it/s ( 1329.9 MB/s)
XXH3_64bits : 102400 -> 24640 it/s ( 2406.2 MB/s)
XXH3_64b unaligned : 102400 -> 24091 it/s ( 2352.7 MB/s)
Considering that it is C#, I wouldn't be too worried about it, because most people are going to be running on x86_64.
Thanks, that is a helpful point of comparison; glad to see the gap between my version and the native version is a fair bit smaller than I had thought. Unfortunately, since the SSE2 & AVX2 code requires .NET Core 3.0, C# executed by Mono or the plain old .NET Framework (i.e. the vast majority of it today) will be using the scalar version, so I am still going to concern myself with optimizing it as much as possible.
@Cyan4973 @42Bastian @Zhentar @sergeevabc @ifduyue
Wanna try this snippet and make sure it detects features properly on x86/x86_64? Try as many compilers as possible.
#include <stdio.h>
#include <string.h>
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_AMD64) || defined(_M_X64))
# include <intrin.h>
# define CPUIDEX __cpuidex
# define CPUID __cpuid
#elif (defined(__GNUC__) || defined(__TINYC__)) && (defined(__i386__) || defined(__x86_64__))
static void CPUIDEX(int *cpuInfo, int function_id, int subfunction_id)
{
    int eax = function_id;
    int ecx = subfunction_id;
    /* ecx must be an input-output ("+c") so the subleaf actually gets passed in. */
    __asm__ __volatile__("cpuid"
        : "+a" (eax), "=b" (cpuInfo[1]), "+c" (ecx), "=d" (cpuInfo[3]));
    cpuInfo[0] = eax;
    cpuInfo[2] = ecx;
}
static void CPUID(int *cpuInfo, int function_id)
{
    CPUIDEX(cpuInfo, function_id, 0);
}
#else
# warning "x86 or x86_64 please, add your compiler to the macros if needed and tell me"
static void CPUIDEX(int *cpuInfo, int function_id, int subfunction_id)
{
    memset(cpuInfo, 0, 4 * sizeof(int));
}
static void CPUID(int *cpuInfo, int function_id)
{
    memset(cpuInfo, 0, 4 * sizeof(int));
}
#endif
int main(void)
{
    int max, data[4];
    CPUID(data, 0);
    max = data[0];
    if (max >= 1) {
        CPUID(data, 1);
        printf("sse2: %d\n", !!(data[3] & (1 << 26)));  /* EDX bit 26 */
        printf("avx: %d\n", !!(data[2] & (1 << 28)));   /* ECX bit 28 */
    }
    if (max >= 7) {
        CPUIDEX(data, 7, 0);                            /* leaf 7, subleaf 0 */
        printf("avx2: %d\n", !!(data[1] & (1 << 5)));   /* EBX bit 5 */
    }
    return 0;
}
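For what it's worth, on a Sandy Bridge machine like mine (SSE2 and AVX, but no AVX2), the output should look roughly like this:

sse2: 1
avx: 1
avx2: 0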
The collision analyzer has been put to good use over the weekend. It's a very precise tool, which makes it possible to observe subtle differences that are otherwise impossible to notice with "standard" test tools (aka smhasher et al.).
This leads to several small but key modifications to the algorithm, bringing it in line with the maximal theoretical collision rate.
As part of the changes, the function mul128_fold64 has been slightly modified to use ^ for folding instead of +. I could update most variants except the aarch64 one:
https://github.com/Cyan4973/xxHash/blob/xxh3/xxh3.h#L163
@easyaspi314, would you mind having a look at it? Thanks!
I have inserted a WIP CPU dispatcher into xxhsum. I have binaries for macOS and Linux here for both 32-bit and 64-bit. I am too lazy to start up my Windows PC, so lmk if you actually want a build.
The Linux binaries are static, and they should work in WSL.
xxhsum-binaries-linux-darwin.zip
If you run the benchmark, it should tell you what version of xxh3 it runs:
./xxhsum-darwin-i686 0.7.0 (32-bits i386 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32 : 102400 -> 34108 it/s ( 3330.8 MB/s)
XXH32 unaligned : 102400 -> 33667 it/s ( 3287.8 MB/s)
XXH64 : 102400 -> 13455 it/s ( 1314.0 MB/s)
XXH64 unaligned : 102400 -> 13527 it/s ( 1321.0 MB/s)
1-XXH3_64bits : 102400 ->
Using HashLong version __XXH3_HASH_LONG_SSE2
XXH3_64bits : 102400 -> 158898 it/s (15517.4 MB/s)
XXH3_64b unaligned : 102400 -> 158204 it/s (15449.6 MB/s)
If you have a CPU with AVX2, it should say __XXH3_HASH_LONG_AVX2.
Edit: It is to be noted that these were compiled for generic i686 and generic x86-64. These all work on my system.
Also, sure I'll take a look at the multiply code.
I don't know if it is worth it to do the inline assembly for that. Pretty much the only aarch64 compilers are GCC- or LLVM-based, and therefore they should have __uint128_t.
MSVC for ARM/aarch64 is pretty much dead (was it ever alive 😛?), and besides, MSVC uses a different syntax.
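For what it's worth, the MSVC x64 variant shouldn't need inline assembly either; something along these lines ought to work (just a sketch, the function name is mine, and _umul128 is x64-only):

#include <intrin.h>   /* _umul128 (MSVC, x64 only) */

/* Sketch of the xor-folding 128-bit multiply for MSVC x64.
 * _umul128 returns the low 64 bits and writes the high 64 bits through a pointer. */
static unsigned __int64 XXH3_mul128_fold64_msvc(unsigned __int64 ll1, unsigned __int64 ll2)
{
    unsigned __int64 hi;
    unsigned __int64 const lo = _umul128(ll1, ll2, &hi);
    return lo ^ hi;
}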
Tested Windows binaries I cross-compiled with mingw. (Cross compiling for Windows? 🤯)
xxhsum-windows-binaries.zip reuploaded
Please test so I know that my dispatcher is working correctly; I currently have no idea if it is because I only have Sandy Bridge.
I'll test when I get home if my sister will ever give up her laptop. :joy:
Actually, works fine in wine.
It's picking SSE2 rather than AVX2 for me
Oh, I think I know the problem, it isn't clearing ecx. Here's a new one with Windows, Mac, and Linux binaries.
When I get home, I might see if I can find and boot my dinosaur Pentium III laptop with Windows 2000 to see if it chooses the scalar version just because.
Heh, the virtual desktop at my college has AVX2.
I can run a Windows 10 VM from a Mac laptop at work. Here are some results :
.\xxhsum-windows-i686.exe -b5
xxhsum-windows-i686.exe 0.7.0 (32-bits i386 little endian), GCC 8.3.0, by Yann Collet
Sample of 100 KB...
1-XXH3_64bits : 102400 ->
Using HashLong version __XXH3_HASH_LONG_AVX2
XXH3_64bits : 102400 -> 512000 it/s (50000.0 MB/s)
.\xxhsum-windows-x86_64.exe -b5
xxhsum-windows-x86_64.exe 0.7.0 (64-bits x86_64 + SSE2 little endian), GCC 8.3.0, by Yann Collet
Sample of 100 KB...
1-XXH3_64bits : 102400 ->
Using HashLong version __XXH3_HASH_LONG_AVX2
XXH3_64bits : 102400 -> 513542 it/s (50150.5 MB/s)
So both versions correctly detect and use the AVX2 code path.
Cool. I'll make a branch.
By the way, I also made it so we only have to write the x86 code once with the power of…
( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( (M) (A) (C) (R) (O) (S) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )
#define XXH_CONCAT_2(x, y) x##y
#define XXH_CONCAT(x, y) XXH_CONCAT_2(x, y)
/* These macros and typedefs make it so we only have to write the x86 code once.
* SSE2 AVX2
* XXH_vec = __m128i __m256i
* XXH_MM(add_epi32) = _mm_add_epi32 _mm256_add_epi32
* XXH_MM_SI(loadu) = _mm_loadu_si128 _mm256_loadu_si256 */
#if XXH_VECTOR == XXH_AVX2
typedef __m256i XXH_vec;
# define XXH_MM(x) XXH_CONCAT(_mm256_, x)
# define XXH_MM_SI(x) XXH_MM(XXH_CONCAT(x, _si256))
#elif XXH_VECTOR == XXH_SSE2
typedef __m128i XXH_vec;
# define XXH_MM(x) XXH_CONCAT(_mm_, x)
# define XXH_MM_SI(x) XXH_MM(XXH_CONCAT(x, _si128))
#endif
/* lame, I know */
#define VEC_SIZE sizeof(XXH_vec)
XXH_FORCE_INLINE void
XXH3_accumulate_512(void *restrict acc, const void *restrict data, const void *restrict key)
{
#if (XXH_VECTOR == XXH_AVX2) || (XXH_VECTOR == XXH_SSE2)
    assert((((size_t)acc) & (VEC_SIZE - 1)) == 0);
    {   ALIGN(VEC_SIZE) XXH_vec* const xacc  = (XXH_vec *) acc;
        const XXH_vec*  const xdata = (const XXH_vec *) data;
        const XXH_vec*  const xkey  = (const XXH_vec *) key;
        size_t i;
        for (i = 0; i < STRIPE_LEN / VEC_SIZE; i++) {
            XXH_vec const d   = XXH_MM_SI(loadu) (xdata + i);
            XXH_vec const k   = XXH_MM_SI(loadu) (xkey + i);
            XXH_vec const dk  = XXH_MM_SI(xor) (d, k);       /* uint32 dk[8] = {d0^k0, d1^k1, d2^k2, d3^k3, ...} */
            XXH_vec const res = XXH_MM(mul_epu32) (dk, XXH_MM(shuffle_epi32) (dk, 0x31));  /* uint64 res[4] = {dk0*dk1, dk2*dk3, ...} */
            XXH_vec const add = XXH_MM(add_epi64) (d, xacc[i]);
            xacc[i] = XXH_MM(add_epi64) (res, add);
        }
    }
#else /* ... scalar variant ... */
#endif
}
XXH_FORCE_INLINE void
XXH3_scrambleAcc(void* restrict acc, const void* restrict key)
{
#if (XXH_VECTOR == XXH_AVX2) || (XXH_VECTOR == XXH_SSE2)
    assert((((size_t)acc) & (VEC_SIZE - 1)) == 0);
    {   ALIGN(VEC_SIZE) XXH_vec* const xacc = (XXH_vec*) acc;
        const XXH_vec* const xkey = (const XXH_vec*) key;
        const XXH_vec k1 = XXH_MM(set1_epi32) ((int)PRIME32_1);
        const XXH_vec k2 = XXH_MM(set1_epi32) ((int)PRIME32_2);
        size_t i;
        for (i = 0; i < STRIPE_LEN / VEC_SIZE; i++) {
            XXH_vec data = xacc[i];
            XXH_vec const shifted = XXH_MM(srli_epi64) (data, 47);
            data = XXH_MM_SI(xor) (data, shifted);

            {   XXH_vec const k   = XXH_MM_SI(loadu) (xkey + i);
                XXH_vec const dk  = XXH_MM_SI(xor) (data, k);   /* U32 dk[4] = {d0^k0, d1^k1, d2^k2, d3^k3} */
                XXH_vec const dk1 = XXH_MM(mul_epu32) (dk, k1);
                XXH_vec const d2  = XXH_MM(shuffle_epi32) (dk, 0x31);
                XXH_vec const dk2 = XXH_MM(mul_epu32) (d2, k2);
                xacc[i] = XXH_MM_SI(xor) (dk1, dk2);
            }
        }
    }
#else /* ... */
#endif
}
(I know I need to realign)
As for my Makefile logic, it will compile the multiple code paths if you add MULTI_TARGET=1 to make. IDK if we should do it by default or not, so for now it is optional.
Basically, the hashLong code is now in xxh3-target.c. If XXH_MULTI_TARGET is not defined, it will just be #included in xxh3.h; otherwise, it will be compiled three times to make xxh3-scalar.o, xxh3-avx2.o, and xxh3-sse2.o.
# XXX: enable by default?
ifndef MULTI_TARGET
TARGET_OBJS :=
else
# Multi targeting only works for x86 and x86_64 right now.
ifneq (,$(filter __i386__ __x86_64__ _M_IX86 _M_X64 _M_AMD64,$(shell $(CC) -E -dM -xc /dev/null)))
TARGET_OBJS := xxh3-avx2.o xxh3-sse2.o xxh3-scalar.o
CFLAGS += -DXXH_MULTI_TARGET
else
TARGET_OBJS :=
endif
endif
xxhsum: xxhash.o xxhsum.o $(TARGET_OBJS)
xxh3-avx2.o: xxh3-target.c xxhash.h
$(CC) -c $(FLAGS) $< -mavx2 -o $@
xxh3-sse2.o: xxh3-target.c xxhash.h
$(CC) -c $(FLAGS) $< -msse2 -mno-sse3 -o $@
xxh3-scalar.o: xxh3-target.c xxhash.h
$(CC) -c $(FLAGS) $< -mno-sse2 -o $@
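The runtime half is then just picking one of the three compiled hashLong variants based on the detected CPU mode, roughly like this sketch (symbol names are hypothetical, not the actual ones in xxh3-target.c):

#include <stddef.h>

/* Hypothetical sketch of the runtime selection; cpu_mode is assumed to be set
 * once at startup by a CPUID-based feature test. */
typedef void (*XXH3_hashLong_f)(U64* acc, const void* data, size_t len);

extern void XXH3_hashLong_avx2  (U64* acc, const void* data, size_t len);
extern void XXH3_hashLong_sse2  (U64* acc, const void* data, size_t len);
extern void XXH3_hashLong_scalar(U64* acc, const void* data, size_t len);

static XXH3_hashLong_f XXH3_selectHashLong(void)
{
    switch (cpu_mode) {
    case XXH_CPU_MODE_AVX2: return XXH3_hashLong_avx2;
    case XXH_CPU_MODE_SSE2: return XXH3_hashLong_sse2;
    default:                return XXH3_hashLong_scalar;
    }
}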
I was checking Agner Fog's timing tables and I think we should change pshufd to psrlq.
There is little to no difference on modern processors, but pshufd tends to be slower than psrlq on older chips:
CPU | pshufd cycles | pshufd latency | psrlq cycles | psrlq latency |
---|---|---|---|---|
AMD K8 | 3 | 3 | 2 | 2 |
AMD K10 | 1 | 3 | 1 | 3 |
AMD Big Rigs | 1 | 2 | 1 | 2 |
AMD Ryzen+ | 2 | 1 | 2 | 1 |
AMD Bobcat | 3 | 2 | 2 | 1 |
AMD Jaguar | 1 | 2 | 1 | 1 |
Intel Pentium 4 | 1 | 5 | 1 | 5 |
Intel Pentium M | 3 | 2 | 2 | 2 |
Intel Merom | 3 | 1 | 1 | 1 |
Intel Penryn+ | 1 | 1 | 1 | 1 |
Wait, scratch that. Duh. pshufd is going to be faster because it doesn't overwrite the source operand; SSE2 would need an extra movdqa with psrlq.
@Cyan4973 would this be a correct implementation of the (updated) scalar loops? Even when compiled without SSE2 (I checked the assembly, all scalar instructions), I get about 6.6-7.0 GB/s this way on 64-bit, compared to 5.3 with the previous method. The reason is that this makes Clang use 64-bit xor instructions instead of pairs of 32-bit xors.
Since this changes nothing for 32-bit (it will be doing everything with two registers anyway), it doesn't slow it down.
It only works with XXH_readLE64 though; two XXH_readLE32s and a shift+or don't work.
XXH_FORCE_INLINE void
XXH3_accumulate_512(void *restrict acc, const void *restrict data, const void *restrict key)
{
    U64* const xacc = (U64*) acc;   /* presumed aligned */
    const U32* const xdata = (const U32*) data;
    const U32* const xkey  = (const U32*) key;
    size_t i;
    for (i = 0; i < ACC_NB; i++) {
        U64 const data_val = XXH_readLE64(xdata + 2 * i);
        U64 const key_val  = XXH3_readKey64(xkey + 2 * i);
        U64 const data_key = key_val ^ data_val;
        xacc[i] += XXH_mult32to64(data_key & 0xFFFFFFFF, data_key >> 32);
        xacc[i] += data_val;
    }
}
XXH_FORCE_INLINE void
XXH3_scrambleAcc(void* restrict acc, const void* restrict key)
{
    U64* const xacc = (U64*) acc;
    const U32* const xkey = (const U32*) key;
    size_t i;
    for (i = 0; i < ACC_NB; i++) {
        U64 const acc_val  = xacc[i];
        U64 const shifted  = acc_val >> 47;
        U64 const data     = acc_val ^ shifted;
        U64 const key_val  = XXH3_readKey64(xkey + 2 * i);
        U64 const data_key = key_val ^ data;
        U64 const product1 = XXH_mult32to64(PRIME32_1, (data_key & 0xFFFFFFFF));
        U64 const product2 = XXH_mult32to64(PRIME32_2, (data_key >> 32));
        xacc[i] = product1 ^ product2;
    }
}
@Zhentar this might be useful in your quest to optimize your C# implementation.
Looks great @easyaspi314, the accumulate_512 function looks very good to me.
Sidenote: I'm modifying (again!) the scramble function, in a way which should be more friendly to scalar (though that's a side effect; the main objective is to nullify space reduction, which the new version seems to tackle completely). I'll update the xxh3 branch later today.
Note: OK, I've got 2 possibilities for the scrambler.
I selected the one which is slightly more friendly to the scalar version. It adds one operation in the vectorial version, but also saves one const register.
I've uploaded the new version, even though it's still undergoing quality testing. Collision rate is perfect, but I'm not yet sure of dispersion / bias.
Note 2: I tested your scalar variant of accumulate_512, and it's way better: 50% faster for gcc, and 2x faster for clang, which was already much faster than gcc to begin with. I suspect clang might autovectorize something in the new scalar path, because its resulting level of performance is better than XXH64 (though not at the level of the manual SSE2 variant).
You're killin' me, Smalls! :joy:
As for scalar: like I said, with the fixed version it gets pretty nice performance; it is actually faster than XXH64 (although that is just a ruse, Clang generates 4 erroneous movs in the XXH64 loop, bringing it down from 10.1 GB/s).
From what I saw, that full 64-bit multiply would be terrible for 32-bit. Full 64-bit multiplies were what slowed down XXH64.
U64 mult64(U64 a, U64 b)
{
    return a * b;
}
mult64: # @mult64
push esi
mov ecx, dword ptr [esp + 16]
mov esi, dword ptr [esp + 8]
mov eax, ecx
imul ecx, dword ptr [esp + 12]
mul esi
imul esi, dword ptr [esp + 20]
add edx, ecx
add edx, esi
pop esi
ret
mult64:
push {r11, lr}
umull r12, lr, r2, r0
mla r1, r2, r1, lr
mla r1, r3, r0, r1
mov r0, r12
pop {r11, pc}
(That is two 32-bit multiplies, one 32->64-bit multiply, and two 32-bit adds).
With my version, code gen doesn't seem to be that bad on all platforms.
All x86_64 chips have SSE2 and all aarch64 chips have NEON, and with my fixes, XXH3 is very fast even without SSE2.
By the way, what do you think about my current version? I made a few (mostly aesthetic) changes:
static U64 XXH3_mul128_fold64(U64 ll1, U64 ll2)
{
    __uint128_t lll = (__uint128_t)ll1 * ll2;
    return (U64)lll ^ (U64)(lll >> 64);
}
Yes, I like your changes @easyaspi314, they help readability, it's a great positive.
Oh wait, I saw, you did a 64 by 32 multiply.
As for your code, x86_64 and aarch64 are much smaller, but 32-bit doesn't see as much of a difference.
However, the one I was using was faster on both i386 and x86_64-scalar:
32-bit:
My version (clang -m32 -O3 -mno-sse2):
./xxhsum 0.7.0 (32-bits i386 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32 : 102400 -> 34976 it/s ( 3415.6 MB/s)
XXH32 unaligned : 102400 -> 34427 it/s ( 3362.0 MB/s)
XXH64 : 102400 -> 13591 it/s ( 1327.2 MB/s)
XXH64 unaligned : 102400 -> 13418 it/s ( 1310.3 MB/s)
XXH3_64bits : 102400 -> 29382 it/s ( 2869.4 MB/s)
XXH3_64b unaligned : 102400 -> 29458 it/s ( 2876.7 MB/s)
Your new version:
./xxhsum 0.7.0 (32-bits i386 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32 : 102400 -> 34746 it/s ( 3393.2 MB/s)
XXH32 unaligned : 102400 -> 34149 it/s ( 3334.9 MB/s)
XXH64 : 102400 -> 13632 it/s ( 1331.3 MB/s)
XXH64 unaligned : 102400 -> 13470 it/s ( 1315.4 MB/s)
XXH3_64bits : 102400 -> 25121 it/s ( 2453.2 MB/s)
XXH3_64b unaligned : 102400 -> 24559 it/s ( 2398.4 MB/s)
64-bit
My version (clang -mno-sse2; it says SSE2 because clang messes up the benchmark without it):
./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32 : 102400 -> 42382 it/s ( 4138.9 MB/s)
XXH32 unaligned : 102400 -> 42508 it/s ( 4151.1 MB/s)
XXH64 : 102400 -> 68385 it/s ( 6678.3 MB/s)
XXH64 unaligned : 102400 -> 67336 it/s ( 6575.8 MB/s)
XXH3_64bits : 102400 -> 60443 it/s ( 5902.6 MB/s)
XXH3_64b unaligned : 102400 -> 59623 it/s ( 5822.5 MB/s)
Your version:
./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32 : 102400 -> 41963 it/s ( 4098.0 MB/s)
XXH32 unaligned : 102400 -> 42648 it/s ( 4164.9 MB/s)
XXH64 : 102400 -> 68673 it/s ( 6706.3 MB/s)
XXH64 unaligned : 102400 -> 67688 it/s ( 6610.1 MB/s)
XXH3_64bits : 102400 -> 48290 it/s ( 4715.8 MB/s)
XXH3_64b unaligned : 102400 -> 46441 it/s ( 4535.3 MB/s)
((My laptop was a little warm so it wasn't going into Turbo Boost)
Wait, clang was just not unrolling the loop. The result is a tiny bit faster.
32-bit:
XXH3_64bits : 102400 -> 30095 it/s ( 2939.0 MB/s)
XXH3_64b unaligned : 102400 -> 30123 it/s ( 2941.7 MB/s)
64-bit
XXH3_64bits : 102400 -> 63576 it/s ( 6208.6 MB/s)
XXH3_64b unaligned : 102400 -> 62257 it/s ( 6079.7 MB/s)
What are the differences between the versions tested? Is it limited to the scrambler? I'm not completely sure what is being measured.
I would expect most of the speed difference to come from the new accumulator, which I already acknowledged is a much faster variant. But that doesn't tell us much about the scrambler. I would expect the speed difference attributed to the scrambler to be pretty small, if measurable at all.
P.S.: I made some tests by integrating your new accumulator, comparing the new and the old scrambler. Differences are small but measurable:
variant | old | new | diff |
---|---|---|---|
clang x64 | 17100 MB/s | 17700 MB/s | + |
clang -m32 | 17100 MB/s | 17000 MB/s | - |
clang -m32 -no-sse2 | 4900 MB/s | 5050 MB/s | + |
gcc x64 | 9550 MB/s | 9650 MB/s | + |
gcc -m32 | 3650 MB/s | 3600 MB/s | - |
gcc -m32 -no-sse2 | 3600 MB/s | 3600 MB/s | = |
Yeah, the first version was my working tree vs your tree, but that last one was on my tree, differing only in the scrambler.
Clang was notably bitching about not being able to unroll the loop… Yes, the same Clang that does this (it will unroll thousands of times, it has no limit)
Also, for some reason, my laptop was warm because there were rogue dotnet benchmarks running in the background, and the processes were keeping my CPU out of Turbo Boost.
why
Also, does GCC have SSE2 enabled? Try -msse2. I knew GCC was bad, but I didn't know it was that bad.
Well, turning on -msse2 with -m32 on gcc is actually even worse: speed goes down to 1100 MB/s.
Whaaaaat?
~/xxh3 $ gcc-8 -m32 -mno-sse3 -msse2 -DXXH_VECTOR=0 -c -O3 xxhash.c
~/xxh3 $ gcc-8 -m32 -mno-sse3 -msse2 -DXXH_VECTOR=0 -c -O3 xxhsum.c
~/xxh3 $ clang *.o -m32 -o xxhsum
ld: warning: The i386 architecture is deprecated for macOS (remove from the Xcode build setting: ARCHS)
ld: warning: ignoring file /usr/local/Cellar/llvm/8.0.0/lib/clang/8.0.0/lib/darwin/libclang_rt.osx.a, missing required architecture i386 in file /usr/local/Cellar/llvm/8.0.0/lib/clang/8.0.0/lib/darwin/libclang_rt.osx.a (2 slices)
~/xxh3 $ ./xxhsum -b
./xxhsum 0.7.0 (32-bits i386 + SSE2 little endian), GCC 8.3.0, by Yann Collet
Sample of 100 KB...
XXH32 : 102400 -> 53042 it/s ( 5179.9 MB/s)
XXH32 unaligned : 102400 -> 52267 it/s ( 5104.2 MB/s)
XXH64 : 102400 -> 16333 it/s ( 1595.0 MB/s)
XXH64 unaligned : 102400 -> 16407 it/s ( 1602.2 MB/s)
XXH3_64bits : 102400 -> 8526 it/s ( 832.6 MB/s)
XXH3_64b unaligned : 102400 -> 8584 it/s ( 838.3 MB/s)
(sorry about the AT&T syntax, GCC for macOS doesn't support Intel syntax)
Not a huge deal, because GCC will be using the SSE2 path in that case anyway, so this isn't too much to worry about.
Wait, I figured out the difference: I passed kKey as a pointer to XXH3_hashLong (so I didn't have duped keys). Passing kKey as a pointer gets 3.2 GB/s.
Also, I found my ancient P3 laptop, but I can't seem to find the power cord. It's one of those things where you find it when you are looking for something else but not when you are looking for it. 😒
I guess this is the case I had before with GCC for Thumb. GCC figures out it can optimize a constant and does it in the absolute worst way possible.
I think I'm gonna call it "constant propagurgitation".
https://github.com/easyaspi314/xxhash/tree/multitarget
I uploaded my changes. I have some of the updated algorithm, the dispatcher, updated scalar code, and a few other goodies. I'm not opening a PR until you tell me how that new code works out.
Also, I need to fix these god damn ugly merge conflicts. 😒
#if !defined(XXH3_TARGET_C) && defined(__GNUC__)
__attribute__((__constructor__))
#endif
static void
XXH3_featureTest(void)
{
int max, data[4];
/* First, get how many CPUID function parameters there are by calling CPUID with eax = 0. */
XXH_CPUID(data, /* eax */ 0);
max = data[0];
/* AVX2 is on the Extended Features page (eax = 7, ecx = 0), on bit 5 of ebx. */
if (max >= 7) {
XXH_CPUIDEX(data, /* eax */ 7, /* ecx */ 0);
if (data[1] & (1 << 5)) {
cpu_mode = XXH_CPU_MODE_AVX2;
return;
}
}
<<<<<<< HEAD
/* SSE2 is on the Processor Info and Feature Bits page (eax = 1), on bit 26 of edx. */
if (max >= 1) {
XXH_CPUID(data, /* eax */ 1);
if (data[3] & (1 << 26)) {
cpu_mode = XXH_CPU_MODE_SSE2;
return;
}
=======
#else /* scalar variant of Accumulator - universal */
U64* const xacc = (U64*) acc; /* presumed aligned */
const U32* const xdata = (const U32*) data;
const U32* const xkey = (const U32*) key;
int i;
for (i=0; i < (int)ACC_NB; i++) {
int const left = 2*i;
int const right= 2*i + 1;
U32 const dataLeft = XXH_readLE32(xdata + left);
U32 const dataRight = XXH_readLE32(xdata + right);
xacc[i] += XXH_mult32to64(dataLeft ^ xkey[left], dataRight ^ xkey[right]);
xacc[i] += dataLeft + ((U64)dataRight << 32);
>>>>>>> 7efa77614832b96eb78a286b3607a97bfa188276
}
/* Must be scalar. */
cpu_mode = XXH_CPU_MODE_SCALAR;
}
~/xxh3-fix $ ./xxhsum -b -m2
./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32 : 102400 -> 49052 it/s ( 4790.3 MB/s)
XXH32 unaligned : 102400 -> 52240 it/s ( 5101.5 MB/s)
XXH64 : 102400 -> 72225 it/s ( 7053.2 MB/s)
XXH64 unaligned : 102400 -> 72505 it/s ( 7080.6 MB/s)
Illegal instruction: 4 102400 ->
~/xxh3-fix $ ./xxhsum -b -m1
./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32 : 102400 -> 50042 it/s ( 4886.9 MB/s)
XXH32 unaligned : 102400 -> 49664 it/s ( 4850.0 MB/s)
XXH64 : 102400 -> 75342 it/s ( 7357.6 MB/s)
XXH64 unaligned : 102400 -> 72363 it/s ( 7066.7 MB/s)
XXH3_64bits : 102400 -> 168694 it/s (16474.0 MB/s)
XXH3_64b unaligned : 102400 -> 172788 it/s (16873.8 MB/s)
~/xxh3-fix $ ./xxhsum -b -m0
./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32 : 102400 -> 50954 it/s ( 4976.0 MB/s)
XXH32 unaligned : 102400 -> 49851 it/s ( 4868.3 MB/s)
XXH64 : 102400 -> 75145 it/s ( 7338.4 MB/s)
XXH64 unaligned : 102400 -> 71043 it/s ( 6937.8 MB/s)
XXH3_64bits : 102400 -> 87310 it/s ( 8526.4 MB/s)
XXH3_64b unaligned : 102400 -> 85418 it/s ( 8341.6 MB/s)
I added the -m switch to xxhsum, which selects the scalar, SSE2, or AVX2 code path depending on whether it is 0, 1, or 2, respectively.
As you can see, -m2 crashes because it tries to use AVX2; -m1 gives me the best performance, which in my case is the SSE2 version; and -m0 only gives me decent performance because it is the scalar code (no, it's not really faster than XXH64, that apparent win is the Clang bug hurting XXH64).
EDIT: In order to use the dispatching, you need to do make MULTI_TARGET=1.
For MSVC, if someone could try the multitargeting code, this should compile it:
> cl.exe -O2 -c xxhash.c -DXXH_MULTI_TARGET
> cl.exe -O2 -c xxh3-target.c -DXXH_MULTI_TARGET -DXXH_VECTOR=0 -o xxh3-scalar.obj
> cl.exe -O2 -c xxh3-target.c -DXXH_MULTI_TARGET -DXXH_VECTOR=1 -arch:SSE2 -o xxh3-sse2.obj
> cl.exe -O2 -c xxh3-target.c -DXXH_MULTI_TARGET -DXXH_VECTOR=2 -arch:AVX2 -o xxh3-avx2.obj
> cl.exe -O2 -c xxhsum.c -DXXH_MULTI_TARGET
> cl.exe xxhash.obj xxhsum.obj xxh3-scalar.obj xxh3-sse2.obj xxh3-avx2.obj -o xxhsum.exe
Untested for now, my sister had the laptop all day.
I fixed the code for MSVC, and here are the correct commands:
> cl.exe -O2 -c xxhash.c -DXXH_MULTI_TARGET=1
> cl.exe -O2 -c xxh3-target.c -DXXH_MULTI_TARGET=1 -DXXH_VECTOR=0 -Foxxh3-scalar.obj
> cl.exe -O2 -c xxh3-target.c -DXXH_MULTI_TARGET=1 -DXXH_VECTOR=1 -arch:SSE2 -Foxxh3-sse2.obj
> cl.exe -O2 -c xxh3-target.c -DXXH_MULTI_TARGET=1 -DXXH_VECTOR=2 -arch:AVX2 -Foxxh3-avx2.obj
> cl.exe -O2 -c xxhsum.c -DXXH_MULTI_TARGET=1
> cl.exe -O2 xxhsum.obj xxhash.obj xxh3-scalar.obj xxh3-sse2.obj xxh3-avx2.obj
MSVC stands true to its name: MSVC Sucks at Vectorizing Code. The AVX2 code is not much faster than the SSE2 code, while even GCC's version had a notable speedup with AVX2. I think it is not unrolling the loop.
Edit: MSVC for x64 will warn about -arch:SSE2, since that flag only exists for 32-bit x86. You can ignore it, though.
FYI, regarding the scrambler: we'll keep the new scrambler formula, as it has higher quality.
The new scrambler is fully bijective, while the previous one incurred a little space reduction for both variants (^ and +; + was noticeably better, but not fully bijective).
Note that XXH3 nonetheless met the 64-bit collision rate objectives even with the old scrambler, likely because the scrambler stage is "rare" (once per KB), and the final stage involves mixing 8 accumulators (512 bits -> 64 bits), so even if space reduction started a bit earlier, it's no longer a big deal after that final mixing.
But I nonetheless value the bijective property for the scrambler stage; from a collision rate perspective, it carries a higher quality tag.
When you're done tuning, could you update the test values for me so I know we are on the right page?
Sure, I will.
The xxh3 branch will be merged when XXH3_64bits() gets ready; it will contain all associated test code.
There will still be some work to do on XXH128(), but that part can be done separately.
Update on NEON32: Yeah, pragma clang loop unroll is not good on ARM.
I think it has to do with the tiny 64 byte L1 cache line. (And 32 KB instruction cache total)
unroll_count(2) gives me the best performance.
unroll_count(1) (no unroll): 7.4
unroll_count(2): 8.1
unroll_count(4): 6.2
unroll(enable): 5.2
How about we hand unroll the main loop twice and tell Clang to completely unroll on x86?
#if defined(__clang__) && !defined(__OPTIMIZE_SIZE__) && (defined(__i386__) || defined(__x86_64__))
# pragma clang loop unroll(enable)
#endif
for (i = 0; i < NB_KEYS / 2; i++) {
XXH3_accumulate512(...);
XXH3_accumulate512(...);
}
Wait, I think it is because XXH3 might entirely fit in cache that way.
My MacBook also has a 64 byte cache line. :thinking:
As for scalar ARM (thumbv6t2) I am getting 2.1 GB/s right now.
I had to use inline assembly because Clang assumed that the pointer was aligned and caused a bus error.
[termux ~/xxh3] $ ./xxhsum -b
./xxhsum 0.7.0 (32-bits arm little endian), Clang 7.0.1 (tags/RELEASE_701/final), by Yann Collet
Sample of 100 KB...
XXH32 : 102400 -> 20235 it/s ( 1976.1 MB/s)
XXH32 unaligned : 102400 -> 18670 it/s ( 1823.2 MB/s)
XXH64 : 102400 -> 8646 it/s ( 844.3 MB/s)
XXH64 unaligned : 102400 -> 8363 it/s ( 816.7 MB/s)
XXH3_64bits : 102400 -> 21500 it/s ( 2099.6 MB/s)
XXH3_64b unaligned : 102400 -> 21515 it/s ( 2101.0 MB/s)
(Edit: This is on my G3)
> Clang assumed that the pointer was aligned and caused a bus error

Does that mean the scalar code path needs to be fixed? It shouldn't give the compiler a chance to make that mistake.
> How about we hand unroll the main loop twice

A side effect is that it would add a restriction on custom vector key size. That's likely an acceptable side effect.

> Does that mean the scalar code path needs to be fixed? It shouldn't give the compiler a chance to make that mistake.
XXH_read32 and XXH_read64 need to be fixed when targeting ARMv6. The scalar code itself appears to be fine.
Clang (edit: and GCC) are adding ldmib instructions to XXH32 when targeting ARMv6t2, and ldmib requires a 32-bit aligned address.
.LBB1_3:
ldr r1, [r6]
ldmib r6, {r0, r7} // <- aaaaaa
ldr r4, [r6, #12]
mla r0, r0, r9, r3
mla r2, r7, r9, r2
mla r4, r4, r9, lr
ror r12, r0, #19
mla r0, r1, r9, r5
ror r7, r2, #19
ror r4, r4, #19
ror r1, r0, #19
mul lr, r4, r10
mul r2, r7, r10
mul r3, r12, r10
mul r5, r1, r10
add r6, r6, #16
cmp r6, r8
blo .LBB1_3
I'm trying to figure out how to fix this without inline assembly right now, as if I use inline assembly, it generates extra add instructions.
.LBB1_3:
ldr r1, [r0]
mla r1, r1, r10, r7
ror r12, r1, #19
add r1, r0, #12 @ <-
ldr r1, [r1]
mul r7, r12, r3
mla r1, r1, r10, r4
ror r6, r1, #19
add r1, r0, #8 @ <-
ldr r1, [r1]
mul r4, r6, r3
mla r1, r1, r10, r2
ror lr, r1, #19
add r1, r0, #4 @ <-
ldr r1, [r1]
mul r2, lr, r3
mla r1, r1, r10, r5
add r0, r0, #16
cmp r0, r9
ror r8, r1, #19
mul r5, r8, r3
blo .LBB1_3
What does seem to work is to use memcpy with the compiler flag -munaligned-access. That reliably disables the evil ldmib instructions. However, without -munaligned-access, it will always do byte-by-byte access.
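In other words, something like this sketch (using stdint types here; xxhash's own XXH_read32 would be the real place for it):

#include <stdint.h>
#include <string.h>

/* Sketch of the memcpy-based unaligned read. With -munaligned-access on
 * ARMv6+, compilers can lower this to a single ldr; without the flag it
 * falls back to safe byte-by-byte loads. */
static uint32_t XXH_read32_memcpy(const void* ptr)
{
    uint32_t val;
    memcpy(&val, ptr, sizeof(val));
    return val;
}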
I'll figure it out, don't hurt your head with it.
The alignment memes strike again. 🤦♂️
By the way, does this look correct so far?
> XXH_read32 and XXH_read64 need to be fixed when targeting ARMv6.
xxhash is supposed to make the right choice automatically, depending on target. It could be that the set of macros is not correct for ARMv6.
One way to "help" it is to define XXH_FORCE_MEMORY_ACCESS at compilation time. If one of the values works, then it becomes a matter of finding the right set of macros for the automatic detection to work properly on this target.
> does this look correct so far?
There are minor changes that happened in the last 2 weeks. I would prefer to leave comments directly in the code, that would make it simpler for you to follow. But that doesn't work on the link provided directly. I guess I'll have to chase the corresponding commit.
Side note: This:
( defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) \
|| defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_6Z__) \
|| defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__) )
( defined(__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__) \
|| defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7M__) \
|| defined(__ARM_ARCH_7S__) )
could easily be simplified to this:
(defined(__ARM_ARCH) && __ARM_ARCH == 6)
(defined(__ARM_ARCH) && __ARM_ARCH >= 7)
That would pick up on ARMv7VE (Cortex-A15) and aarch64, and it would look a lot nicer.
Anyways, I'll see what I can do about the memory access. I'm thinking just memcpy. Once you add -munaligned-access to the flags, the code speeds up a lot, and without it, it is completely safe from alignment memes; while it isn't perfect, it is going to be good enough at least for now.
The only common ARMv6 device which is actually being used (and sold!) is the Nintendo 3DS. Nintendo just has a hardcore fetish for obsolete hardware and not giving up on dead consoles, and they have always been a pain.
By the way, I cleaned up my previous attempt at the folded 128-bit multiply for my clean version, this is it.
It is slightly faster than yours on Clang and significantly faster on GCC.
uint64_t mult_hd(uint64_t const lhs, uint64_t const rhs)
{
    uint64_t const lo_lo = (lhs & 0xFFFFFFFF) * (rhs & 0xFFFFFFFF);
    uint64_t const hi_lo = ((lhs >> 32) * (rhs & 0xFFFFFFFF)) + (lo_lo >> 32);
    uint64_t const lo_hi = ((lhs & 0xFFFFFFFF) * (rhs >> 32)) + (hi_lo & 0xFFFFFFFF);
    uint64_t const product_hi = ((lhs >> 32) * (rhs >> 32)) + (lo_hi >> 32) + (hi_lo >> 32);
    uint64_t const product_lo = (lo_lo & 0xFFFFFFFF) | (lo_hi << 32);
    return product_hi ^ product_lo;
}
gcc 8.3.0 -m32 -O3 -march=i686
mult_yann: 3.083638 5278c6468d75cb00
mult_hd: 2.321105 5278c6468d75cb00
mult_njuffa: 2.559749 5278c6468d75cb00
mult_accu: 2.511064 5278c6468d75cb00
mult_code_project: 2.418673 5278c6468d75cb00
mul_botan: 2.533887 5278c6468d75cb00
clang 8.0.0 -O3 -m32 -march=i686
mult_yann: 2.124179 7be8a9a01751f680
mult_hd: 2.022154 7be8a9a01751f680
mult_njuffa: 2.300277 7be8a9a01751f680
mult_accu: 1.980426 7be8a9a01751f680
mult_code_project: 1.974299 7be8a9a01751f680
mul_botan: 2.270755 7be8a9a01751f680
Note: mult_code_project, mult_accu, and mult_hd all generate the same assembly from Clang, mainly because they are all based on the same code:
_mult_hd: ## @mult_hd
## %bb.0:
push ebp
push ebx
push edi
push esi
push eax
mov ecx, dword ptr [esp + 32]
mov ebx, dword ptr [esp + 36]
mov esi, dword ptr [esp + 24]
mov eax, ecx
mul esi
mov edi, edx
mov dword ptr [esp], eax ## 4-byte Spill
mov eax, ecx
mul dword ptr [esp + 28]
mov ecx, edx
mov ebp, eax
mov eax, ebx
mul esi
mov esi, edx
mov ebx, eax
mov eax, dword ptr [esp + 36]
mul dword ptr [esp + 28]
add ebp, edi
adc eax, ecx
adc edx, 0
add ebp, ebx
adc eax, esi
adc edx, 0
xor eax, dword ptr [esp] ## 4-byte Folded Reload
xor edx, ebp
add esp, 4
pop esi
pop edi
pop ebx
pop ebp
ret
Unfortunately, GCC generates considerably worse code on all of them, but this remains the best version:
_mult_hd:
sub esp, 44
mov dword ptr [esp + 28], ebx
mov ecx, dword ptr [esp + 48]
mov dword ptr [esp + 40], ebp
mov ebx, dword ptr [esp + 56]
mov dword ptr [esp + 36], edi
mov ebp, dword ptr [esp + 52]
mov dword ptr [esp + 32], esi
mov eax, ecx
mul ebx
mov dword ptr [esp + 4], edx
mov dword ptr [esp], eax
mov edi, dword ptr [esp + 4]
mov eax, ebx
mul ebp
mov esi, edi
xor edi, edi
add esi, eax
mov eax, ecx
adc edi, edx
mul dword ptr [esp + 60]
mov ecx, eax
mov ebx, edx
mov eax, ebp
xor edx, edx
add ecx, esi
mov dword ptr [esp + 8], ecx
adc ebx, edx
xor esi, esi
mul dword ptr [esp + 60]
mov dword ptr [esp + 12], ebx
mov ebx, dword ptr [esp]
mov ebp, dword ptr [esp + 12]
mov ecx, dword ptr [esp + 8]
mov dword ptr [esp + 16], eax
mov eax, edi
add eax, dword ptr [esp + 16]
mov dword ptr [esp + 20], edx
mov edx, esi
mov edi, ebp
adc edx, dword ptr [esp + 20]
mov esi, edi
xor ebp, ebp
add eax, esi
mov esi, dword ptr [esp + 32]
mov edi, eax
mov eax, ebx
mov ebx, dword ptr [esp + 28]
adc edx, ebp
xor eax, edi
mov ebp, dword ptr [esp + 40]
xor edx, ecx
mov edi, dword ptr [esp + 36]
add esp, 44
ret
A lot of people from the GNU community ask why I dislike GCC a lot.
I personally believe I have some valid reasons.
We can certainly make those changes, they are nice improvements.
As for documentation, here is my idea for XXH3_accumulate_512.
/* This is the main mixer for large data blocks. This is written in SIMD code if possible.
 *
 * acc = ((data \oplus key) \bmod 2^{32}) \cdot \lfloor (data \oplus key) / 2^{32} \rfloor + data + acc
 *
 * This is a modified version of the long multiply method used in UMAC and FARSH.
 * The changes are that data and key are xored instead of added, and the original data
 * is directly added to the mix after the multiply to prevent multiply-by-zero issues. */
What are your thoughts on putting LaTeX in the comments? Wikipedia has explained some of them that way.
It expands to this:
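Roughly, per accumulator lane i (with 64-bit wrapping arithmetic):

$$ acc_i \mathrel{+}= \left((\mathrm{data}_i \oplus \mathrm{key}_i) \bmod 2^{32}\right) \cdot \left\lfloor \frac{\mathrm{data}_i \oplus \mathrm{key}_i}{2^{32}} \right\rfloor + \mathrm{data}_i $$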
This is going to be a tracker for discussion, questions, feedback, and analyses about the new XXH3 hashes, found in the xxh3 branch.

@Cyan4973's comments (from xxhash.h):

XXH3 is a new hash algorithm, featuring vastly improved speed performance for both small and large inputs. A full speed analysis will be published; it requires a lot more space than this comment can handle. In general, expect XXH3 to run about ~2x faster on large inputs, and >3x faster on small ones, though the exact difference depends on the platform.

The algorithm is portable and will generate the same hash on all platforms. It benefits greatly from vectorization units, but does not require them.

XXH3 offers 2 variants, _64bits and _128bits. The first 64-bit field of the _128bits variant is the same as the _64bits result. However, if only 64 bits are needed, prefer calling the _64bits variant. It reduces the amount of mixing, resulting in faster speed on small inputs.

The XXH3 algorithm is still considered experimental. It's possible to use it for ephemeral data, but avoid storing long-term values for later re-use. While labelled experimental, the produced result can still change between versions.

The API currently supports one-shot hashing only. The full version will include streaming capability and canonical representation. Long-term optional features may include custom secret keys and secret key generation.

There are still a number of open questions that the community can influence during the experimental period. I'm trying to list a few of them below, though don't consider this list as complete.
XXH64()
(aka big-endian).XXH32
/XXH64
, but may be more natural for little-endian platforms.XXH128_hash_t
?XXH128_hash_t
which would be desirable ?_128bits
variant is the same as the result of_64bits
.XXH128_hash_t
, in ways which may block other possibilities.doubleSeed
).XXH128
?len==0
: Currently, the result of hashing a zero-length input is the seed.XXH32
/XXH64
).