Yes, you are right @easyaspi314, and this is a known issue. I plan to add dedicated code to emulate that for 32-bit platforms (and non-gcc ones). There are already multiple implementations available, so I should be able to grab one and plug it in.
Actually, the scalar one doesn't look too bad.
GCC still hates the code and insists on using the stack (or an ugly partial vectorization for the first one), but Clang emits clean code for all of them. Clang emits the best code on ARMv7, though that is mainly because of the accumulate instructions.
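For reference, the scalar version is the usual schoolbook decomposition into four 32x32->64 multiplies, something like this (a sketch; the helper name is a placeholder):

```c
#include <stdint.h>

/* Schoolbook 64x64->128 multiply built from four 32x32->64 products.
 * Each product line is a single-instruction multiply on 32-bit targets
 * (modulo MSVC, which wants __emulu, as discussed below). */
static void mul64to128(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    uint64_t const lo_lo = (a & 0xFFFFFFFF) * (b & 0xFFFFFFFF);
    uint64_t const hi_lo = (a >> 32)        * (b & 0xFFFFFFFF);
    uint64_t const lo_hi = (a & 0xFFFFFFFF) * (b >> 32);
    uint64_t const hi_hi = (a >> 32)        * (b >> 32);

    /* accumulate the cross terms; this sum cannot overflow 64 bits */
    uint64_t const cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;

    *hi = (hi_lo >> 32) + (cross >> 32) + hi_hi;
    *lo = (cross << 32) | (lo_lo & 0xFFFFFFFF);
}
```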
`umaal` is a complete beast of an instruction, added in ARMv6:
```c
void umaal(uint32_t *RdLo, uint32_t *RdHi, uint32_t Rn, uint32_t Rm)
{
    uint64_t prod = (uint64_t)Rn * (uint64_t)Rm;
    prod += (uint64_t)*RdLo;
    prod += (uint64_t)*RdHi;
    *RdLo = prod & 0xFFFFFFFF;
    *RdHi = prod >> 32;
}
```
Here's a comparison of a few implementations I found online. https://gcc.godbolt.org/z/pWh9No
The second one might be better because GCC doesn't vectorize it and it isn't terrible on ARM (it only adds an extra instruction). On MSVC, we want to use the intrinsic instead of a cast, because MSVC is stupid and will try to do a full 64-bit multiply.
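To make the MSVC point concrete (a sketch, with made-up function names): the cast form below can tempt 32-bit MSVC into its 64-bit multiply helper, while the `__emulu` intrinsic maps to a single `mul`.

```c
#include <stdint.h>
#include <intrin.h>

/* Cast form: 32-bit MSVC may widen both operands and emit a full
 * 64x64 multiply (an __allmul call). */
uint64_t mul_cast(uint32_t a, uint32_t b)
{
    return (uint64_t)a * (uint64_t)b;
}

/* Intrinsic form: always a single 32x32->64 multiply. */
uint64_t mul_intrin(uint32_t a, uint32_t b)
{
    return __emulu(a, b);
}
```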
I've uploaded an `xxh3` branch with a new version of `xxh3.h`, which contains an updated `mul128` function for non-x64 platforms. I was able to check that it compiles and runs properly in 32-bit mode.
It also contains a dedicated path for ARM aarch64, though I believe the existing `gcc` path has higher priority and will likely produce the same result (I haven't verified the generated assembly yet). You might be able to understand this part better.
https://gcc.godbolt.org/z/PRtMJy
Boom. Money.
37-48 instructions on x86, 8-9 instructions on ARMv6+, and thanks to `__attribute__((__target__("no-sse2")))`, GCC doesn't partially vectorize it when tuning for Core 2.
I had to use inline assembly for ARM because I couldn't get Clang or GCC to figure out my request for `umaal`. Unfortunately, yeah, that mess of `#if` statements is kinda sucky.
```asm
umull r12, lr, r0, r2   @ {r12, lr} = (U64)r0 * (U64)r2
mov   r5, #0            @ r5 = 0
mov   r4, #0            @ r4 = 0
umaal r5, lr, r1, r2    @ {r5, lr} = ((U64)r1 * (U64)r2) + r5 + lr
umaal r5, r4, r0, r3    @ {r5, r4} = ((U64)r0 * (U64)r3) + r5 + r4
umaal lr, r4, r1, r3    @ {lr, r4} = ((U64)r1 * (U64)r3) + lr + r4
adds  r0, lr, r12       @ <-.
                        @    {r0, r1} = (U64){lr, r4} + (U64){r12, r5}
adc   r1, r4, r5        @ <-'
```
I don't think I can optimize it more than that. That's only 8 instructions.
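For the record, the `umaal` request itself is just a tiny inline-asm helper, roughly like this (a sketch of the idea, not the exact code from the branch):

```c
#include <stdint.h>

/* One umaal step: {*lo, *hi} = (uint64_t)a * b + *lo + *hi.
 * Neither GCC nor Clang will emit umaal from plain C, so ask directly. */
static inline void umaal_step(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    __asm__("umaal %0, %1, %2, %3"
            : "+r"(*lo), "+r"(*hi)
            : "r"(a), "r"(b));
}
```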
I added `XXH_mult32to64`, which expands to the `__emulu` intrinsic on MSVC, making it far less likely that MSVC outputs an `__allmul` call (although, to be fair, it doesn't do that in this case).
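The helper's shape is probably worth showing (a sketch; only `XXH_mult32to64` and `__emulu` are from this thread):

```c
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_IX86)
#  include <intrin.h>
   /* __emulu is a guaranteed 32x32->64 multiply on 32-bit MSVC. */
#  define XXH_mult32to64(x, y) __emulu((uint32_t)(x), (uint32_t)(y))
#else
#  define XXH_mult32to64(x, y) ((uint64_t)(uint32_t)(x) * (uint64_t)(uint32_t)(y))
#endif
```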
I still think the x86 version can be optimized more, although it might just be the fixed output registers that cause all that shuffling. It actually runs quite fast:
```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    uint32_t val32[4];
    uint64_t val[2];
    uint64_t sum = 0;
    srand(0);
    double start = (double)clock();
    for (int i = 0; i < 10000000; i++) {
        val32[0] = (uint32_t)rand();
        val32[1] = (uint32_t)rand();
        val32[2] = (uint32_t)rand();
        val32[3] = (uint32_t)rand();
        memcpy(val, val32, sizeof(val)); /* replaces the aliasing-violating pointer cast */
        sum += XXH_mul128AndFold_32(val[0], val[1]); /* function under test */
    }
    double end = (double)clock();
    printf("%llu %lfs\n", (unsigned long long)sum, (end - start) / CLOCKS_PER_SEC);
    return 0;
}
```
7652620537862933594 0.454625s
For comparison, this is in 64-bit mode with `__uint128_t`:
7652620537862933594 0.336406
Only 35% slower considering how much more work it does.
Side note: Clang optimizes the `__uint128_t` version to this on aarch64, so it is probably best to use it there. A fused multiply-add is always preferred.
```asm
umulh x8, x1, x0     // x8 = ((__uint128_t)x1 * x0) >> 64
madd  x0, x1, x0, x8 // x0 = (x1 * x0) + x8
```
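For reference, the C that produces that pair is just the 128-bit product plus an add-fold, mirroring the adds/adc at the end of the ARMv6 version (a sketch; the function name is a placeholder):

```c
#include <stdint.h>

static inline uint64_t mul128_fold64(uint64_t a, uint64_t b)
{
    __uint128_t const product = (__uint128_t)a * b;
    return (uint64_t)product + (uint64_t)(product >> 64); /* lo + hi */
}
```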
Looks great!
It's definitely slower on ARM despite using significantly fewer instructions, but that is mainly due to the `rand()` call being pretty sophisticated on Bionic.
967456348838854209 5.572055s
Replace it with uninitialised data, force it into a register, and use `-fno-inline` to keep the compiler from cheating, and it is very fast:
5764888493227865371 0.119198s
(Without the `rand` calls on x86, it is 0.076690s for my implementation vs 0.031167s for native x86_64, which makes more sense. Still pretty fast, though.)
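For reference, the "force it into a register" trick can be an empty inline-asm statement (a sketch; the helper name is made up):

```c
#include <stdint.h>

/* The empty asm forces `x` into a register and blocks constant folding,
 * so the compiler can't precompute the benchmark inputs. */
static inline uint64_t keep_in_register(uint64_t x)
{
    __asm__("" : "+r"(x));
    return x;
}
```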
I noticed that you added `__uint128_t` to the xxh3. That is not portable to 32-bit; the type is only defined on 64-bit, and as you see below, for a pretty decent reason. I suggest this:
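Something along these lines, keyed off `__SIZEOF_INT128__` (a sketch; `lhs`/`rhs` are placeholders, and GCC and Clang define that macro exactly when the type is available):

```c
/* Only take the __uint128_t path when the compiler actually has the type. */
#if defined(__SIZEOF_INT128__)
    __uint128_t const product = (__uint128_t)lhs * rhs;  /* 64x64->128 in one step */
    /* ... */
#else
    /* 32-bit fallback: build the 128-bit product from 32x32->64 multiplies */
#endif
```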
However, I do want to mention that GCC doesn't really like this code, and it half-vectorizes it. I'm trying out a potential SSE2/NEON32 version, though, which might turn out faster or slower. The first part I got from MSDN, but the second part confuses me.
If I do it with vector extensions, Clang gives me this:
However, in the middle it switches to scalar. I kinda wish I had NEON's half registers…
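For context, the vector-extensions formulation is roughly this (a sketch; the function name is a placeholder, and the masking is what lets Clang recognize the widening multiply):

```c
#include <stdint.h>

typedef uint64_t U64x2 __attribute__((vector_size(16)));

/* Widening 32x32->64 multiply on the low half of each 64-bit lane.
 * Clang can lower this masked pattern to pmuludq on SSE2. */
static U64x2 mult32to64_vec(U64x2 a, U64x2 b)
{
    return (a & 0xFFFFFFFF) * (b & 0xFFFFFFFF);
}
```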