Closed tkoenig1 closed 1 year ago
Division by 13 is Issue #30.
Ah, I overlooked the carry.
carry r5,{O,-,IO}
add r1,r1,-r6
mov r7,#0
add r2,r2,-r7
could be
carry r5,{O}
add r1,r1,-r6
add r2,r2,r5
since r7 isn't reused (unless I misunderstand carry, which is entirely possible).
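To make the role of the second add concrete, here is a small C model of subtracting the remainder from a 128-bit value held in two 64-bit halves. It is only a sketch: it assumes the carry-shadowed pair behaves like an ordinary subtract-with-borrow chain (my reading of the code, not something taken from the My 66000 manual), and the names u128_halves and sub_small are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: a 128-bit value as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128_halves;

/* Subtract a small value r from the 128-bit number {hi, lo}.
   The borrow out of the low half has to reach the high half,
   which is why the second add cannot simply be dropped. */
u128_halves sub_small(u128_halves n, uint64_t r)
{
    u128_halves out;
    uint64_t borrow = n.lo < r;   /* 1 if the low-half subtract wraps */
    out.lo = n.lo - r;
    out.hi = n.hi - borrow;
    return out;
}
```

For example, n = 2^64 has remainder 3 modulo 13, so the low half (0) is smaller than the remainder and the high half must be decremented.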
What is MASK60? Are a,b,c,r really just 32-bits?
To answer the second question first: No, they are not. That was an error introduced by a change I made because, for some reason, the original version failed to compile due to some header mismatch. As for the first one, I didn't post the complete program.
Here is the complete test case:
#include <stdint.h>
typedef __uint128_t mytype;
#define ONE ((__uint128_t) 1)
#define TWO_64 (ONE << 64)
#define MASK60 ((1ul << 60) - 1)
void
div_rem_13 (mytype n, mytype *div, unsigned int *rem)
{
  const mytype magic = TWO_64 * 14189803133622732012u + 5675921253449092805u * ONE; /* 0xC4EC4EC4EC4EC4EC4EC4EC4EC4EC4EC5 */
  __uint64_t a, b, c;
  unsigned int r;
  a = n & MASK60;
  b = (n >> 60);
  b = b & MASK60;
  c = (n >> 120);
  r = (a+b+c) % 13;
  n = n - r;
  *div = n * magic;
  *rem = r;
}
Assembly is
div_rem_13: ; @div_rem_13
; %bb.0: ; %entry
srl r5,r1,<60:0>
sll r6,r2,<0:4>
srl r7,r1,<0:60>
or r6,r7,r6
and r6,r6,#1152921504606846975
srl r7,r2,<0:56>
add r5,r5,r7
add r5,r5,r6
carry r6,{O}
div r5,r5,#13
carry r5,{O,-,IO}
add r1,r1,-r6
mov r7,#0
add r2,r2,-r7
mov r5,#5675921253449092805
carry r7,{O}
mul r5,r1,r5
mul r1,r1,#-4256940940086819604
add r1,r7,r1
mul r2,r2,#5675921253449092805
add r1,r1,r2
std r5,[r3]
std r1,[r3,8]
stw r6,[r4]
ret
As an explanation of what this does: This first calculates the remainder of the division by 13 by taking advantage of the fact that 2^60 mod 13 = 1, and then uses the algorithm from Hacker's Delight for division when the remainder is known to be zero.
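The two steps can be checked separately against the compiler's own 128-bit division. The following is a minimal sketch, not a replacement for the test case: the helper names rem13 and exact_div13 are invented here, and it relies on the __uint128_t extension of GCC and Clang just like the original.

```c
#include <assert.h>
#include <stdint.h>

typedef __uint128_t u128;   /* GCC/Clang extension, as in the test case */

#define MASK60 ((UINT64_C(1) << 60) - 1)

/* Step 1: n = a + b*2^60 + c*2^120 and 2^60 mod 13 == 1,
   so n mod 13 == (a + b + c) mod 13. */
unsigned rem13(u128 n)
{
    uint64_t a = (uint64_t)(n & MASK60);
    uint64_t b = (uint64_t)((n >> 60) & MASK60);
    uint64_t c = (uint64_t)(n >> 120);      /* only 8 bits remain */
    return (unsigned)((a + b + c) % 13);    /* sum < 2^62, no overflow */
}

/* Step 2: if m is a multiple of 13, then m / 13 == m * inv (mod 2^128),
   where inv is the multiplicative inverse of 13 modulo 2^128. */
u128 exact_div13(u128 m)
{
    const u128 inv = ((u128)UINT64_C(0xC4EC4EC4EC4EC4EC) << 64)
                   | UINT64_C(0x4EC4EC4EC4EC4EC5);
    return m * inv;
}
```

Calling exact_div13(n - rem13(n)) should then agree with n / 13 for any n, because n - rem13(n) is a multiple of 13 and 13 * inv wraps around to 1 in 128-bit arithmetic.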
There seem to be (at least) three opportunities for improvement here.
Commit 9f1575dce18acaa0c24c19b657259cd769c3962f makes things better but not perfect.
Try again with commit af7ee91ac91db0ac489c686eb35daf15ceb1f32f
Code is now
srl r5,r1,<60:0>
sll r6,r2,<0:4>
srl r7,r1,<0:60>
or r6,r7,r6
srl r6,r6,<60:0>
srl r7,r2,<0:56>
add r5,r5,r7
add r5,r5,r6
carry r6,{O}
mul r7,r5,#5675921253449092805
srl r6,r6,<0:2>
mul r6,r6,#13
add r5,r5,-r6
carry r6,{O,I}
add r1,r1,-r5
add r2,r2,#0
mul r6,r1,#-4256940940086819604
carry r7,{O}
mul r1,r1,#5675921253449092805
add r6,r7,r6
mul r2,r2,#5675921253449092805
add r2,r6,r2
std r1,[r3]
std r2,[r3,8]
stw r5,[r4]
ret
which is quite good.
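As I read the new sequence, the div by 13 has become a multiply by 5675921253449092805 whose high half is shifted right by 2, followed by a multiply by 13 and a subtract to recover the remainder. A C sketch of that reading follows; the function names are hypothetical, and the mapping onto the carry/mul/srl instructions is my interpretation rather than something confirmed here.

```c
#include <assert.h>
#include <stdint.h>

/* x / 13 for any 64-bit x: take the high 64 bits of x * ceil(2^66 / 13)
   and shift right by 2. This works because 13 * magic == 2^66 + 1. */
uint64_t div13_mulhi(uint64_t x)
{
    const uint64_t magic = UINT64_C(5675921253449092805); /* ceil(2^66 / 13) */
    uint64_t hi = (uint64_t)(((__uint128_t)x * magic) >> 64);
    return hi >> 2;
}

/* x % 13 recovered as x - 13 * (x / 13). */
unsigned rem13_mulhi(uint64_t x)
{
    return (unsigned)(x - 13 * div13_mulhi(x));
}
```

Incidentally, since 13 * magic == 2^66 + 1, the same constant is also the inverse of 13 modulo 2^64, which would explain why it shows up both here and in the exact-division multiplies.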
If optimization for size was required, it might make sense to put #5675921253449092805 into a register. I am actually not sure how to ask llc to optimize for size; it only offers -O0 to -O3 as far as I can see.
The backends don't accept -Os or -Oz. One must use those flags with clang. Clang will then tag each function with the attribute "minsize" which is the clue to the backends.
Hm, I do not get the division with the constant 13 with either -Os or -Oz. Compile script is
#! /bin/bash
a=${1%%.[ci]}
b=${a}_s
clang -fverbose-asm -c --target=my66000 -Oz -fno-vectorize -fno-slp-vectorize -emit-llvm -fno-unroll-loops -fomit-frame-pointer $1
opt -disable-loop-unrolling -Oz --march=my66000 --frame-pointer=none --enable-vvm $a.bc > $b.bc
llc -O0 --disable-lsr --enable-predication --enable-predication2 --enable-carry-generation --early-carry-coalesce --enable-vvm -march=my66000 $b.bc
The "opt" pass seems to be removing the minsize attribute, despite the -Oz.
If I just use clang and llc, with
#! /bin/bash
a=${1%%.[ci]}
clang -fverbose-asm -c --target=my66000 -Oz -fno-vectorize -fno-slp-vectorize -emit-llvm -fno-unroll-loops -fomit-frame-pointer $1
llc -O2 --disable-lsr --enable-predication --enable-predication2 --enable-carry-generation --early-carry-coalesce --enable-vvm -march=my66000 $a.bc
I get
srl r5,r1,<60:0>
sll r6,r2,<0:4>
srl r7,r1,<0:60>
or r6,r7,r6
srl r6,r6,<60:0>
srl r7,r2,<0:56>
add r5,r5,r7
add r5,r5,r6
carry r6,{O}
div r5,r5,#13
carry r5,{O,I}
add r1,r1,-r6
add r2,r2,#0
mul r5,r1,#-4256940940086819604
carry r7,{O}
mul r1,r1,#5675921253449092805
add r5,r7,r5
mul r2,r2,#5675921253449092805
add r2,r5,r2
std r1,[r3]
std r2,[r3,8]
stw r6,[r4]
ret
which is probably as good as it is going to get.
So, closing (in gcc land, I would call this RESOLVED FIXED).
My favorite division hack for 128-bit integers.
yields
Two comments:
The instructions mov r7,#0 and add r2,r2,-r7 could be left out (r7 is overwritten due to the later carry).
And, for mov r5,#5675921253449092805 followed by mul r5,r1,r5, the constant could be moved into the mul instruction.
This is with the most recent compiler, with the non-checking version, with the usual compile script.
The Debug build is still compiling :-)