Loading register + operation

tkoenig1 commented 1 year ago

My favorite division hack for 128-bit integers.

void
div_rem_13 (__uint128_t n, __uint128_t *div, unsigned int *rem)
{
  const __uint128_t magic = TWO_64 * 14189803133622732012u + 5675921253449092805u * ONE; /* 0xC4EC4EC4EC4EC4EC4EC4EC4EC4EC4EC5 */
  unsigned int a, b, c;
  unsigned int r;

  a = n & MASK60;
  b = (n >> 60);
  b = b & MASK60;
  c = (n >> 120);
  r = (a+b+c) % 13;
  n = n - r;
  *div = n * magic;
  *rem = r;
}

yields

div_rem_13:                             ; @div_rem_13
; %bb.0:
        sll     r5,r2,<0:4>
        srl     r6,r1,<0:60>
        or      r5,r6,r5
        srl     r6,r2,<0:56>
        add     r5,r5,r1
        add     r5,r5,r6
        srl     r5,r5,<32:0>
        carry   r6,{O}
        div     r5,r5,#13
        carry   r5,{O,-,IO}
        add     r1,r1,-r6
        mov     r7,#0
        add     r2,r2,-r7
        mov     r5,#5675921253449092805
        carry   r7,{O}
        mul     r5,r1,r5
        mul     r1,r1,#-4256940940086819604
        add     r1,r7,r1
        mul     r2,r2,#5675921253449092805
        add     r1,r1,r2
        std     r5,[r3]
        std     r1,[r3,8]
        stw     r6,[r4]
        ret

Two comments:

The instructions

        mov     r7,#0
        add     r2,r2,-r7

could be left out (r7 is overwritten due to the later carry).

And, for

        mov     r5,#5675921253449092805
        carry   r7,{O}
        mul     r5,r1,r5

the constant could be moved into the mul instruction.

This is with the most recent compiler, with the non-checking version, with the usual compile script

a=${1%%.[ci]}
b=${a}_opt
clang -fverbose-asm -c --target=my66000 -O3 -fno-vectorize -fno-slp-vectorize  -emit-llvm -fno-unroll-loops $1
opt  -disable-loop-unrolling -O3  --march=my66000 --frame-pointer=none --enable-vvm $a.bc  > $b.bc
llc --disable-lsr --enable-predication --enable-predication2 --enable-carry-generation --early-carry-coalesce --enable-vvm -march=my66000 $b.bc

The Debug build is still compiling :-)

tkoenig1 commented 1 year ago

Division by 13 is Issue #30 .

tkoenig1 commented 1 year ago

Ah, I overlooked the carry.

        carry   r5,{O,-,IO}
        add     r1,r1,-r6
        mov     r7,#0
        add     r2,r2,-r7

could be

        carry   r5,{O}
        add     r1,r1,-r6
        add     r2,r2,r5

since R7 isn't reused (unless I misunderstand carry, which is entirely possible).

bagel99 commented 1 year ago

What is MASK60? Are a,b,c,r really just 32-bits?

tkoenig1 commented 1 year ago

What is MASK60? Are a,b,c,r really just 32-bits?

The answer the second question first: No, they are not. Change error because, for some reason, the original version failed compilation due to some header mismatch. And as for the first one, I didn't post the complete program.

Here is the complete test case:

#include <stdint.h>
typedef __uint128_t mytype;

#define ONE ((__uint128_t) 1)
#define TWO_64 (ONE << 64)
#define MASK60 ((1ul << 60) - 1)

void
div_rem_13 (mytype n, mytype *div, unsigned int *rem)
{
  const mytype magic = TWO_64 * 14189803133622732012u + 5675921253449092805u * ONE; /* 0xC4EC4EC4EC4EC4EC4EC4EC4EC4EC4EC5 */
  __uint64_t a, b, c;
  unsigned int r;

  a = n & MASK60;
  b = (n >> 60);
  b = b & MASK60;
  c = (n >> 120);
  r = (a+b+c) % 13;
  n = n - r;
  *div = n * magic;
  *rem = r;
}

Assembly is

div_rem_13:                             ; @div_rem_13
; %bb.0:                                ; %entry
        srl     r5,r1,<60:0>
        sll     r6,r2,<0:4>
        srl     r7,r1,<0:60>
        or      r6,r7,r6
        and     r6,r6,#1152921504606846975
        srl     r7,r2,<0:56>
        add     r5,r5,r7
        add     r5,r5,r6
        carry   r6,{O}
        div     r5,r5,#13
        carry   r5,{O,-,IO}
        add     r1,r1,-r6
        mov     r7,#0
        add     r2,r2,-r7
        mov     r5,#5675921253449092805
        carry   r7,{O}
        mul     r5,r1,r5
        mul     r1,r1,#-4256940940086819604
        add     r1,r7,r1
        mul     r2,r2,#5675921253449092805
        add     r1,r1,r2
        std     r5,[r3]
        std     r1,[r3,8]
        stw     r6,[r4]
        ret

tkoenig1 commented 1 year ago

As an explanation of what this does: This first calculates the remainder of the division by 13 by taking advantage of the fact that 2^60 mod 13 = 1, and then uses the algorithm from Hacker's Delight for division when the remainder is known to be zero.

bagel99 commented 1 year ago

There seems to be (at least) 3 opportunities for improvement here.

and r6,r6,#1152921504606846975 should be srl r6,r6,<60:0>
the sub of n-r has extra instructions as you noted. Its a 128-bit minus 64-bit and the carry out is not used.
the 128-bit by 128-bit multiply could be better, as you noted. I may break out two of these into separate issues if I can fix them more easily.

bagel99 commented 1 year ago

Commit 9f1575dce18acaa0c24c19b657259cd769c3962f makes things better but not perfect.

bagel99 commented 1 year ago

Try again with commit af7ee91ac91db0ac489c686eb35daf15ceb1f32f

tkoenig1 commented 1 year ago

Code is now

        srl     r5,r1,<60:0>
        sll     r6,r2,<0:4>
        srl     r7,r1,<0:60>
        or      r6,r7,r6
        srl     r6,r6,<60:0>
        srl     r7,r2,<0:56>
        add     r5,r5,r7
        add     r5,r5,r6
        carry   r6,{O}
        mul     r7,r5,#5675921253449092805
        srl     r6,r6,<0:2>
        mul     r6,r6,#13
        add     r5,r5,-r6
        carry   r6,{O,I}
        add     r1,r1,-r5
        add     r2,r2,#0
        mul     r6,r1,#-4256940940086819604
        carry   r7,{O}
        mul     r1,r1,#5675921253449092805
        add     r6,r7,r6
        mul     r2,r2,#5675921253449092805
        add     r2,r6,r2
        std     r1,[r3]
        std     r2,[r3,8]
        stw     r5,[r4]
        ret

which is quite good.

If optimization for size was required, it might make sense to put #5675921253449092805 into a register. I am actually not sure how to ask llc to optimize for size, it only offers -O0 to -O3 as far as I can see.

bagel99 commented 1 year ago

The backends don't accept -Os or -Oz. One must use those flags with clang. Clang will then tag each function with the attribute "minsize" which is the clue to the backends.

tkoenig1 commented 1 year ago

Hm, I do not get the division with the constant 13 with either -Os or -Oz. Compile script is

#! /bin/bash
a=${1%%.[ci]}
b=${a}_s
clang -fverbose-asm -c --target=my66000 -Oz -fno-vectorize -fno-slp-vectorize  -emit-llvm -fno-unroll-loops -fomit-frame-pointer $1
opt  -disable-loop-unrolling -Oz  --march=my66000 --frame-pointer=none --enable-vvm $a.bc  > $b.bc
llc -O0 --disable-lsr --enable-predication --enable-predication2 --enable-carry-generation --early-carry-coalesce --enable-vvm -march=my66000 $b.bc

The "opt" pass seems to be removing the minsize attribute, despite the -Oz.

If I just use clang and llc, with

#! /bin/bash
a=${1%%.[ci]}
clang -fverbose-asm -c --target=my66000 -Oz -fno-vectorize -fno-slp-vectorize  -emit-llvm -fno-unroll-loops -fomit-frame-pointer $1
llc -O2 --disable-lsr --enable-predication --enable-predication2 --enable-carry-generation --early-carry-coalesce --enable-vvm -march=my66000 $a.bc

I get

        srl     r5,r1,<60:0>
        sll     r6,r2,<0:4>
        srl     r7,r1,<0:60>
        or      r6,r7,r6
        srl     r6,r6,<60:0>
        srl     r7,r2,<0:56>
        add     r5,r5,r7
        add     r5,r5,r6
        carry   r6,{O}
        div     r5,r5,#13
        carry   r5,{O,I}
        add     r1,r1,-r6
        add     r2,r2,#0
        mul     r5,r1,#-4256940940086819604
        carry   r7,{O}
        mul     r1,r1,#5675921253449092805
        add     r5,r7,r5
        mul     r2,r2,#5675921253449092805
        add     r2,r5,r2
        std     r1,[r3]
        std     r2,[r3,8]
        stw     r6,[r4]
        ret

which is probably as good as it is going to get.

So, closing (in gcc land, I would call this RESOLVED FIXED).

bagel99 / llvm-my66000

Loading register + operation #31