
Optimize __builtin_parity #2118

Closed Quuxplusone closed 3 years ago

Quuxplusone commented 16 years ago
Bugzilla Link PR1983
Status RESOLVED FIXED
Importance P enhancement
Reported by Eli Friedman (efriedma@quicinc.com)
Reported on 2008-02-04 22:30:54 -0800
Last modified on 2020-10-10 12:12:53 -0700
Version trunk
Hardware PC Linux
CC baldrick@free.fr, craig.topper@gmail.com, gonzalo.gadeschi@gmail.com, i@maskray.me, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com, stoklund@2pi.dk
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also PR46954
Spun off of Bug 1979, because I don't want to pollute that bug with a lot of
irrelevant discussion.

(In reply to comment #2)
> Eli: some random unrelated thoughts :).  If you're interested in parity, I
> wonder if llvm-gcc generates efficient code for __builtin_parity.  Could you
> check?

It appears llvm-gcc doesn't support __builtin_parity... at least the version at
http://llvm.org/demo/index.cgi (I don't have a trunk build).

LLVM currently generates bad code for llvm.ctpop.i32(x) & 1; it just generates
the ctpop and ands it with 1, which is really slow for CPUs without a popcount
instruction.

(In reply to comment #3)
> It would also be useful to teach instcombine about various idioms for parity,
> turning it into llvm.ctpop(x)&1

What exactly are common idioms for parity? The only idioms that would be easy
to detect are popcount & 1 and maybe the pure shift+xor method.
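
For reference, the two idioms I have in mind look roughly like this (just a
sketch; the function names are made up for illustration):

static unsigned parity_popcount(unsigned x) {
  /* Idiom 1: popcount & 1 -- maps directly onto llvm.ctpop.i32(x) & 1. */
  return __builtin_popcount(x) & 1;
}

static unsigned parity_xorshift(unsigned x) {
  /* Idiom 2: pure shift+xor -- keep folding the word in half until the
     parity of all 32 bits ends up in bit 0. */
  x ^= x >> 16;
  x ^= x >> 8;
  x ^= x >> 4;
  x ^= x >> 2;
  x ^= x >> 1;
  return x & 1;
}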

I've done some rough testing of various ways of computing parity using a
benchmark that looks like the testcase for this bug.

For x86, something like the following is probably best; it appears to be as
fast as a lookup table.

static unsigned parity2(unsigned x) {
  int result;
  asm ("movl    %1, %%ecx\n"     /* ecx = x */
       "shrl    $16, %%ecx\n"    /* ecx = x >> 16 */
       "xorl    %1, %%ecx\n"     /* fold the upper half into the lower half */
       "xorb    %%ch, %%cl\n"    /* fold the two remaining bytes; sets PF */
       "setnp   %%cl\n"          /* PF only covers that low byte: cl = 1 if parity is odd */
       "movzbl  %%cl, %0\n"      /* zero-extend the result byte */
        :"=r"(result)    /* output */
        :"g" (x)         /* input */
        :"%ecx"         /* clobbered register */
       );
  return result;
}

I don't know how difficult it would be to make LLVM generate something like
that, though.

Besides that, the fastest method is a lookup table; however, a full lookup
table is at least 256 bytes.  Without a lookup table, the method from my
testcase is the fastest. The method from
http://graphics.stanford.edu/~seander/bithacks.html#ParityParallel is a little
slower. Purely using xor+shift is a little slower than that (it adds up to
being about twice as slow as the asm or table lookup).
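
Roughly, the table and bithacks variants I'm comparing look like this (a
sketch, not the exact benchmark code; the table initializer macro is the one
from the bithacks page):

/* 256-byte table of the parity of each byte value, built with the macro
   trick from the bithacks page. */
static const unsigned char parity_table[256] = {
#define P2(n) n, n^1, n^1, n
#define P4(n) P2(n), P2(n^1), P2(n^1), P2(n)
#define P6(n) P4(n), P4(n^1), P4(n^1), P4(n)
  P6(0), P6(1), P6(1), P6(0)
};

static unsigned parity_lookup(unsigned x) {
  /* Fold the word down to one byte, then look its parity up. */
  x ^= x >> 16;
  x ^= x >> 8;
  return parity_table[x & 0xff];
}

static unsigned parity_parallel(unsigned x) {
  /* The "parallel" method: fold down to a nibble, then use 0x6996 as a
     16-entry, one-bit-per-entry parity table. */
  x ^= x >> 16;
  x ^= x >> 8;
  x ^= x >> 4;
  return (0x6996 >> (x & 0xf)) & 1;
}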

That said, the way I'm benchmarking is not at all representative of real-world
code, so my results might be a bit off.

The version of gcc I have converts __builtin_parity into a call to __paritysi2,
a function in libgcc that uses a lookup table; however, I think that has since
changed (http://www.archivum.info/gcc-patches@gcc.gnu.org/2007-02/msg00999.html).

Is there actually any real-world use for 32-bit parity, though?  I just looked
at it because I was curious...
Quuxplusone commented 16 years ago

llvm-gcc 2.2 does support __builtin_parity: it turns it into popcnt(x)&1 at the LLVM level, which is a reasonable thing to canonicalize it into. I don't know if there are any real-world uses of parity.

Quuxplusone commented 16 years ago
FWIW, the code we're generating for __builtin_parity on x86 is truly awful.  We
should generate the optimized x86 code from Eli's suggestion, or, if not that,
a call to __paritysi2 and friends.

It would be good to have the target-independent expander turn parity into
something like this:

int parity(unsigned x) {
  x ^= x >> 1;
  x ^= x >> 2;
  return ((x & 0x11111111) * 0x88888888) >> 31;
}

or equivalent, which compiles into relatively decent code.  Right now on x86 we
get this, which is very awful:

_par2:
    pushl   %ebp
    movl    %esp, %ebp
    movl    8(%ebp), %eax
    movl    %eax, %ecx
    andl    $1431655765, %ecx
    shrl    %eax
    andl    $1431655765, %eax
    addl    %eax, %ecx
    movl    %ecx, %eax
    andl    $858993459, %eax
    shrl    $2, %ecx
    andl    $858993459, %ecx
    addl    %ecx, %eax
    movl    %eax, %ecx
    andl    $252645135, %ecx
    shrl    $4, %eax
    andl    $252645135, %eax
    addl    %eax, %ecx
    movl    %ecx, %eax
    andl    $16711935, %eax
    shrl    $8, %ecx
    andl    $16711935, %ecx
    addl    %ecx, %eax
    movl    %eax, %ecx
    shrl    $16, %ecx
    andl    $65535, %eax
    addl    %ecx, %eax
    andl    $1, %eax
    popl    %ebp
    ret
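
For anyone puzzling over the suggested expansion, here is a rough sketch of
why it is correct, with a quick spot check against __builtin_popcount (my
notes, not part of any patch):

#include <assert.h>

/* After x ^= x >> 1 and x ^= x >> 2, bit 0 of every nibble holds the parity
   of that nibble.  x & 0x11111111 keeps just those eight bits.  Multiplying
   by 0x11111111 would sum them into bits 28..31 (the sum is at most 8, so
   nothing carries between nibbles); multiplying by 0x88888888 is the same
   sum shifted left by 3, so the low bit of the sum -- the overall parity --
   lands exactly in bit 31. */
static int parity(unsigned x) {
  x ^= x >> 1;
  x ^= x >> 2;
  return ((x & 0x11111111) * 0x88888888u) >> 31;
}

int main(void) {
  unsigned tests[] = { 0, 1, 3, 7, 0x80000000u, 0xffffffffu, 0x12345678u };
  for (unsigned i = 0; i < sizeof(tests) / sizeof(tests[0]); ++i)
    assert(parity(tests[i]) == (__builtin_popcount(tests[i]) & 1));
  return 0;
}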
Quuxplusone commented 13 years ago

The parity function is useful when working with polynomials and vector spaces over GF(2).

This comes up in communication theory (scramblers and error correcting codes) as well as cryptography (mostly older military ciphers, but also A5/1 used in current GSM systems and E0 used in Bluetooth).

I suppose the CRC is the only thing that escaped the world of telecom.

Quuxplusone commented 8 years ago

Any improvements on this or any chance of providing a __builtin_parity intrinsic just like GCC does?
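
(As far as I can tell, clang already accepts the GCC spellings --
__builtin_parity, __builtin_parityl and __builtin_parityll -- so this is
really about the quality of the code they lower to. A minimal check:)

unsigned par32(unsigned x)           { return __builtin_parity(x); }
unsigned par64(unsigned long long x) { return __builtin_parityll(x); }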

Quuxplusone commented 3 years ago
Current codegen on x86:
https://godbolt.org/z/j8Ga63

        mov     ecx, edi
        shr     ecx, 16
        xor     ecx, edi
        xor     eax, eax
        xor     cl, ch
        setnp   al

Or:
        popcnt  eax, edi
        and     eax, 1
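
(Both sequences presumably come from the plain builtin; the popcnt form just
requires telling the compiler the target has the instruction, e.g. -mpopcnt
or an -march that implies it. A minimal reproduction, assuming that's what the
godbolt link compiles:)

/* clang -O2:           shr/xor/setnp sequence
   clang -O2 -mpopcnt:  popcnt + and 1 */
unsigned parity(unsigned x) { return __builtin_parity(x); }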

It's been 12+ years since popcnt was added to mainstream x86 (Nehalem,
Barcelona) - as old as this bug. :)

Either way, I think we're optimal on x86 now. Resolve as fixed?