bithacks: optimize code that counts bits #1511

Closed Quuxplusone closed 11 years ago

Quuxplusone commented 17 years ago
Bugzilla Link PR1488
Status RESOLVED LATER
Importance P enhancement
Reported by Nick Lewycky (nicholas@mxc.ca)
Reported on 2007-06-02 20:35:19 -0700
Last modified on 2013-09-23 17:45:35 -0700
Version trunk
Hardware All All
CC clattner@nondot.org, efriedma@quicinc.com, llvm-bugs@lists.llvm.org, nadav256@gmail.com, shuxin.llvm@gmail.com
Fixed by commit(s)
Attachments
Blocks PR17101
Blocked by
See also
The common idiom for counting the number of bits set can be optimized.

unsigned countbits_slow(unsigned v)
{
  unsigned c;
  for (c = 0; v; v >>= 1)
  {
    c += v & 1;
  }
  return c;
}

This can be rewritten as follows:

unsigned countbits_fast(unsigned v)
{
  unsigned c;
  for (c = 0; v; c++)
  {
    v &= v - 1; // clear the least significant bit set
  }
  return c;
}

There are other ways to rewrite the bit-counting code, but they involve other
tradeoffs such as constant lookup arrays (one such variant is sketched below).
Also, the naive method may be written so that it starts with the high bit or
the low bit for performance, if the programmer happens to know the distribution
of the bits in advance. countbits_fast will equal or outperform either
implementation.
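
As a point of reference, here is a minimal sketch of the table-based variant
alluded to above (not part of the original report; it assumes a 32-bit
unsigned and builds the 256-entry per-byte table with the usual macro trick):

#define B2(n) n, n + 1, n + 1, n + 2
#define B4(n) B2(n), B2(n + 1), B2(n + 1), B2(n + 2)
#define B6(n) B4(n), B4(n + 1), B4(n + 1), B4(n + 2)
/* bits_in_byte[i] == number of set bits in the byte value i */
static const unsigned char bits_in_byte[256] = { B6(0), B6(1), B6(1), B6(2) };

unsigned countbits_table(unsigned v)
{
  /* assumes 32-bit unsigned: sum the per-byte counts of the four bytes */
  return bits_in_byte[v & 0xff]
       + bits_in_byte[(v >> 8) & 0xff]
       + bits_in_byte[(v >> 16) & 0xff]
       + bits_in_byte[(v >> 24) & 0xff];
}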

With stock GCC, countbits_slow has a 63% longer run-time in a tight loop, while
with llvm-gcc, countbits_slow has a 45% longer run-time. (Sadly, llvm-gcc is
slower than GCC at either function on x86 Linux, but that's another bug.)
Quuxplusone commented 17 years ago
I forgot to mention that this is also the llvm.ctpop.* intrinsic. We should
match the code better and, at least on x86, emit it better.
Quuxplusone commented 17 years ago

The right way to do this is to recognize the loop and turn it into a call to llvm.ctpop
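
For illustration only (not part of the original comment), the recognized form
corresponds at the source level to Clang's __builtin_popcount, which is
lowered to the llvm.ctpop.i32 intrinsic:

unsigned countbits_intrinsic(unsigned v)
{
  /* Clang lowers __builtin_popcount to llvm.ctpop.i32, so recognizing the
     countbits loop and turning it into the intrinsic is equivalent to this
     rewrite. */
  return (unsigned)__builtin_popcount(v);
}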

Quuxplusone commented 17 years ago
fwiw, 186.crafty contains this loop:

  int PopCnt(register BITBOARD a)
  {
    register int c=0;
    while(a) {
      c++;
      a &= a - 1;
    }
    return(c);
  }

BITBOARD = unsigned long long

If you're interested in matching these patterns, matching this one to
llvm.popcnt would result in a huge speedup for crafty.

-Chris
Quuxplusone commented 17 years ago
I just got this on IRC:

  unsigned int popcount(unsigned int input) {
    unsigned int count = 0;
    for (unsigned int i =  0; i < 4 * 8; i++)
      count += (input >> i) & 1;
    return count;
  }

with the question, "why doesn't loop-unroll unroll my loop?".
Quuxplusone commented 17 years ago

_Bug 1792 has been marked as a duplicate of this bug._

Quuxplusone commented 16 years ago
(In reply to comment #3)
> If you're interested in matching these patterns, matching this one to
> llvm.popcnt would result in a huge speedup for crafty.

For crafty in particular, it actually has asm popcount/ctlz/cttz functions (if
you pass in the right flags on X86).  And the asm implementations are faster
than what LLVM currently generates.
Quuxplusone commented 16 years ago
(In reply to comment #6)
> (In reply to comment #3)
> > If you're interested in matching these patterns, matching this one to
> > llvm.popcnt would result in a huge speedup for crafty.
>
> For crafty in particular, it actually has asm popcount/ctlz/cttz functions (if
> you pass in the right flags on X86).  And the asm implementations are faster
> than what LLVM currently generates.
>

This requires SSE4 support:
  http://en.wikipedia.org/wiki/SSE4#SSE4.2
Quuxplusone commented 16 years ago

I'm not aware of any shipping CPU with SSE4.2. Penryn etc. only have SSE4.1.

Quuxplusone commented 16 years ago
(In reply to comment #7)
> This requires SSE4 support:
>   http://en.wikipedia.org/wiki/SSE4#SSE4.2

crafty is actually just using a simple loop for popcount on X86; I think it
ends up being faster than __builtin_popcount because the popcount is usually
small.
Quuxplusone commented 16 years ago

I moved these ideas into lib/Target/README.txt

Quuxplusone commented 11 years ago
We are now able to recognize the following pattern. Since popcount recognition
is an architecture-dependent optimization, it is enabled only if the underlying
architecture has HW support. ScalarTargetTransformInfo::getPopcntHwSupport()
serves as the interface between the LoopIdiomRecognize pass and the HW support
query.

On x86-64, the popcount optimization is currently triggered if -msse4 is turned
on (or other flags that imply SSE4). CodeGen will then emit the POPCNT
instruction (from the SSE4 instruction set).

According to "http://www.strchr.com/crc32_popcnt", the POPCNT instruction is
not as fast as the SSE3 implementation.

In the near future, we need to implement popcount HW support using SSE3
instructions.

  int PopCnt(register BITBOARD a)
  {
    register int c=0;
    while(a) {
      c++;
      a &= a - 1;
    }
    return(c);
  }
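
For anyone reproducing this, here is a minimal check harness (a sketch, not
part of the original comment). Per the above, building it with something like
clang -O2 -msse4.2 should let LoopIdiomRecognize turn the loop in PopCnt into
llvm.ctpop.i64 and lower it to a single POPCNT; the harness merely verifies
that the loop agrees with __builtin_popcountll:

#include <assert.h>
#include <stdio.h>

typedef unsigned long long BITBOARD;

int PopCnt(BITBOARD a)
{
  int c = 0;
  while (a) {
    c++;
    a &= a - 1;   /* clear the least significant set bit */
  }
  return c;
}

int main(void)
{
  BITBOARD x = 0x123456789abcdef0ULL;
  int i;
  /* rotate the test value and check the loop against the builtin each time */
  for (i = 0; i < 64; i++, x = (x << 1) | (x >> 63))
    assert(PopCnt(x) == __builtin_popcountll(x));
  printf("PopCnt agrees with __builtin_popcountll\n");
  return 0;
}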