Closed Quuxplusone closed 11 years ago
Bugzilla Link | PR1488 |
Status | RESOLVED LATER |
Importance | P enhancement |
Reported by | Nick Lewycky (nicholas@mxc.ca) |
Reported on | 2007-06-02 20:35:19 -0700 |
Last modified on | 2013-09-23 17:45:35 -0700 |
Version | trunk |
Hardware | All All |
CC | clattner@nondot.org, efriedma@quicinc.com, llvm-bugs@lists.llvm.org, nadav256@gmail.com, shuxin.llvm@gmail.com |
Fixed by commit(s) | |
Attachments | |
Blocks | PR17101 |
Blocked by | |
See also |
I forgot to mention that this is also the llvm.ctpop.* intrinsic. We should
match the code better and, at least on x86, emit it better.
The right way to do this is to recognize the loop and turn it into a call to llvm.ctpop
fwiw, 186.crafty contains this loop:
int PopCnt(register BITBOARD a)
{
register int c=0;
while(a) {
c++;
a &= a - 1;
}
return(c);
}
BITBOARD = unsigned long long
If you're interested in matching these patterns, matching this one to
llvm.popcnt would result in a huge
speedup for crafty.
-Chris
I just got this on IRC:
unsigned int popcount(unsigned int input) {
unsigned int count = 0;
for (unsigned int i = 0; i < 4 * 8; i++)
count += (input >> i) & i;
return count;
}
with the question, "why doesn't loop-unroll unroll my loop?".
_Bug 1792 has been marked as a duplicate of this bug._
(In reply to comment #3)
> If you're interested in matching these patterns, matching this one to
llvm.popcnt would result in a huge
> speedup for crafty.
For crafty in particular, it actually has asm popcount/ctlz/cttx functions (if
you pass in the right flags on X86). And the asm implementations are faster
than what LLVM currently generates.
(In reply to comment #6)
> (In reply to comment #3)
> > If you're interested in matching these patterns, matching this one to
llvm.popcnt would result in a huge
> > speedup for crafty.
>
> For crafty in particular, it actually has asm popcount/ctlz/cttx functions (if
> you pass in the right flags on X86). And the asm implementations are faster
> than what LLVM currently generates.
>
This requires SSE4 support:
http://en.wikipedia.org/wiki/SSE4#SSE4.2
I'm not aware of any shipping cpu with SSE4.2. Penryn etc only have 4.1
(In reply to comment #7)
> This requires SSE4 support:
> http://en.wikipedia.org/wiki/SSE4#SSE4.2
crafty is actually just using a simple loop for popcount on X86; I think it
ends up being faster than __builtin_popcount because the popcount is usually
small.
I moved these ideas into lib/Target/README.txt
We are now able to recognize following patten. Since Popcount recognization is
architecture-dependent
optimization. It is enabled only if the underlying architecture has HW support.
ScalarTargetTransformInfo::getPopcntHwSupport() serves as the interface between
the LoopIdiomRecognize pass and HW support.
On x8864, currently popcount optimization is trigged if -msse4 is turned on (or
other flags that implies sse4). The CodeGen will emit POPCNT instruction (in
SSE4 instruction set).
According to "http://www.strchr.com/crc32_popcnt", the POPCNT instruction is
not as faster as the SSE3 implementation.
In the near future, we need to implement popcount HW support using SSE3
instruction.
int PopCnt(register BITBOARD a)
{
register int c=0;
while(a) {
c++;
a &= a - 1;
}
return(c);
}