
Missed optimization of shifts into roll/rorl when avoiding undefined behavior #17331

Closed Quuxplusone closed 5 years ago

Quuxplusone commented 11 years ago
Bugzilla Link PR17332
Status RESOLVED FIXED
Importance P normal
Reported by jonathan.sauer@gmx.de
Reported on 2013-09-23 08:06:28 -0700
Last modified on 2018-11-11 07:10:57 -0800
Version trunk
Hardware PC All
CC benny.kra@gmail.com, craig.topper@gmail.com, efriedma@quicinc.com, jingu.kang@arm.com, kkhoo@perfwizard.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, noloader@gmail.com, regehr@cs.utah.edu, richard-llvm@metafoo.co.uk, spatel+llvm@rotateright.com, trass3r@gmail.com
Fixed by commit(s)
Attachments roll_rorl.txt (792 bytes, text/plain)
no_roll_rorl.txt (1066 bytes, text/plain)
rolr.s (2813 bytes, application/octet-stream)
Blocks
Blocked by
See also PR17975, PR16726, PR20750, PR34046, PR34924, PR38152, PR37387
Created attachment 11268
Disassembly of first rol/ror (no masking, roll and rorl)

The following two functions rotate an unsigned integer to the left and the
right, respectively:

unsigned int rol(unsigned int value, unsigned int amount)
{
    return (value << amount) | (value >> (32 - amount));
}

unsigned int ror(unsigned int value, unsigned int amount)
{
    return (value >> amount) | (value << (32 - amount));
}

On x86_64, clang r191183 with -O2 compiles them into a rol/ror instruction
(disassembly attached):

% ~/LLVM/build/Release+Asserts/bin/clang++ -S -O2 -o - clang.cpp

Unfortunately the code has undefined behavior when <amount> is zero: <value>
then gets shifted by 32 bits which, assuming <unsigned int> is 32 bits wide, is
undefined according to §5.8p1:

| The behavior is undefined if the right operand is negative, or greater than
| or equal to the length in bits of the promoted left operand

Of course it's easy to fix the functions by keeping only the lower five bits of
the second shift's amount, <32 - amount>, which in this case results in no
functional change:

unsigned int rol(unsigned int value, unsigned int amount)
{
    return (value << amount) | (value >> ((32 - amount) & 31));
}

unsigned int ror(unsigned int value, unsigned int amount)
{
    return (value >> amount) | (value << ((32 - amount) & 31));
}

This however does not result in a rol/ror instruction (disassembly attached);
instead the code is compiled into two shifts, a negation and an or.

Now I'm a bit stuck between a rock and a hard place: on the one hand I want to
avoid undefined behavior (especially when using clang's UB sanitizer); on the
other hand I'd like clang to generate optimal code.
Quuxplusone commented 11 years ago

Attached roll_rorl.txt (792 bytes, text/plain): Disassembly of first rol/ror (no masking, roll and rorl)

Quuxplusone commented 11 years ago

Attached no_roll_rorl.txt (1066 bytes, text/plain): Disassembly of second rol/ror (masking, no roll/rorl)

Quuxplusone commented 11 years ago
To truly avoid undefined behavior, don't you need to mask both uses of 'amount'?

Eg:
unsigned int ror(unsigned int value, unsigned int amount) {
    unsigned int amount_masked = amount & 31;
    return (value >> amount_masked) | (value << (32 - amount_masked));
}

This is what the compiler should be doing internally anyway?

I'm not saying your example code shouldn't be optimized as well, but the above
code generates a 'ror' for me, so there is a work-around if you're currently
stuck.
Quuxplusone commented 11 years ago
(In reply to comment #2)
> To truly avoid undefined behavior, don't you need to mask both uses of
> 'amount'?

I was thinking that it would be easier to avoid calling rol/ror with an amount
bigger than 31 than with an amount of zero. But you're right: the amount of the
first shift also has to be masked to make the functions completely UB-safe.
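
For reference, a sketch of mine (not code from this report) that combines both
maskings, so every shift amount stays in the range 0..31 and the function is
defined for any <amount>, including 0 and values >= 32. Whether it lowers to a
single rotate instruction is, of course, exactly what this bug is about:

unsigned int rol_safe(unsigned int value, unsigned int amount)
{
    unsigned int a = amount & 31;
    return (value << a) | (value >> ((32 - a) & 31));
}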

> I'm not saying your example code shouldn't be optimized as well, but the
> above code generates a 'ror' for me, so there is a work-around if you're
> currently stuck.

Indeed it does. However clang inserts an <andl $31> before the <rol/ror> to
mask the amount to five bits, despite rol/ror doing this by itself according to
the pseudo-code in
<http://download.intel.com/products/processor/manual/325383.pdf> (page 936):

| (* ROL and ROR instructions *)
| IF OperandSize = 64
|   THEN COUNTMASK = 3FH;
|   ELSE COUNTMASK = 1FH;
| FI;
| (* ROL instruction operation *)
| tempCOUNT ← (COUNT & COUNTMASK) MOD SIZE

AMD's CPUs perform identical masking:
<http://support.amd.com/us/Processor_TechDocs/24594_APM_v3.pdf> page 311.

So I'd guess clang/LLVM should learn about the masking performed by rol/ror so
it can remove redundant masking operations.

(according to the manual the only Intel CPU that does not mask the amount is
the 8086)

Nevertheless, thank you for your suggestion.
Quuxplusone commented 11 years ago
Thanks for the pointers to the reference docs.

This shortcoming has been noted since at least 3/16/2008 (r48438) - in
lib/Target/README.txt:

"We miss a bunch of rotate opportunities on various targets, including ppc, x86,
etc.  On X86, we miss a bunch of 'rotate by variable' cases because the rotate
matching code in dag combine doesn't look through truncates aggressively
enough.  Here are some testcases reduced from GCC PR17886..."

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17886

Based on the README comment, I think the fix will go in
DAGCombiner::MatchRotate() in lib/CodeGen/SelectionDAG/DAGCombiner.cpp.
Quuxplusone commented 11 years ago
I've been familiarizing myself with DAGCombiner (never looked at it before this
week).

I hacked some code into MatchRotate to try to peel off additional masking
operations, but now I'm wondering if that's the best way to solve this bug.

A simple shift with mask:

unsigned int foo(unsigned int x, unsigned int y) {
    return x << (y & 31);
}

Has the 'and' op optimized away:
## BB#0:                                ## %entry
    movb    %sil, %cl
    shll    %cl, %edi
    movl    %edi, %eax
    ret

So why isn't the 'and' operation optimized away for the rotate case? It might
be that the code is optimized to 'rotate' before the 'shift' code optimizer
gets to run, so the 'and' gets left behind.
Quuxplusone commented 11 years ago
(In reply to comment #5)
> So why isn't the 'and' operation optimized away for the rotate case? It
> might be that the code is optimized to 'rotate' before the 'shift' code
> optimizer gets to run, so the 'and' gets left behind.

I think the shift with mask gets optimized not inside DAGCombiner, but inside
the X86-specific backend. There are patterns in X86InstrCompiler.td around line
1519 that seem to do it:

| // (shl x (and y, 31)) ==> (shl x, y)
| def : Pat<(shl GR8:$src1, (and CL, immShift32)),
|          (SHL8rCL GR8:$src1)>;
| [...]
| def : Pat<(srl GR8:$src1, (and CL, immShift32)),
|          (SHR8rCL GR8:$src1)>;

(and similarly for sra, plus patterns with a mask value of 63 for shifts of 64-bit integers)

Since the DAGCombiner is run before the instruction selection (I guess, I'm
only an amateur when it comes to LLVM), the shifts have already been replaced
by rol/ror when the instruction selection gets a look at the code.

So I guess fixing this bug means adding patterns to X86InstrCompiler.td (and
other targets that wrap the amount of bits to shift; it is target-specific
after all).
Quuxplusone commented 11 years ago
(In reply to comment #6)
> So I guess fixing this bug means adding patterns to X86InstrCompiler.td (and
> other targets that wrap the amount of bits to shift; it is target-specific
> after all).

I was hoping to avoid target-specific changes because I think the C language
limitation that you quoted:

| The behavior is undefined if the right operand is negative, or greater than
| or equal to the length in bits of the promoted left operand

means that we should be able to do this optimization for any target.
Quuxplusone commented 11 years ago
(In reply to comment #7)
> I was hoping to avoid target-specific changes because I think the C language
> limitation that you quoted:
>
> | The behavior is undefined if the right operand is negative, or greater than
> | or equal to the length in bits of the promoted left operand
>
> means that we should be able to do this optimization for any target.

You are right[*]. Furthermore I doubt there is an architecture that uses more
than the absolute minimum of bits in its rol/ror instructions to store the
amount (when rotating by an immediate instead of a register or memory
location). So LLVM has to deal with overflow anyway when emitting the machine
code.

[*] Although I'm not sure if LLVM can depend on C's definition of undefined
behavior as AFAIK LLVM is not limited to C-like languages (contrary to clang).
Quuxplusone commented 11 years ago
(In reply to comment #8)
> [*] Although I'm not sure if LLVM can depend on C's definition of undefined
> behavior as AFAIK LLVM is not limited to C-like languages (contrary to
> clang).

Good point. I don't know how that kind of decision is made either. Should this
source-language constraint of C/C++ be specified as an attribute in the IR?
Maybe it is already, and I just don't see it?

If not, I think the current logic in DAGCombiner::MatchRotate() is already
implicitly assuming that it has free license to optimize without checking to
see if the shift variable is greater than the operand length in bits.

I'm cc'ing Craig and Eli to see if they can offer any advice.
Quuxplusone commented 11 years ago

Just noticed that Benjamin Kramer and JinGu Kang have recently modified MatchRotate - cc'ing them too in case they can provide any input.

Quuxplusone commented 11 years ago

Shifts in LLVM (and SelectionDAG) are undefined if the shift amount is equal to or larger than the type width. Just dropping the AND at this stage would be invalid.

The optimization described here could be handled by recognizing the "& 31" in dagcombiner. If there were a masking operation, we could add it to the rotate amount when doing the transform and avoid introducing undefined behavior. Implementing this with the current rotate matching code is going to be ugly; it probably needs to be refactored first.
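
For intuition, the fold relies on a source-level equivalence that is easy to
spot-check: for every in-range amount, the masked expression from the report
computes an ordinary rotate-left. A brute-force sketch of mine (not code from
the thread), assuming a 32-bit unsigned int:

#include <assert.h>

unsigned int rol_masked(unsigned int x, unsigned int y)
{
    return (x << y) | (x >> ((32 - y) & 31));   /* pattern from the report */
}

unsigned int rol_ref(unsigned int x, unsigned int y)
{
    y &= 31;                                    /* plain rotate-left by y */
    return y ? (x << y) | (x >> (32 - y)) : x;
}

int main(void)
{
    for (unsigned int y = 0; y < 32; ++y)
        assert(rol_masked(0x12345678u, y) == rol_ref(0x12345678u, y));
    return 0;
}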

Quuxplusone commented 11 years ago
Thanks for the advice, Benjamin!

There's an additional complication with recognizing the code as shown in
Jonathan's original description:

| unsigned int rol_with_mask_right(unsigned int x, unsigned int y) {
|   return (x << y) | (x >> ((32 - y) & 31));
| }

Clang changes that to:

$ ./clang -S -O3 -fomit-frame-pointer ror.c -emit-llvm -o -
...
define i32 @rol_with_mask_right(i32 %x, i32 %y) #0 {
entry:
  %shl = shl i32 %x, %y
  %0 = sub i32 0, %y       <---- subtract from 32 changed to subtract from 0
  %and = and i32 %0, 31
  %shr = lshr i32 %x, %and
  %or = or i32 %shr, %shl
  ret i32 %or
}

So in the case where we detect a mask, we would have to adjust the subsequent
detection of the subtract operand.
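
Clang's canonicalization here is valid because 32 and 0 are congruent modulo
32, so after the "& 31" mask the two subtract forms agree for every y. A quick
spot check (a sketch of mine, not from the thread):

#include <assert.h>

int main(void)
{
    /* (32 - y) & 31 and (0 - y) & 31 agree for every unsigned y. */
    for (unsigned int y = 0; y < 256; ++y)
        assert(((32u - y) & 31u) == ((0u - y) & 31u));
    return 0;
}
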
Quuxplusone commented 10 years ago

_Bug 17904 has been marked as a duplicate of this bug._

Quuxplusone commented 10 years ago

So we don't have to keep clicking back to the duplicate bug: here's another source-level implementation for safe-rotate that LLVM should handle:

#define ROTL32(n,x) ((n)==0?(x):(((x)<<(n)) | ((x)>>(32-(n)))))

Quuxplusone commented 10 years ago

Attached rolr.s (2813 bytes, application/octet-stream): Improved code generation with clang r199654

Quuxplusone commented 9 years ago
> When compiled with:
> % ~/LLVM/build/Release+Asserts/bin/clang++ -S -O2 -o - clang.cpp
>
> It results in the attached bitcode. Only rotCond is not optimal as clang
> leaves the test against zero.

For software with security requirements, this might be less than ideal. The
test/branch adds an additional code path, which might help adversaries gain an
advantage with timing attacks. (Folks like Daniel J. Bernstein take great care
to ensure these types of attacks are not easily carried out).

The same is true for the Microsoft-compatible rotates provided in Intrin.h.
Intrin.h should probably use the improved rotates, and not the legacy ones.
Quuxplusone commented 5 years ago
Just use the branchless (x << (n % width) | x >> (-n % width)) pattern.
Also used by clang internally to emit code for the _rotl MSVC intrinsic.
See https://bugs.llvm.org/show_bug.cgi?id=39624.
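
A sketch of that pattern specialized to 32 bits (the choice of width is mine;
the comment leaves it generic). With unsigned arithmetic -n wraps around, so
both shift counts stay in 0..31 and there is no undefined behavior for any n,
including n == 0:

static inline unsigned int rotl32(unsigned int x, unsigned int n)
{
    return (x << (n % 32)) | (x >> (-n % 32));
}
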
Quuxplusone commented 5 years ago
(In reply to Trass3r from comment #17)
> Just use the branchless (x << (n % width) | x >> (-n % width)) pattern.
> Also used by clang internally to emit code for the _rotl MSVC intrinsic.
> See https://bugs.llvm.org/show_bug.cgi?id=39624.

Let me know if I'm missing some use case, but if you're compiling with clang,
the ideal C/C++ source-level solution to ensure you get rotate asm output is to
use the rotate builtins described here:
http://clang.llvm.org/docs/LanguageExtensions.html#builtin-functions
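
A minimal usage sketch, assuming the 32-bit builtins listed on that page
(__builtin_rotateleft32 and __builtin_rotateright32):

unsigned int rol(unsigned int value, unsigned int amount)
{
    return __builtin_rotateleft32(value, amount);
}

unsigned int ror(unsigned int value, unsigned int amount)
{
    return __builtin_rotateright32(value, amount);
}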

The patch that added those was here:
https://reviews.llvm.org/D50924

There's still an open item about translating the existing *Microsoft*
extensions to the LLVM intrinsics. That's tracked in bug 37387.