Non-optimal result for SSE min() vectorization

Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Non-optimal result for SSE min() vectorization #35469

Open Quuxplusone opened 6 years ago

Quuxplusone commented 6 years ago


Bugzilla Link	PR36496
Status	NEW
Importance	P enhancement
Reported by	Tim Shen (timshen91@gmail.com)
Reported on	2018-02-23 18:05:45 -0800
Last modified on	2018-02-24 10:13:38 -0800
Version	trunk
Hardware	PC Linux
CC	craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments	`b.ll` (811 bytes, text/plain)
Blocks
Blocked by
See also

Created attachment 19946
Test

The test shows that changing "a < b ? a : b" to "b < a ? b : a" causes a
codegen difference, even though semantically they are the same.

Quuxplusone commented 6 years ago

Attached b.ll (811 bytes, text/plain): Test

Quuxplusone commented 6 years ago

Generated assembly:

Min1:
    movdqa  %xmm0, %xmm2
    movdqa  %xmm1, %xmm0
    pcmpgtq %xmm2, %xmm0
    blendvpd    %xmm0, %xmm2, %xmm1
    movapd  %xmm1, %xmm0
    retq

Min2:
    movdqa  %xmm0, %xmm2
    pcmpgtq %xmm1, %xmm0
    blendvpd    %xmm0, %xmm1, %xmm2
    movapd  %xmm2, %xmm0
    retq

Quuxplusone commented 6 years ago

We're constrained quite a bit by the ABI and available instructions there. a will always have to start in xmm0 and b will always have to start in xmm1. The return value must go into xmm0. The blendvpd instructions always uses xmm0 for the blend control. It's not an encodable field of the instruction's binary representation. There is no pcmplt instruction we're forced to use pcmpgt.

The SelectionDAG process is largely blissfully unaware of the register constraints. So we don't know there's value in exploiting the equality case. We won't realize the problem until register allocation , but by then its too late to rewrite all the instructions.

Quuxplusone commented 6 years ago

It doesn't solve the problem in general, but we could argue that the reduced
case is a missed IR canonicalization for commutative ops.

Take a simpler pair of examples:

define i32 @binop_commute_1(i32 %a, i32 %b) {
  %r = mul i32 %a, %b
  ret i32 %r
}

define float @binop_commute_2(float %a, float %b) {
  %r = fmul float %b, %a
  ret float %r
}

InstCombine doesn't do anything with these, but -reassociate does...and for
some reason with split personality:

$ ./opt -reassociate min.ll -S

define i32 @binop_commute_1(i32 %a, i32 %b) {
  %r = mul i32 %b, %a
  ret i32 %r
}

define float @binop_commute_2(float %a, float %b) {
  %r = fmul float %a, %b
  ret float %r
}

Quuxplusone commented 6 years ago

This reminds me of [Bug #27780] where we need to invert the pblendvb blend mask to allow us to commute the blend inputs.

Is there any way that we can safely adjust multiple instructions through X86InstrInfo::commuteInstructionImpl to perform the commutations? Altering the comparison in this bug, or the constant pool blend mask in PR27780.