Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

[AVX] opportunity for better code by transforming vec w/one element used to scalar #11923

Open Quuxplusone opened 12 years ago

Quuxplusone commented 12 years ago
Bugzilla Link PR11775
Status NEW
Importance P enhancement
Reported by Matt Pharr (matt@pharr.org)
Reported on 2012-01-16 18:36:10 -0800
Last modified on 2019-03-16 07:01:04 -0700
Version trunk
Hardware PC All
CC llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, nadav.rotem@me.com, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments bad.ll (3592 bytes, application/octet-stream)
Blocks
Blocked by
See also PR39973
Created attachment 7887
examples

The attached test case has two versions of a loop over float values in memory,
where each time through the loop an 8-wide vector of floats is loaded and added
to an accumulated <8 x float> sum, which the function returns.

In the first version, foo(), the %iter_val42342 value is a vector that has
value <0,1,2,3,4,5,6,7> the first time through the loop, <8,9,10,...> the
second time through, and so forth.  As it turns out, this value is only used in
an extractelement instruction, the result of which is used to index into the
array of floats.

Here is the generated code for the loop body (with top of tree, llc -
mattr=+avx):

LBB0_1:                                 ## %foreach_full_body
                                        ## =>This Inner Loop Header: Depth=1
    addl    $8, %ecx
    vmovd   %ecx, %xmm3
    vinsertf128 $1, %xmm3, %ymm3, %ymm3
    vpermilps   $0, %ymm3, %ymm3 ## ymm3 = ymm3[0,0,0,0,4,4,4,4]
    vmovd   %xmm2, %edx
    shll    $2, %edx
    movslq  %edx, %rdx
    vmovups (%rdi,%rdx), %ymm2
    vaddps  %ymm2, %ymm0, %ymm0
    vextractf128    $1, %ymm3, %xmm2
    vextractf128    $1, %ymm1, %xmm4
    vpaddd  %xmm4, %xmm2, %xmm2
    vpaddd  %xmm1, %xmm3, %xmm3
    cmpl    %eax, %ecx
    vinsertf128 $1, %xmm2, %ymm3, %ymm2
    jl  LBB0_1

The code is going through all of the work to maintain all of the vector values,
even though only one is needed (doubly-painful with AVX and only 4-wide integer
instructions.)  This is also inhibiting other optimizations.

In the bar() function in the attached, I've manually transformed this vector
into a scalar value.  The resulting code is much nicer.

LBB1_1:                                 ## %foreach_full_body
                                        ## =>This Inner Loop Header: Depth=1
    movslq  %ecx, %rcx
    vmovups (%rdi,%rcx), %ymm1
    vaddps  %ymm1, %ymm0, %ymm0
    addl    $32, %ecx
    addl    $8, %edx
    cmpl    %eax, %edx
    jl  LBB1_1

This suggests that it might be worthwhile to look for computations on vectors
where only one of the elements is used, and to lower these down to the
corresponding scalar computation if possible.
Quuxplusone commented 12 years ago

Attached bad.ll (3592 bytes, application/octet-stream): examples

Quuxplusone commented 12 years ago

I see two possible optimizations:

The code below is NO-OP. A simple InstCombine optimization should eliminate it.

%iter_val42 = add <8 x i32> %smear_counter41, <i32 0, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>

Second, reducing vectors to scalars requires a more sophisticated optimization, especially in cases where the vector goes into a PHI node. I also encountered a similar problem in user code.

Quuxplusone commented 12 years ago

I ended up writing some code to do the transformation to scalar code in ispc; the relevant stuff is here: https://github.com/ispc/ispc/blob/master/llvmutil.cpp#L1392-1508. (BSD license, so feel free to grab anything useful.)

Quuxplusone commented 5 years ago
(In reply to Nadav Rotem from comment #1)
> I see two possible optimizations:
>
> The code below is NO-OP.  A simple InstCombine optimization should eliminate
> it.
>
> %iter_val42 = add <8 x i32> %smear_counter41, <i32 0, i32 undef, i32 undef,
> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>

6+ years later...
This (and many similar other) splat-with-undef patterns are folded in IR. In
this case, instsimplify can kill the add.

> Second, reducing vectors to scalars requires a more sophisticated
> optimization, especially in cases where the vector goes into a PHI node.  I
> also encountered a similar problem in user code.

This still doesn't happen. We have several scalarization folds for
extractelement in instcombine, but not one that would hoist the extract ahead
of a phi.
Quuxplusone commented 5 years ago

Current codegen: https://godbolt.org/z/coNFgE