Flang/LLVM not vectorizing loop

kiranchandramohan commented 5 years ago

The code given below contains a module with an allocatable variable having the target attribute set. This module variable is used inside the loop and the loop is not vectorized by Flang/LLVM.

module m1
  integer, Allocatable, Target, Dimension(:) :: arr
end module

program mn
use m1
allocate(arr(100))
do i=1,100
  arr(i) = i
end do
end program

Observations 1) TBAA information is not generated for the address load and the store to the array when compiled with optimizations (O3). 2) When the target attribute is removed this TBAA information is generated. 3) The address load is possibly loop-invariant and can be hoisted out of the loop, but LLVM does not have enough information to do this. 4) This loop is vectorized by gfortran, ifort.

kiranchandramohan commented 5 years ago

LLVM files generated by flang2 with and without the target attribute.

tv1.ll.txt tv1-notarget.ll.txt

flang-cavium commented 5 years ago

The problem is actually in flang1. Attached are 3 diffs for outputs from flang1:

tv1-flang1.cmod.diff
tv1-flang1.ilm.diff
tv1-flang1.stb.diff tv1-flang1.cmod.diff.txt tv1-flang1.ilm.diff.txt tv1-flang1.stb.diff.txt

flang-cavium commented 5 years ago

Try the attached patch -- flang1-semfin.c-target-vector.patch.txt.

Output from -fsave-optimization-record is in tv1-f90.opt.yaml.txt. You can clearly see the vectorization happening.
LLVM IR output is in tv1-f90.ll.txt. flang1-semfin.c-target-vector.patch.txt tv1-f90.opt.yaml.txt tv1-f90.ll.txt

kiranchandramohan commented 5 years ago

Thanks @flang-cavium for working on this. I will look into the patch that you provided today.

kiranchandramohan commented 5 years ago

@flang-cavium , I can confirm that with this patch this loop is vectorized. This patch also helped to vectorize one of the application benchmarks which had this issue. I will now try to get this patch through some internal tests and also test with other applications which had similar vectorisation issues. Thanks once again for working on this.

flang-cavium commented 5 years ago

@flang-cavium , I can confirm that with this patch this loop is vectorized. This patch also helped to vectorize one of the application benchmarks which had this issue.

That is a Good Thing[tm].

schweitzpgi commented 5 years ago

TBAA information is not generated for the address load and the store to the array when compiled with optimizations (O3).

When the target attribute is removed this TBAA information is generated.

These two details are related. The TARGET attribute, as we know, should only be put on variables that can be/will be pointed to by variables with a POINTER attribute. TARGET explicitly tells the compiler to be pessimistic as the variable may be aliased in some way. (Fortran has the opposite sense of C here; Fortran compilers can assume variables do not alias unless the programmer says otherwise.)

The address load is possibly loop-invariant and can be hoisted out of the loop, but LLVM does not have enough information to do this.

Yes, this is why LLVM's vectorizer is having trouble. arr is also ALLOCATABLE as well as a MODULE variable, so Flang is placing the object in a block of memory. The ALLOCATE statement will construct the various parts of arr inside that block of memory, but the layout is not exposed to the LLVM optimizer. The block just looks like an array of bytes.

This loop is vectorized by gfortran, ifort.

It is also vectorized by pgfortran.

Try the attached patch -- flang1-semfin.c-target-vector.patch.txt.

This patch gets around the aforementioned issues by throwing away the TARGET attribute on arr (and somewhat by happenstance). It's generally incorrect to assume any variable with TARGET is not going to be aliased somehow, but the front-end will force this assumption if the variable with the TARGET attribute can only be referenced in the current scope and the variable is not pointer assigned in the current scope. But in this case, arr is a global, so this local analysis is insufficient to prove there is no pointer aliasing in the general case.

Two approaches can be entertained to solve this problem correctly.

Do a global analysis that determines that all USEs of arr in the whole program do not introduce aliasing within a scope. (In general, this is a hard problem. Consider derived types that have polymorphic pointers, allocations that escape a scope, and separate compilation, for example.)
Decompose the "block of bytes" into an LLVM strongly typed product type and clean up all the references to the subcomponents to also be strongly typed. The goal would be for flang to do its own SROA-like decomposition and to not present memory accesses as bare i8* arithmetic. If I do these transformations by hand, this loop is fully unrolled and vectorized.

    movaps  .LCPI1_2(%rip), %xmm0   # xmm0 = [1,2,3,4]
    movups  %xmm0, _m1_0_-68(,%rax,4)
    movaps  .LCPI1_3(%rip), %xmm0   # xmm0 = [5,6,7,8]
    movups  %xmm0, _m1_0_-52(,%rax,4)
    movaps  .LCPI1_4(%rip), %xmm0   # xmm0 = [9,10,11,12]
    movups  %xmm0, _m1_0_-36(,%rax,4)
    movaps  .LCPI1_5(%rip), %xmm0   # xmm0 = [13,14,15,16]
    movups  %xmm0, _m1_0_-20(,%rax,4)
    movaps  .LCPI1_6(%rip), %xmm0   # xmm0 = [17,18,19,20]
    movups  %xmm0, _m1_0_-4(,%rax,4)
    movaps  .LCPI1_7(%rip), %xmm0   # xmm0 = [21,22,23,24]
    movups  %xmm0, _m1_0_+12(,%rax,4)
    movaps  .LCPI1_8(%rip), %xmm0   # xmm0 = [25,26,27,28]
    movups  %xmm0, _m1_0_+28(,%rax,4)
    movaps  .LCPI1_9(%rip), %xmm0   # xmm0 = [29,30,31,32]
    movups  %xmm0, _m1_0_+44(,%rax,4)
    movaps  .LCPI1_10(%rip), %xmm0  # xmm0 = [33,34,35,36]
    movups  %xmm0, _m1_0_+60(,%rax,4)
    movaps  .LCPI1_11(%rip), %xmm0  # xmm0 = [37,38,39,40]
    movups  %xmm0, _m1_0_+76(,%rax,4)
    movaps  .LCPI1_12(%rip), %xmm0  # xmm0 = [41,42,43,44]
    movups  %xmm0, _m1_0_+92(,%rax,4)
    movaps  .LCPI1_13(%rip), %xmm0  # xmm0 = [45,46,47,48]
    movups  %xmm0, _m1_0_+108(,%rax,4)
    movaps  .LCPI1_14(%rip), %xmm0  # xmm0 = [49,50,51,52]
    movups  %xmm0, _m1_0_+124(,%rax,4)
    movaps  .LCPI1_15(%rip), %xmm0  # xmm0 = [53,54,55,56]
    movups  %xmm0, _m1_0_+140(,%rax,4)
    movaps  .LCPI1_16(%rip), %xmm0  # xmm0 = [57,58,59,60]
    movups  %xmm0, _m1_0_+156(,%rax,4)
    movaps  .LCPI1_17(%rip), %xmm0  # xmm0 = [61,62,63,64]
    movups  %xmm0, _m1_0_+172(,%rax,4)
    movaps  .LCPI1_18(%rip), %xmm0  # xmm0 = [65,66,67,68]
    movups  %xmm0, _m1_0_+188(,%rax,4)
    movaps  .LCPI1_19(%rip), %xmm0  # xmm0 = [69,70,71,72]
    movups  %xmm0, _m1_0_+204(,%rax,4)
    movaps  .LCPI1_20(%rip), %xmm0  # xmm0 = [73,74,75,76]
    movups  %xmm0, _m1_0_+220(,%rax,4)
    movaps  .LCPI1_21(%rip), %xmm0  # xmm0 = [77,78,79,80]
    movups  %xmm0, _m1_0_+236(,%rax,4)
    movaps  .LCPI1_22(%rip), %xmm0  # xmm0 = [81,82,83,84]
    movups  %xmm0, _m1_0_+252(,%rax,4)
    movaps  .LCPI1_23(%rip), %xmm0  # xmm0 = [85,86,87,88]
    movups  %xmm0, _m1_0_+268(,%rax,4)
    movaps  .LCPI1_24(%rip), %xmm0  # xmm0 = [89,90,91,92]
    movups  %xmm0, _m1_0_+284(,%rax,4)
    movaps  .LCPI1_25(%rip), %xmm0  # xmm0 = [93,94,95,96]
    movups  %xmm0, _m1_0_+300(,%rax,4)
    movaps  .LCPI1_26(%rip), %xmm0  # xmm0 = [97,98,99,100]
    movups  %xmm0, _m1_0_+316(,%rax,4)

flang-cavium commented 5 years ago

But in this case, arr is a global, so this local analysis is insufficient to prove there is no pointer aliasing in the general case.

Why is this correct for static arrays, which is precisely what the existing flang1 code does?

Why is pointer aliasing an inherent problem in the first place? The only case where pointer aliasing would be a problem would be if it was done through incompatible pointer types.

Does Fortran even allow pointer aliasing through incompatible pointer types in the first place?

kiranchandramohan commented 5 years ago

Thanks @schweitzpgi for having a look and commenting in detail. Thanks @flang-cavium for the follow-up question.

I was trying to create a program which would show an issue with @flang-cavium's patch but I was not able to come up with one. My plan was to create an inter-iteration dependency due to aliasing and have the array with the target attribute and the pointer pointing it to be in separate modules like in the following code.

module m1
  integer, Allocatable, Target, Dimension(:) :: arr
  integer, parameter :: sz = 1000
end module

module m2
  use m1
  integer, pointer, Dimension(:) :: ptr
contains
subroutine setup()
  allocate(arr(sz))
  ptr = arr
end subroutine
end module

program mn
use m2
do i=2,sz
  arr(i) = ptr(i-1) + i
end do
end program

mvinay commented 5 years ago

@flang-cavium , I can confirm that with this patch this loop is vectorized. This patch also helped to vectorize one of the application benchmarks which had this issue. I will now try to get this patch through some internal tests and also test with other applications which had similar vectorisation issues. Thanks once again for working on this.

TBAA metadata doesn't get generated whenever there is a pointer assignment statement for the same array in the same routine. For example,

module m1
  integer, Allocatable, Target, Dimension(:) :: arr
end module

program pgm
  use m1
  integer, pointer, Dimension(:) :: ptr
  integer :: i = 2
  ptr => arr   ! <<--- Pointer assignment.
  i  = arr(i)
end program

In the above example, Flang doesn't produce TBAA info for arr(i) access! But once the assignment is commented out, the patch seems to work fine.

schweitzpgi commented 5 years ago

Hi Kiran,

The following code shows aliasing on array arr.

module mod1
  integer, allocatable, target, dimension(:) :: arr
  integer, parameter :: size = 1000
contains
  subroutine init_arr
    allocate (arr(size))
    do i = 1,size
       arr(i) = i
    end do
  end subroutine init_arr

  subroutine access_at(ptr, off)
    integer off
    integer, pointer :: ptr(:)
    ptr => arr(off:)
  end subroutine access_at

  subroutine dump(size)
    integer size
    print *, "*** dump arr ***"
    do i = 1, size
       print *, i, arr(i)
    end do
    print *, ""
  end subroutine dump
end module mod1

module mod2
contains
  subroutine test1
    use mod1
    integer, pointer :: p(:)
    integer, pointer :: q(:)
    call init_arr
    call access_at(p, 1)
    call access_at(q, 2)
    do i = 1, 200
       q(i) = p(i) + 100
    end do
    call dump(50)
  end subroutine test1
end module mod2

program p
  use mod2
  call test1
end program p

If we look at subroutine test1 and comment out that p and q have pointer attributes, the compiler could, for example, vectorize the loop in test1. However, p and q do have the pointer attribute, so the compiler must be more conservative. It cannot vectorize the loop.

Now note that there is no difference between p and arr. Each is a name for the same array in storage. So, if we instead wrote the body of the loop in test1 as

    q(i) = arr(i) + 100

the compiler should generate code that produces the same results. In this case, arr has the target attribute (not a pointer), but it can and does alias with p and q. Therefore, it is clearly not safe for the compiler to remove the target attribute on the global/common symbol arr.

flang-cavium commented 5 years ago

Therefore, it is clearly not safe for the compiler to remove the target attribute on the global/common symbol arr.

And vectorization does not happen in either case:

q(i) = p(i) + 100

or

q(i) = arr(i) + 100

--- !Analysis
Pass:            loop-vectorize
Name:            CantIdentifyArrayBounds
DebugLoc:        { File: t4.f90, Line: 39, Column: 1 }
Function:        mod2_test1_
Args:            
  - String:          'loop not vectorized: '
  - String:          cannot identify array bounds
...
--- !Missed
Pass:            loop-vectorize
Name:            MissedDetails
DebugLoc:        { File: t4.f90, Line: 39, Column: 1 }
Function:        mod2_test1_
Args:            
  - String:          loop not vectorized
...

kiranchandramohan commented 5 years ago

Thanks @schweitzpgi for your comments and the example. After going through the example that you provided and I agree with your argument that q and arr should not alias and hence the target attribute on arr should not be removed.

@flang-cavium I also do not see any vectorisation. But this is probably due to other issues. Isn't @schweitzpgi 's point that telling LLVM that "q" and "arr" do not alias in "q(i) = arr(i) + 100" incorrect?

Thanks @mvinay for your comment. Yes, I did observe that having "arr" on the RHS of a pointer assignment prevents the target attribute from being dropped. I wanted to try the case (as in the example i posted in a comment above) where the pointer assignment happens in a different scope.

flang-cavium commented 5 years ago

But this is probably due to other issues.

Which other issues?

Isn't @schweitzpgi 's point that telling LLVM that "q" and "arr" do not alias in "q(i) = arr(i) + 100" incorrect?

LLVM figured out on its own that "q" and "arr" should not alias in "q(i) = arr(i) + 100", so I don't see how @schweitzpgi's point would be relevant. It may be correct in theory, but not relevant.

Pointer aliasing rules. A construct like the one in @schweitzpgi's example won't alias in C either.

flang-compiler / flang

Flang/LLVM not vectorizing loop #640