Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

suboptimal code generation in loop unrolling #11382

Open Quuxplusone opened 12 years ago

Quuxplusone commented 12 years ago
Bugzilla Link PR11196
Status NEW
Importance P normal
Reported by s3734770@mail.zih.tu-dresden.de
Reported on 2011-10-20 14:01:41 -0700
Last modified on 2020-08-04 22:29:34 -0700
Version trunk
Hardware All All
CC atrick@apple.com, baldrick@free.fr, bhavana.kilambi@blackfigtech.com, geek4civic@gmail.com, llvm-bugs@lists.llvm.org, nicholas@mxc.ca, nlewycky@google.com
Fixed by commit(s)
Attachments y.ll (4683 bytes, application/octet-stream)
Blocks
Blocked by
See also
Created attachment 7495
test case

I attached a llvm-ir translation of the following program:
static init
begin
  if(copy('hello', 1, 2) == 'el') {
    log('correct')
  }
end

This program should be completely constant folded, but the optimization stucks
at a loop:
the loop begins at line 51 and compares two strings of the same size.
As you see in line 61+63 and 62+64, the characters of the strings at this
specific position are loaded and this exactly is the problem.

It is about the same issue like http://llvm.org/bugs/show_bug.cgi?id=11142
but this time, it's a bit more complex: the memory is not directly copied, but
it's received by a getelementptr+load from the malloced memory.
The challenge is to proof that the loop will stay exactly in the copied range
of [3 x i8]* @label416 so that the getelementptr call can fetched from
@label416 instead of %result.i.i10

An other question is why the loop that only consists of two iterations is not
constant folded.
The loop has the following structure:

start:
 br label %test
test:
 %loopvar = phi i32 [%loopvar2, %body], [0, %start]
 %finished = icmp sgt i32 %loopvar, 2 ; or any other constant or non-constant
 br i1 %finished, label %continue, label %body
body:
 ; sth fancy
 %otherbreak = icmp .....
 br i1 %otherbreak, label %exception, label %test
continue:

This report is maybe two bugs but when you fix one of the two issues, the test
case will become unusable to test the other issue because of constant folding.
Quuxplusone commented 12 years ago

Attached y.ll (4683 bytes, application/octet-stream): test case

Quuxplusone commented 12 years ago
There is something very strange in your testcase:

  declare i32 @relocate_kernel(i32, i32, i32, i32, i32)
  @__LOAD_PHYSICAL_ADDR = external global [0 x i8]
[...]
  call void @llvm.memcpy.p0i8.p0i8.i32(i8* %call8, i8* inttoptr (i32 add (i32 add (i32 ptrtoint (i32 (i32, i32, i32, i32, i32)* @relocate_kernel to i32), i32 ptrtoint ([0 x i8]* @__LOAD_PHYSICAL_ADDR to i32)), i32 -1073741824) to i8*), i32 2048, i32 1, i1 false)

What in the world that constant expression argument to memcpy!? The address of
the function relocate_kernel plus the address of external global
__LOAD_PHYSICAL_ADDR plus -2^30 (aka. 0xC0000000)?
Quuxplusone commented 12 years ago
(In reply to comment #1)
> There is something very strange in your testcase:
>
>   declare i32 @relocate_kernel(i32, i32, i32, i32, i32)
>   @__LOAD_PHYSICAL_ADDR = external global [0 x i8]
> [...]
>   call void @llvm.memcpy.p0i8.p0i8.i32(i8* %call8, i8* inttoptr (i32 add (i32
> add (i32 ptrtoint (i32 (i32, i32, i32, i32, i32)* @relocate_kernel to i32),
i32
> ptrtoint ([0 x i8]* @__LOAD_PHYSICAL_ADDR to i32)), i32 -1073741824) to i8*),
> i32 2048, i32 1, i1 false)
>
> What in the world that constant expression argument to memcpy!? The address of
> the function relocate_kernel plus the address of external global
> __LOAD_PHYSICAL_ADDR plus -2^30 (aka. 0xC0000000)?

I just find   tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %6, i8*
getelementptr inbounds ([3 x i8]* @label416, i64 0, i64 0), i32 3, i32 1, i1
false) nounwind

Where do you see the words relocate, kernel etc.?
Are you sure you opened the right file?
Quuxplusone commented 12 years ago
I could reduce it to a very small testcase:
(you can replace the target datalayout by your own)

target datalayout = "e-p:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-
f32:32:32-f64:32:64-f80:32:32-v64:64:64-v128:128:128-a0:0:64"

@char = internal constant [2 x i8] c"g " ; change this string to test the other
case
@hasspace = internal constant [15 x i8] c"it has a space\00"
@nospace = internal constant [16 x i8] c"it has no space\00"
declare void @puts(i8*)
declare i8* @malloc(i32)

define void @main() {
start:
 %a = call i8*(i32)* @malloc(i32 4)
 store i8 32, i8* %a
 br label %test
test:
 %loopvar = phi i32 [%loopvar2, %body], [0, %start]
 %loopvar2 = add i32 %loopvar, 1
 %finished = icmp sgt i32 %loopvar, 2 ; or any other constant or non-constant
 br i1 %finished, label %continue, label %body
body:
 ; sth fancy
 %valp = getelementptr [2 x i8]* @char, i32 0, i32 %loopvar
 %val = load i8* %valp
 %val2 = load i8* %a
 %otherbreak = icmp eq i8 %val, %val2
 br i1 %otherbreak, label %exception, label %test
continue:
 call void(i8*)* @puts(i8* getelementptr([16 x i8]* @nospace, i32 0, i32 0))
 ret void

exception:
 call void(i8*)* @puts(i8* getelementptr([15 x i8]* @hasspace, i32 0, i32 0))
 ret void
}
Quuxplusone commented 12 years ago
Ok, using comment 3 as the testcase. This is an ordering issue. "opt -gvn -loop-
unroll -simplifycfg" produces:

define void @main() {
start:
  %a = call i8* @malloc(i32 4)
  store i8 32, i8* %a
  call void @puts(i8* getelementptr inbounds ([15 x i8]* @hasspace, i32 0, i32 0))
  ret void
}

The problem is that %a = malloc + store in %start, then there's the load inside
the loop body. GVN isn't fooled at all, but SCEV doesn't know what that load
will load.

Another possible fix would be adding a heap->stack transform then letting
mem2reg mop it up.
Quuxplusone commented 12 years ago
> The problem is that %a = malloc + store in %start, then there's the load
inside
> the loop body. GVN isn't fooled at all, but SCEV doesn't know what that load
> will load.
>

Yes, I had to add the malloc instruction to reproduce the above test case.

> Another possible fix would be adding a heap->stack transform then letting
> mem2reg mop it up.

That would fix the problem from comment 3, but the original test case is way
more complex.

opt -O3 -S gives me:

; ModuleID = 'x.ll'
target datalayout = "e-p:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-
f32:32:32-f64:32:64-f80:32:32-v64:64:64-v128:128:128-a0:0:64"

@hasspace = internal constant [15 x i8] c"it has a space\00"
@nospace = internal constant [16 x i8] c"it has no space\00"

declare void @puts(i8* nocapture) nounwind

declare noalias i8* @malloc(i32) nounwind

define void @main() nounwind {
start:
  %a = tail call i8* @malloc(i32 4)
  store i8 32, i8* %a, align 1
  br label %test

test:                                             ; preds = %body, %start
  %loopvar = phi i32 [ %loopvar2, %body ], [ 0, %start ]
  %finished = icmp sgt i32 %loopvar, 2
  br i1 %finished, label %continue, label %body

body:                                             ; preds = %test
  %loopvar2 = add i32 %loopvar, 1
  %otherbreak = icmp eq i32 %loopvar, 1
  br i1 %otherbreak, label %exception, label %test

continue:                                         ; preds = %test
  tail call void @puts(i8* getelementptr inbounds ([16 x i8]* @nospace, i64 0, i64 0))
  ret void

exception:                                        ; preds = %body
  tail call void @puts(i8* getelementptr inbounds ([15 x i8]* @hasspace, i64 0, i64 0))
  ret void
}

running opt a second time workarounds the issue. The problem is that the first
case already had two opt runs, so after running opt 4 times the correct result
was produced. Can't opt take a parameter to run itself as often as needed?