[MLIR] test failure on SystemZ

Quuxplusone commented 3 years ago


Bugzilla Link	PR51204
Status	NEW
Importance	P normal
Reported by	Jonas Paulsson (paulsson@linux.vnet.ibm.com)
Reported on	2021-07-24 06:28:14 -0700
Last modified on	2021-10-19 18:08:51 -0700
Version	unspecified
Hardware	PC Linux
CC	ataei@google.com, ezhulenev@google.com, joker.eph@gmail.com, jpienaar@google.com, paulsson@linux.vnet.ibm.com, riddleriver@gmail.com, uweigand@de.ibm.com
Fixed by commit(s)
Attachments	`tc_preinl.ll` (3648 bytes, text/plain)
Blocks
Blocked by
See also

It seems that ea7f211 (https://reviews.llvm.org/D97599) has unfortunately been
failing on SystemZ ever since it was committed, since there has not been a
build-bot setup for MLIR on SystemZ yet. We intend to do that now as soon as
this error is resolved...

ea7f211 is the first commit where mlir-cpu-runner/math_polynomial_approx.mlir
is failing. It does so in @exp() which was added by this commit. The last two
CHECK:s fail: nan, nan is produced instead of inf, 0.

I built this with latest (trunk) clang with

cmake -G Ninja -DLLVM_ENABLE_PROJECTS="mlir" -DCMAKE_BUILD_TYPE="Release" -
DLLVM_ENABLE_ASSERTIONS=On -DLLVM_BUILD_EXAMPLES=ON -DCMAKE_C_COMPILER=clang -
DCMAKE_CXX_COMPILER=clang++ -DLLVM_TARGETS_TO_BUILD=SystemZ -
DLLVM_ENABLE_RTTI=ON $LLVM_SRC/llvm

cmake --build . --target check-mlir

Failed Tests (4):
  MLIR :: mlir-cpu-runner/async-group.mlir
  MLIR :: mlir-cpu-runner/async-value.mlir
  MLIR :: mlir-cpu-runner/async.mlir
  MLIR :: mlir-cpu-runner/math_polynomial_approx.mlir

The three first failures have disapeared from trunk, so now it's only this one
left to handle. It would be very nice to get some help with debugging this
since I am not familiar with this area myself, thanks.

Quuxplusone commented 3 years ago

ping!

Quuxplusone commented 3 years ago

Could you perhaps disable these on SystemZ? That way it would allow you to set up the bot (and avoid further regressions).

Quuxplusone commented 3 years ago

Yes I'd suggest the same, let's mark this test as XFAIL on system-Z and add the bot, so that we avoid further regressions/failures on this platform.

Quuxplusone commented 3 years ago

@Jonas: Did you look into setting it up such? A current change (https://reviews.llvm.org/D107896) could effect it (created for different reason and not focused on this) but without build bot I don't think we'd notice a difference.

Quuxplusone commented 3 years ago

Sorry for the delay on this... we will try to set up the build-bot soon, while
also trying to understand this test failure.

I found that with this function (taking from the failing test):

func @main() {
  // CHECK: inf
  %inf = constant 0x7f800000 : f32
  %exp_inf = math.exp %inf : f32
  vector.print %exp_inf : f32
  return
}

I get '0' printed, but if I disable the inliner I get 'inf' as expected:

mlir-opt tc.mlir -test-math-polynomial-approximation -convert-vector-to-llvm -
convert-std-to-llvm | bin/mlir-cpu-runner -e main -entry-point-result=void -O0 -
shared-libs=lib/libmlir_c_runner_utils.so -shared-
libs=lib/libmlir_runner_utils.\
so
0

bin/mlir-opt tc.mlir -test-math-polynomial-approximation -convert-vector-to-
llvm -convert-std-to-llvm | bin/mlir-cpu-runner -e main -entry-point-
result=void -O0 -shared-libs=lib/libmlir_c_runner_utils.so -shared-
libs=lib/libmlir_runner_ut\
ils.so -inline-threshold=0
inf

This is what I see the inliner doing:

define void @main() !dbg !3 {
  %1 = call float @llvm.floor.f32(float 0x7FF0000000000000), !dbg !7
  %2 = fmul float %1, 0x3FE62E4300000000, !dbg !9
  %3 = fsub float 0x7FF0000000000000, %2, !dbg !10
  %4 = fmul float %3, %3, !dbg !11
  %5 = fmul float %4, %4, !dbg !12
  %6 = call float @llvm.fma.f32(float 1.000000e+00, float %3, float 1.000000e+00), !dbg !13
  %7 = call float @llvm.fma.f32(float 0x3FC5993C80000000, float %3, float 0x3FDFFB2B40000000), !dbg !14
  %8 = call float @llvm.fma.f32(float 0x3F8AEAFAC0000000, float %3, float 0x3FA2C8FC60000000), !dbg !15
  %9 = call float @llvm.fma.f32(float %7, float %4, float %6), !dbg !16
  %10 = call float @llvm.fma.f32(float %8, float %5, float %9), !dbg !17
  %11 = fptosi float %1 to i32, !dbg !18
  %12 = add i32 %11, 127, !dbg !19
  %13 = shl i32 %12, 23, !dbg !20
  %14 = bitcast i32 %13 to float, !dbg !21
  %15 = fmul float %10, %14, !dbg !22
  %16 = icmp sle i32 %11, 127, !dbg !23
  %17 = icmp sge i32 %11, -127, !dbg !24
  %18 = and i1 %16, %17, !dbg !25
  %19 = select i1 %18, float %15, float 0x7FF0000000000000, !dbg !26
  call void @printF32(float %19), !dbg !27
  call void @printNewline(), !dbg !28
  ret void, !dbg !29
}

define void @_mlir_main(i8** %0) {
  call void @main()
  ret void
}

=>

*** IR Dump After Function Integration/Inlining (inline) ***
Printing <null> Function
*** IR Dump After Function Integration/Inlining (inline) ***define void @main()
!dbg !3 {
  %1 = call float @llvm.floor.f32(float 0x7FF0000000000000), !dbg !7
  %2 = fmul float %1, 0x3FE62E4300000000, !dbg !9
  %3 = fsub float 0x7FF0000000000000, %2, !dbg !10
  %4 = fmul float %3, %3, !dbg !11
  %5 = fmul float %4, %4, !dbg !12
  %6 = call float @llvm.fma.f32(float 1.000000e+00, float %3, float 1.000000e+00), !dbg !13
  %7 = call float @llvm.fma.f32(float 0x3FC5993C80000000, float %3, float 0x3FDFFB2B40000000), !dbg !14
  %8 = call float @llvm.fma.f32(float 0x3F8AEAFAC0000000, float %3, float 0x3FA2C8FC60000000), !dbg !15
  %9 = call float @llvm.fma.f32(float %7, float %4, float %6), !dbg !16
  %10 = call float @llvm.fma.f32(float %8, float %5, float %9), !dbg !17
  %11 = fptosi float %1 to i32, !dbg !18
  %12 = add i32 %11, 127, !dbg !19
  %13 = shl i32 %12, 23, !dbg !20
  %14 = bitcast i32 %13 to float, !dbg !21
  %15 = fmul float %10, %14, !dbg !22
  %16 = icmp sle i32 %11, 127, !dbg !23
  %17 = icmp sge i32 %11, -127, !dbg !24
  %18 = and i1 %16, %17, !dbg !25
  %19 = select i1 %18, float %15, float 0x7FF0000000000000, !dbg !26
  call void @printF32(float %19), !dbg !27
  call void @printNewline(), !dbg !28
  ret void, !dbg !29
}
*** IR Dump After Function Integration/Inlining (inline) ***define void
@_mlir_main(i8** %0) {
  call void @printF32(float poison), !dbg !3
  call void @printNewline(), !dbg !9
  ret void
}

So it seems that during inlining, the computations in main is for some reason
replaced with a poison value. So the '0' is really just poison, not '0'
necessarily, it seems.

Does anyone know what could be the problem here?

Quuxplusone commented 3 years ago

Can you post the entire .ll file and the opt command to reproduce? (the .ll file will include the SystemZ target triple / data layout, which I won't get if I run cmake --build . --target check-mlir on my system).

Quuxplusone commented 3 years ago

(In reply to Mehdi Amini from comment #6)

Can you post the entire .ll file and the opt command to reproduce? (the .ll file will include the SystemZ target triple / data layout, which I won't get if I run cmake --build . --target check-mlir on my system).

Not sure how to dump the .ll file with mlir-cpu-runner...?

The output from the first command above (mlir-opt) was:

builtin.module attributes {llvm.data_layout = ""} { llvm.func @printNewline() llvm.func @printF32(f32) llvm.func @main() { %0 = llvm.mlir.constant(0.693147182 : f32) : f32 %1 = llvm.mlir.constant(1.000000e+00 : f32) : f32 %2 = llvm.mlir.constant(0.499705136 : f32) : f32 %3 = llvm.mlir.constant(0.168738902 : f32) : f32 %4 = llvm.mlir.constant(0.0366896503 : f32) : f32 %5 = llvm.mlir.constant(1.314350e-02 : f32) : f32 %6 = llvm.mlir.constant(23 : i32) : i32 %7 = llvm.mlir.constant(0x7F800000 : f32) : f32 %8 = llvm.mlir.constant(127 : i32) : i32 %9 = llvm.mlir.constant(-127 : i32) : i32 %10 = "llvm.intr.floor"(%7) : (f32) -> f32 %11 = llvm.fmul %10, %0 : f32 %12 = llvm.fsub %7, %11 : f32 %13 = llvm.fmul %12, %12 : f32 %14 = llvm.fmul %13, %13 : f32 %15 = "llvm.intr.fma"(%1, %12, %1) : (f32, f32, f32) -> f32 %16 = "llvm.intr.fma"(%3, %12, %2) : (f32, f32, f32) -> f32 %17 = "llvm.intr.fma"(%5, %12, %4) : (f32, f32, f32) -> f32 %18 = "llvm.intr.fma"(%16, %13, %15) : (f32, f32, f32) -> f32 %19 = "llvm.intr.fma"(%17, %14, %18) : (f32, f32, f32) -> f32 %20 = llvm.fptosi %10 : f32 to i32 %21 = llvm.add %20, %8 : i32 %22 = llvm.shl %21, %6 : i32 %23 = llvm.bitcast %22 : i32 to f32 %24 = llvm.fmul %19, %23 : f32 %25 = llvm.icmp "sle" %20, %8 : i32 %26 = llvm.icmp "sge" %20, %9 : i32 %27 = llvm.and %25, %26 : i1 %28 = llvm.select %27, %24, %7 : i1, f32 llvm.call @printF32(%28) : (f32) -> () llvm.call @printNewline() : () -> () llvm.return } }

Does that help you?

Quuxplusone commented 3 years ago

Attached tc_preinl.ll (3648 bytes, text/plain): reduced test case for inliner

Quuxplusone commented 3 years ago

It seems that when I run this test case on SystemZ with opt and inliner the
result is always inlined to poison even when passing x86/powerPC target triples.

In gdb I see InstCombiner (from CloneFunction.cpp:394) simplifying the
instructions like:

  %1 = call float @llvm.floor.f32(float 0x7FF0000000000000)
     => 0x7FF0000000000000 +Inf
  %2 = fmul float %1, 0x3FE62E4300000000
     => 0x7FF0000000000000 +Inf
  %3 = fsub float 0x7FF0000000000000, %2
     => 0x7FF8000000000000 NaN
  %4 = fmul float %3, %3
     => 0x7FF8000000000000 NaN
  %5 = fmul float %4, %4
     => 0x7FF8000000000000 NaN
  %6 = call float @llvm.fma.f32()
     => 0x7FF8000000000000 NaN
  %7 = call float @llvm.fma.f32()
     => 0x7FF8000000000000 NaN
  %8 = call float @llvm.fma.f32()
     => 0x7FF8000000000000 NaN
  %9 = call float @llvm.fma.f32()
     => 0x7FF8000000000000 NaN
  %10 = call float @llvm.fma.f32()
     => 0x7FF8000000000000 NaN
  %11 = fptosi float %1 to i32                              => poison
  %12 = add i32 %11, 127                                    => poison
  %13 = shl i32 %12, 23                                     => poison
  %14 = bitcast i32 %13 to float                            => float poison
  %15 = fmul float %10, %14                                 => float poison
  %16 = icmp sle i32 %11, 127                               => poison
  %17 = icmp sge i32 %11, -127                              => poison
  %18 = and i1 %16, %17                                     => poison
  %19 = select i1 %18, float %15, float 0x7FF0000000000000  => float poison
  call void @printF32(float %19)
    => printF32(float poison)

- The subtraction of +Inf with +Inf resulting in NaN surprised me just a
little...
- The fptosi from +Inf results in poison.
- %19 is a select betwen float poison and +Inf. The i1 input is poison, and the
expected result of the test case is +Inf. For some reason InstCombiner chooses
the LHS.

I am not sure if any mathematical result here is invalid or not... Could
something be target specific?

Quuxplusone commented 3 years ago

I have now built this test case also on my x86 laptop, and it seems like there
is something wrong in the first place with ExpApproximation::matchAndRewrite().

This function (same as before):

func @main() {
  // CHECK: inf
  %inf = constant 0x7f800000 : f32
  %exp_inf = math.exp %inf : f32
  vector.print %exp_inf : f32
  return
}

Is expanded my by mlir-opt (ExpApproximation) to a sequence which seems
suspicious - it contains:

    ...
    %7 = llvm.mlir.constant(0x7F800000 : f32) : f32   // +Inf
    ...
    %10 = "llvm.intr.floor"(%7) : (f32) -> f32        // +Inf
    ...
    %20 = llvm.fptosi %10 : f32 to i32                // poison!
    %21 = llvm.add %20, %8  : i32
    %22 = llvm.shl %21, %6  : i32
    %23 = llvm.bitcast %22 : i32 to f32
    %24 = llvm.fmul %19, %23  : f32
    %25 = llvm.icmp "sle" %20, %8 : i32
    %26 = llvm.icmp "sge" %20, %9 : i32
    %27 = llvm.and %25, %26  : i1
    %28 = llvm.select %27, %24, %7 : i1, f32
    llvm.call @printF32(%28) : (f32) -> ()

So also on x86 I see:

*** IR Dump After Function Integration/Inlining (inline) ***define void
@_mlir_main(i8** %0) {
  call void @printF32(float poison), !dbg !3
  call void @printNewline(), !dbg !9
  ret void
}

# *** IR Dump After X86 Speculative Execution Side Effect Suppression (x86-
seses) ***:
# Machine code for function _mlir_main: NoPHIs, TracksLiveness, NoVRegs,
TiedOpsRewritten
Function Live Ins: $rdi

bb.0 (%ir-block.1):
  liveins: $rdi
  frame-setup PUSH64r undef $rax, implicit-def $rsp, implicit $rsp
  CFI_INSTRUCTION def_cfa_offset 16
  renamable $rax = MOV64ri @printF32, debug-location !27; <stdin>:34:5
  renamable $xmm0 = IMPLICIT_DEF
  CALL64r killed renamable $rax, <regmask ...
  renamable $rax = MOV64ri @printNewline, debug-location !28; <stdin>:35:5
  CALL64r killed renamable $rax, <regmask ...
  $rax = frame-destroy POP64r implicit-def $rsp, implicit $rsp
  CFI_INSTRUCTION def_cfa_offset 8
  RETQ

So at this point it looks like this by chance prints inf on x86, while it
prints 0 on SystemZ (and without inlining on SystemZ it by chance printed
inf...).

The ExpApproximation of a constant infinity seems to involve an fptosi of +Inf,
which results in poison, so it looks like there is a bug here, or?

Quuxplusone commented 2 years ago

ping!

The SystemZ test has been marked XFAIL, but we should still try to understand the underlying problem before closing this...

Quuxplusone commented 2 years ago

Sorry I wasn't getting notification on this, Can you post the ll output with/without inliner?

I think what is happening is related to some target specific +inf arithmetic which we can avoid.

Quuxplusone commented 2 years ago

(In reply to Ahmed S. Taei from comment #12)
> Sorry I wasn't getting notification on this, Can you post the ll output
> with/without inliner?
>
> I think what is happening is related to some target specific +inf arithmetic
> which we can avoid.

The test case I posted in comment #8 is the actual input to the inliner. The
function containing the fp instructions (@main) is inlined into @mlir_main()
into just a poison value. So, with the inliner, a poison value is printed,
without inlining, the function is called as it is (the "ll output without
inliner" is identical to the function as shown in the test case)

As I wrote in following comments after that, this seems to fail in the same way
on both X86 and SystemZ, but by chance it prints the right value some times
(it's printing an undef).

I still think it's the fp -> SInt conversion of infinity that stands out (see
comment #9)... If this would be target dependent as you said, could this be the
point of the problem?

I see that if I run 'opt early-cse' on the test case (tc_preinl.ll), the
@main() function is also reduced to just a poison value. So it seems that the
function is potentially optimized into a poison value, but if that optimization
never happens, the fp instructions remain and the SystemZ backend emits
instructions (including the fp->sint conversion) and prints the expected
result...

Quuxplusone commented 2 years ago

@Ahmed: Did my reply help you understand the problem or is there anything else you need from me?

Quuxplusone commented 2 years ago

Right, I can change the approximation to avoid any arithmetic on +inf/-inf. I Will send a diff you can try on SystemZ from your side.

Quuxplusone commented 2 years ago

PTL, https://reviews.llvm.org/D112115

Quuxplusone / LLVMBugzillaTest

[MLIR] test failure on SystemZ #50173