Quuxplusone / LLVMBugzillaTest


Incorrect OpenMP target offload code at > -O0 optimization #43360

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR44390
Status NEW
Importance P normal
Reported by Christopher Daley (csdaley@lbl.gov)
Reported on 2019-12-27 13:01:44 -0800
Last modified on 2020-04-09 09:51:26 -0700
Version unspecified
Hardware PC Linux
CC a.bataev@hotmail.com, cchen@cray.com, csdaley@lbl.gov, deepak.eachempati@hpe.com, hfinkel@anl.gov, jdoerfert@anl.gov, llvm-bugs@lists.llvm.org, maskray@google.com
Fixed by commit(s)
Attachments bug-report.tar.bz2 (13531 bytes, application/x-bzip2)
clang-compiler-output-jan-22-2020.tar.bz2 (495843 bytes, application/x-bzip2)
CLANG_BUG_44390.tar (45056 bytes, application/x-tar)
smooth-debug-apply-op-ijk-modified.tar.gz (19226 bytes, application/x-gzip)
Blocks
Blocked by
See also
Created attachment 22968
Source file, bitcode files and PTX files.

I have ported an application named HPGMG from CUDA to OpenMP target offload. I
have found that 4 functions containing OpenMP target offload give incorrect
floating point results (beyond expected floating point discrepancies) when
using LLVM/Clang. The floating point results are correct only when compiling
at -O0 optimization. The host versions of the functions are correct at the
-O1, -O2 and -O3 optimization levels (tested by setting
OMP_TARGET_OFFLOAD=disabled).
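
As a sketch of that methodology (illustrative only; the kernel body and names
below are not from HPGMG), the same binary can be run twice, once offloaded
and once with OMP_TARGET_OFFLOAD=disabled, and the outputs compared:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for one of the affected kernels; the real expression is the
   apply_op_ijk stencil shown later in this report. */
static void kernel(const double *in, double *out, int n) {
  #pragma omp target teams distribute parallel for \
      map(to: in[0:n]) map(from: out[0:n])
  for (int i = 0; i < n; ++i)
    out[i] = 15.0 * (in[i] - 0.5) - 0.25 * in[i] * in[i];
}

int main(void) {
  enum { N = 1 << 20 };
  double *in = malloc(N * sizeof *in), *out = malloc(N * sizeof *out);
  for (int i = 0; i < N; ++i) in[i] = sin((double)i);
  kernel(in, out, N);
  double sum = 0.0;
  for (int i = 0; i < N; ++i) sum += out[i];
  /* Run once normally, then rerun with OMP_TARGET_OFFLOAD=disabled in the
     environment and compare the printed checksums. */
  printf("checksum %.17g\n", sum);
  free(in); free(out);
  return 0;
}

Built with something like clang -O1 -fopenmp
-fopenmp-targets=nvptx64-nvidia-cuda, any checksum difference between the two
runs points at the device code path.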

I have managed to isolate 1 expression in 1 of the 4 functions which produces
incorrect floating point results when using LLVM/Clang. I have placed this
expression in a function named apply_op_fn() in a separate file named
smooth-debug-apply-op-ijk.c. Compiling only this one file at -O0 optimization
is enough to get correct results for this function. It seems that LLVM/Clang
is generating incorrect PTX for apply_op_fn(): using -O0 for just the LLVM
bitcode to PTX compilation step is sufficient to get correct results.
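
That hypothesis can be checked in isolation by compiling the attached device
bitcode to PTX with llc at both optimization levels (a sketch; the exact
bitcode filename in the attachment is assumed here, and sm_70 matches a V100):

llc -march=nvptx64 -mcpu=sm_70 -O0 smooth-debug-apply-op-ijk.bc -o apply-op-O0.ptx
llc -march=nvptx64 -mcpu=sm_70 -O1 smooth-debug-apply-op-ijk.bc -o apply-op-O1.ptx
diff apply-op-O0.ptx apply-op-O1.ptx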

I have attached a tarball containing the source file, plus the LLVM bitcode
and PTX files at the -O0 and -O1 optimization levels. Results are correct when
using -O0 and incorrect when using -O1. My test platform consists of Intel
Skylake CPUs and NVIDIA V100 GPUs. I have tested two versions of LLVM/Clang,
master branch from 23 December 2019 and from 28 August 2019; both have the
same issue. I have also tested the IBM XLC compiler, which gives expected
results when compiling at either the -O0 or -Ofast optimization level.

I appreciate any help here. My expectation is that all 4 functions are affected
by the same compiler issue.
Thanks,
Chris
Quuxplusone commented 4 years ago

Attached bug-report.tar.bz2 (13531 bytes, application/x-bzip2): Source file, bitcode files and PTX files.

Quuxplusone commented 4 years ago

I did not quite understand what the problem is here. What result do you expect, and what do you get with offloading enabled? Any additional info might help to identify the bug and fix it (if it really is a bug).

Quuxplusone commented 4 years ago

(In reply to Alexey Bataev from comment #1)

I did not quite understand what the problem is here. What result do you expect, and what do you get with offloading enabled? Any additional info might help to identify the bug and fix it (if it really is a bug).

The offloading error is that the GPU kernel produces the wrong floating point results at -O1 or higher optimization levels. I have identified that the incorrect floating point results come from a macro expression named apply_op_ijk (please see the original attachment):

#define STENCIL_TWELFTH ( 0.0833333333333333333)  // 1.0/12.0;
#define apply_op_ijk(x)                                                                                                                            \
(                                                                                                                                                  \
 -b*h2inv*(                                                                                                                                        \
    STENCIL_TWELFTH*(                                                                                                                              \
      + beta_i[ijk        ]*( 15.0*(x[ijk-1      ]-x[ijk]) - (x[ijk-2        ]-x[ijk+1      ]) )                                                   \
      + beta_i[ijk+1      ]*( 15.0*(x[ijk+1      ]-x[ijk]) - (x[ijk+2        ]-x[ijk-1      ]) )                                                   \
      + beta_j[ijk        ]*( 15.0*(x[ijk-jStride]-x[ijk]) - (x[ijk-2*jStride]-x[ijk+jStride]) )                                                   \
      + beta_j[ijk+jStride]*( 15.0*(x[ijk+jStride]-x[ijk]) - (x[ijk+2*jStride]-x[ijk-jStride]) )                                                   \
      + beta_k[ijk        ]*( 15.0*(x[ijk-kStride]-x[ijk]) - (x[ijk-2*kStride]-x[ijk+kStride]) )                                                   \
      + beta_k[ijk+kStride]*( 15.0*(x[ijk+kStride]-x[ijk]) - (x[ijk+2*kStride]-x[ijk-kStride]) )                                                   \
    )                                                                                                                                              \
    + 0.25*STENCIL_TWELFTH*(                                                                                                                       \
      + (beta_i[ijk        +jStride]-beta_i[ijk        -jStride]) * (x[ijk-1      +jStride]-x[ijk+jStride]-x[ijk-1      -jStride]+x[ijk-jStride])  \
      + (beta_i[ijk        +kStride]-beta_i[ijk        -kStride]) * (x[ijk-1      +kStride]-x[ijk+kStride]-x[ijk-1      -kStride]+x[ijk-kStride])  \
      + (beta_j[ijk        +1      ]-beta_j[ijk        -1      ]) * (x[ijk-jStride+1      ]-x[ijk+1      ]-x[ijk-jStride-1      ]+x[ijk-1      ])  \
      + (beta_j[ijk        +kStride]-beta_j[ijk        -kStride]) * (x[ijk-jStride+kStride]-x[ijk+kStride]-x[ijk-jStride-kStride]+x[ijk-kStride])  \
      + (beta_k[ijk        +1      ]-beta_k[ijk        -1      ]) * (x[ijk-kStride+1      ]-x[ijk+1      ]-x[ijk-kStride-1      ]+x[ijk-1      ])  \
      + (beta_k[ijk        +jStride]-beta_k[ijk        -jStride]) * (x[ijk-kStride+jStride]-x[ijk+jStride]-x[ijk-kStride-jStride]+x[ijk-jStride])  \
                                                                                                                                                   \
      + (beta_i[ijk+1      +jStride]-beta_i[ijk+1      -jStride]) * (x[ijk+1      +jStride]-x[ijk+jStride]-x[ijk+1      -jStride]+x[ijk-jStride])  \
      + (beta_i[ijk+1      +kStride]-beta_i[ijk+1      -kStride]) * (x[ijk+1      +kStride]-x[ijk+kStride]-x[ijk+1      -kStride]+x[ijk-kStride])  \
      + (beta_j[ijk+jStride+1      ]-beta_j[ijk+jStride-1      ]) * (x[ijk+jStride+1      ]-x[ijk+1      ]-x[ijk+jStride-1      ]+x[ijk-1      ])  \
      + (beta_j[ijk+jStride+kStride]-beta_j[ijk+jStride-kStride]) * (x[ijk+jStride+kStride]-x[ijk+kStride]-x[ijk+jStride-kStride]+x[ijk-kStride])  \
      + (beta_k[ijk+kStride+1      ]-beta_k[ijk+kStride-1      ]) * (x[ijk+kStride+1      ]-x[ijk+1      ]-x[ijk+kStride-1      ]+x[ijk-1      ])  \
      + (beta_k[ijk+kStride+jStride]-beta_k[ijk+kStride-jStride]) * (x[ijk+kStride+jStride]-x[ijk+jStride]-x[ijk+kStride-jStride]+x[ijk-jStride])  \
    )                                                                                                                                              \
  )                                                                                                                                                \
)

The code is complicated and I cannot easily create a standalone test case where someone could check that a given set of inputs produces result "X". That is why I sent the LLVM bitcode and PTX files for a function containing this expression at the -O0 and -O1 optimization levels only. Perhaps a good place to start is for you to tell me which device-code transformations, and the corresponding compiler option names, are enabled at -O1. I could then test the individual compiler options one by one and report back which specific transformation is causing the issue. Is this possible?
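
(A possible starting point, sketched here under the assumption that the
legacy pass manager's debug flags apply to the attached device bitcode; the
filename below is illustrative:)

opt -O1 -debug-pass=Arguments smooth-debug-apply-op-ijk.bc -o /dev/null
opt -O1 -debug-pass=Structure smooth-debug-apply-op-ijk.bc -o /dev/null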

Thanks, Chris

Quuxplusone commented 4 years ago

(In reply to Christopher Daley from comment #2)

[Quoted text of comments #1 and #2, including the full apply_op_ijk macro, snipped; see above.]

What kind of incorrect result do you get: a completely wrong result, or a difference in the last digits only?

Quuxplusone commented 4 years ago

(In reply to Alexey Bataev from comment #3)

[Quoted text of comments #1 and #2, including the full apply_op_ijk macro, snipped; see above.]

What kind of incorrect result do you get: a completely wrong result, or a difference in the last digits only?

When I printed every element of the array initialized by this expression, I found differences as large as 1E-8. All floating point values are double precision. Even though this may seem small, the end result is that the solver dropped from 4th-order to 2nd-order convergence.
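
For context: the observed convergence order p can be estimated from the errors at two grid spacings as p = log2(err(h) / err(h/2)), so an error contribution of roughly 1E-8 that does not shrink with the grid will dominate the 4th-order discretization error on finer levels and pull the measured order down.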

Quuxplusone commented 4 years ago

Try to compile with -fno-fast-math or -fno-unsafe-math-optimizations.

Quuxplusone commented 4 years ago
(In reply to Alexey Bataev from comment #5)
> Try to compile with -fno-fast-math or -fno-unsafe-math-optimizations.

Hi Alexey,

I have been using -fno-fast-math this entire time.

I just now tested this option in combination with -fno-unsafe-math-optimizations. It has the same issue. I only get expected results when I add the "-O0" flag to the end of my compilation line for smooth-debug-apply-op-ijk.c:

clang -DUSE_BICGSTAB=1 -DUSE_SUBCOMM=1 -DUSE_FCYCLES=1 -DUSE_GSRB=1 \
  -DBLOCKCOPY_TILE_I=32 -DBLOCKCOPY_TILE_J=4 -DBLOCKCOPY_TILE_K=16 \
  -DBOUNDARY_TILE_I=64 -DBOUNDARY_TILE_J=16 -DBOUNDARY_TILE_K=16 \
  -DHOST_LEVEL_SIZE_THRESHOLD=10000 -DSPEC_OPENMP -DSPEC_OPENMP_TARGET \
  -DCLANG_BUG_44390 -DCUDA_UM_ALLOC -DCUDA_UM_ZERO_COPY -DMPI_ALLOC_ZERO_COPY \
  -DUSE_REG -DUSE_TEX -DUSE_CUDA -O1 -fopenmp -std=gnu99 \
  -fno-fast-math -fno-unsafe-math-optimizations \
  -I/usr/common/software/cuda/9.2.148/include \
  -fopenmp-targets=nvptx64-nvidia-cuda -O0 \
  -c directives/smooth-debug-apply-op-ijk.c -o smooth-debug-apply-op-ijk.o

For comparison: the host version of this function produces expected results at
all optimization levels, and the device version compiled with the IBM XLC
compiler also produces expected results at all optimization levels, including
-Ofast.

Thanks,
Chris
Quuxplusone commented 4 years ago

Could you attach the .s file from XLC at -O1, for example?

Quuxplusone commented 4 years ago

Attached clang-compiler-output-jan-22-2020.tar.bz2 (495843 bytes, application/x-bzip2): LLVM/Clang IR for function giving incorrect results

Quuxplusone commented 4 years ago

FWIW, we traced this down to a backend or NVPTX problem. Further investigation is pending, but so far I can rule out a frontend/middle-end bug.

Quuxplusone commented 4 years ago

Attached CLANG_BUG_44390.tar (45056 bytes, application/x-tar): Standalone reproducer

Quuxplusone commented 4 years ago

I tried disabling SeparateConstOffsetFromGEPPass in NVPTXTargetMachine.cpp, and in this way I can get the test to pass. However, I haven't found any issue in SeparateConstOffsetFromGEPPass itself. (SeparateConstOffsetFromGEPPass can create more opportunities for GVN, LICM, and CGP, but I haven't had luck finding any issue in those passes yet.)
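
The effect of that pass can also be reproduced outside the backend by running
it alone on the device bitcode with opt (a sketch; the bitcode filename is
assumed, and the flag spelling matches the legacy pass manager of that era):

llvm-dis smooth-debug-apply-op-ijk.bc -o before.ll
opt -separate-const-offset-from-gep -S smooth-debug-apply-op-ijk.bc -o after.ll
diff before.ll after.ll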

Quuxplusone commented 4 years ago

Do you have the IR before and after the pass? I can help look at it to determine whether it is the source of the error.

Quuxplusone commented 4 years ago

Attached smooth-debug-apply-op-ijk-modified.tar.gz (19226 bytes, application/x-gzip): Assembly of smooth-debug-apply-op-ijk

Quuxplusone commented 4 years ago
I'm a little puzzled here. The failure no longer happens with upstream
LLVM/Clang. It looks like the fix was in place in trunk by Feb 6 2020. The bug
still happens with the released LLVM/Clang 10.0.0.

trunk 20191223 with cuda/10.1.168 (79b3325be0b016fdc1a2c55bce65ec9f1e5f4eb6): FAIL
10.0.0 release with cuda/10.1.243 (d32170dbd5b0d54436537b6b75beaf44324e0c28): FAIL
trunk 20200206 with cuda/10.1.243 (80e17e5fcc09dc5baa940022e6988fcb08c5d92d): SUCCESS
trunk 20200225 with cuda/10.1.243 (63cef621f954eb87c494021725f4eeac89132d16): SUCCESS
trunk 20200313 with cuda/10.1.243 (20e36f31dfc1bb079dc6e6db5f692a4e90aa0c9d): SUCCESS
trunk 20200407 with cuda/10.1.243 (f85ae058f580e9d74c4a8f2f0de168c18da6150f): SUCCESS

Has there been a known fix, or is the bug silently hidden because the compiler
generates slightly different LLVM IR today?
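
One way to answer that is to bisect between the known-bad and known-good
revisions above (a sketch using git bisect's custom terms, since the target
here is a fixing commit rather than a breaking one):

cd llvm-project
git bisect start --term-old=broken --term-new=fixed
git bisect fixed 80e17e5fcc09dc5baa940022e6988fcb08c5d92d    # trunk 20200206: SUCCESS
git bisect broken 79b3325be0b016fdc1a2c55bce65ec9f1e5f4eb6   # trunk 20191223: FAIL
# Rebuild clang and rerun the reproducer at each step, marking revisions
# "fixed" or "broken" until git reports the first fixed commit.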