Open Quuxplusone opened 4 years ago
Attached bug-report.tar.bz2
(13531 bytes, application/x-bzip2): Source file, bitcode files and PTX files.
I did not quite understand what the problem is here. What result do you expect, and what do you get with offloading enabled? Any additional information would help to identify the bug and fix it (if it really is a bug).
(In reply to Alexey Bataev from comment #1)
> Did not quite understand what is the problem here. What result do you expect and what do you get with enabled offloading? Any additional info might help to identify the bug and fix it (if it is really a bug).
The offloading error is that the GPU kernel produces wrong floating-point results at -O1 or higher optimization levels. I have identified that the incorrect floating-point results come from a macro expression named apply_op_ijk (please see the original attachment):
#define STENCIL_TWELFTH ( 0.0833333333333333333) // 1.0/12.0;
#define apply_op_ijk(x) \
( \
-b*h2inv*( \
STENCIL_TWELFTH*( \
+ beta_i[ijk ]*( 15.0*(x[ijk-1 ]-x[ijk]) - (x[ijk-2 ]-x[ijk+1 ]) ) \
+ beta_i[ijk+1 ]*( 15.0*(x[ijk+1 ]-x[ijk]) - (x[ijk+2 ]-x[ijk-1 ]) ) \
+ beta_j[ijk ]*( 15.0*(x[ijk-jStride]-x[ijk]) - (x[ijk-2*jStride]-x[ijk+jStride]) ) \
+ beta_j[ijk+jStride]*( 15.0*(x[ijk+jStride]-x[ijk]) - (x[ijk+2*jStride]-x[ijk-jStride]) ) \
+ beta_k[ijk ]*( 15.0*(x[ijk-kStride]-x[ijk]) - (x[ijk-2*kStride]-x[ijk+kStride]) ) \
+ beta_k[ijk+kStride]*( 15.0*(x[ijk+kStride]-x[ijk]) - (x[ijk+2*kStride]-x[ijk-kStride]) ) \
) \
+ 0.25*STENCIL_TWELFTH*( \
+ (beta_i[ijk +jStride]-beta_i[ijk -jStride]) * (x[ijk-1 +jStride]-x[ijk+jStride]-x[ijk-1 -jStride]+x[ijk-jStride]) \
+ (beta_i[ijk +kStride]-beta_i[ijk -kStride]) * (x[ijk-1 +kStride]-x[ijk+kStride]-x[ijk-1 -kStride]+x[ijk-kStride]) \
+ (beta_j[ijk +1 ]-beta_j[ijk -1 ]) * (x[ijk-jStride+1 ]-x[ijk+1 ]-x[ijk-jStride-1 ]+x[ijk-1 ]) \
+ (beta_j[ijk +kStride]-beta_j[ijk -kStride]) * (x[ijk-jStride+kStride]-x[ijk+kStride]-x[ijk-jStride-kStride]+x[ijk-kStride]) \
+ (beta_k[ijk +1 ]-beta_k[ijk -1 ]) * (x[ijk-kStride+1 ]-x[ijk+1 ]-x[ijk-kStride-1 ]+x[ijk-1 ]) \
+ (beta_k[ijk +jStride]-beta_k[ijk -jStride]) * (x[ijk-kStride+jStride]-x[ijk+jStride]-x[ijk-kStride-jStride]+x[ijk-jStride]) \
\
+ (beta_i[ijk+1 +jStride]-beta_i[ijk+1 -jStride]) * (x[ijk+1 +jStride]-x[ijk+jStride]-x[ijk+1 -jStride]+x[ijk-jStride]) \
+ (beta_i[ijk+1 +kStride]-beta_i[ijk+1 -kStride]) * (x[ijk+1 +kStride]-x[ijk+kStride]-x[ijk+1 -kStride]+x[ijk-kStride]) \
+ (beta_j[ijk+jStride+1 ]-beta_j[ijk+jStride-1 ]) * (x[ijk+jStride+1 ]-x[ijk+1 ]-x[ijk+jStride-1 ]+x[ijk-1 ]) \
+ (beta_j[ijk+jStride+kStride]-beta_j[ijk+jStride-kStride]) * (x[ijk+jStride+kStride]-x[ijk+kStride]-x[ijk+jStride-kStride]+x[ijk-kStride]) \
+ (beta_k[ijk+kStride+1 ]-beta_k[ijk+kStride-1 ]) * (x[ijk+kStride+1 ]-x[ijk+1 ]-x[ijk+kStride-1 ]+x[ijk-1 ]) \
+ (beta_k[ijk+kStride+jStride]-beta_k[ijk+kStride-jStride]) * (x[ijk+kStride+jStride]-x[ijk+jStride]-x[ijk+kStride-jStride]+x[ijk-jStride]) \
) \
) \
)
The code is complicated and I cannot easily create a standalone test case with which someone could check that the result for a given set of inputs is "X". That is why I sent the LLVM bitcode and PTX files for a function containing this expression, at the -O0 and -O1 optimization levels only. Perhaps a good place to start would be for you to tell me which device-code transformations, and the corresponding compiler option names, are enabled at -O1. I could then test the individual compiler options one by one and report back which specific transformation causes the issue. Is this possible?
Thanks, Chris
(In reply to Christopher Daley from comment #2)
> [full quote of comment #2, including the apply_op_ijk macro, snipped]
What kind of incorrect result do you get? Completely wrong result or difference in last digits only?
(In reply to Alexey Bataev from comment #3)
> [nested quotes of comments #1 and #2 snipped]
> What kind of incorrect result do you get? Completely wrong result or difference in last digits only?
When I printed every element of the array initialized by this expression, I found differences as large as 1E-8. All floating-point values are double precision. Even though this may seem small, the end result is that the solver dropped from 4th-order to 2nd-order convergence.
Try to compile with -fno-fast-math or -fno-unsafe-math-optimizations.
(In reply to Alexey Bataev from comment #5)
> Try to compile with -fno-fast-math or -fno-unsafe-math-optimizations.
Hi Alexey,
I have been using -fno-fast-math this entire time. I just now tested this option in combination with -fno-unsafe-math-optimizations. It has the same issue. I only get expected results when I add the "-O0" flag to the end of my compilation line for smooth-debug-apply-op-ijk.c:
clang -DUSE_BICGSTAB=1 -DUSE_SUBCOMM=1 -DUSE_FCYCLES=1 -DUSE_GSRB=1 \
  -DBLOCKCOPY_TILE_I=32 -DBLOCKCOPY_TILE_J=4 -DBLOCKCOPY_TILE_K=16 \
  -DBOUNDARY_TILE_I=64 -DBOUNDARY_TILE_J=16 -DBOUNDARY_TILE_K=16 \
  -DHOST_LEVEL_SIZE_THRESHOLD=10000 -DSPEC_OPENMP -DSPEC_OPENMP_TARGET \
  -DCLANG_BUG_44390 -DCUDA_UM_ALLOC -DCUDA_UM_ZERO_COPY -DMPI_ALLOC_ZERO_COPY \
  -DUSE_REG -DUSE_TEX -DUSE_CUDA -O1 -fopenmp -std=gnu99 -fno-fast-math \
  -fno-unsafe-math-optimizations -I/usr/common/software/cuda/9.2.148/include \
  -fopenmp-targets=nvptx64-nvidia-cuda -O0 \
  -c directives/smooth-debug-apply-op-ijk.c -o smooth-debug-apply-op-ijk.o
For comparison, the host version of this function produces expected results at all optimization levels. Also, the device version of this function compiled with the IBM XLC compiler produces expected results at all optimization levels, including -Ofast.
Thanks,
Chris
Could you attach the .s file from XLC at -O1, for example?
Attached clang-compiler-output-jan-22-2020.tar.bz2
(495843 bytes, application/x-bzip2): LLVM/Clang IR for function giving incorrect results
FWIW, we traced this down to a backend or NVPTX problem. Further investigation is pending, but so far I can rule out a frontend/middle-end bug.
Attached CLANG_BUG_44390.tar
(45056 bytes, application/x-tar): Standalone reproducer
I tried disabling SeparateConstOffsetFromGEPPass in NVPTXTargetMachine.cpp, and with it disabled the test passes. However, I haven't found any issue in SeparateConstOffsetFromGEPPass itself. (SeparateConstOffsetFromGEPPass can create more opportunities for GVN, LICM, and CGP, but I haven't had any luck finding an issue in those passes yet.)
Do you have the IR from before and after the pass? I can help look at it to determine whether that is the source of the error.
Attached smooth-debug-apply-op-ijk-modified.tar.gz
(19226 bytes, application/x-gzip): Assembly of smooth-debug-apply-op-ijk
I'm a little puzzled here. The failure no longer happens with upstream LLVM/Clang. It looks like the fix was in place in trunk by Feb 6 2020. The bug still happens with the released version of LLVM/Clang-10.0.0.

trunk 20191223 with cuda/10.1.168 (79b3325be0b016fdc1a2c55bce65ec9f1e5f4eb6): FAIL
10.0.0 release with cuda/10.1.243 (d32170dbd5b0d54436537b6b75beaf44324e0c28): FAIL
trunk 20200206 with cuda/10.1.243 (80e17e5fcc09dc5baa940022e6988fcb08c5d92d): SUCCESS
trunk 20200225 with cuda/10.1.243 (63cef621f954eb87c494021725f4eeac89132d16): SUCCESS
trunk 20200313 with cuda/10.1.243 (20e36f31dfc1bb079dc6e6db5f692a4e90aa0c9d): SUCCESS
trunk 20200407 with cuda/10.1.243 (f85ae058f580e9d74c4a8f2f0de168c18da6150f): SUCCESS

Has there been a known fix, or is the bug silently hidden because the compiler generates slightly different LLVM IR today?