OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Compiler Failure on POWER8 with /kernel/power/i{c,d,s,z}a{min,max}.c #2254

Closed grisuthedragon closed 4 years ago

grisuthedragon commented 5 years ago

I updated my installations on my POWER8 machine and since version 0.3.6 up to the current development branch, the build process fails with:

$ make  MAKE_NB_JOBS=1
...
cc -c -Ofast -mcpu=power8 -mtune=power8 -mvsx -malign-power -DUSE_OPENMP -fno-fast-math -fopenmp -DMAX_STACK_ALLOC=2048 -fopenmp -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DUSE_OPENMP -DNO_WARMUP -DMAX_CPU_NUMBER=160 -DMAX_PARALLEL_NUMBER=1 -DVERSION=\"0.3.8.dev\" -DASMNAME=isamax_k -DASMFNAME=isamax_k_ -DNAME=isamax_k_ -DCNAME=isamax_k -DCHAR_NAME=\"isamax_k_\" -DCHAR_CNAME=\"isamax_k\" -DNO_AFFINITY -I.. -UDOUBLE  -UCOMPLEX -UCOMPLEX -UDOUBLE -DUSE_ABS  -UUSE_MIN ../kernel/power/isamax.c -o isamax_k.o
../kernel/power/isamax.c: In function ‘siamax_kernel_64’:
../kernel/power/isamax.c:288:1: internal compiler error: in build_int_cst_wide, at tree.c:1210
 }
 ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://bugzilla.redhat.com/bugzilla> for instructions.
Preprocessed source stored into /tmp/ccriHyPV.out file, please attach this to your bugreport.
make[1]: *** [isamax_k.o] Error 1
make[1]: Leaving directory `/root/OpenBLAS/kernel'
make: *** [libs] Error 1

OS Details: CentOS 7.6 ppc64el, gcc 4.8.5, gfortran 4.8.5, IBM POWER8 LC822

martin-frbg commented 5 years ago

Interesting - this file is indeed new in 0.3.6, but (from debugging #2233) the ICE in question appears to be fixed in more recent compilers. (Unfortunately, source line 288 from the error message tells us nothing, as it is simply the closing brace on the last line of the file.)

grisuthedragon commented 5 years ago

If I find some time, I will take a closer look at the routines and figure out why they are crashing.

martin-frbg commented 5 years ago

I see now that this has already been reported (against gcc 7.1.0) by susilehtola as https://bugzilla.redhat.com/show_bug.cgi?id=1740539. (From the tests for #2233, the code compiles with gcc 5.4 and again from 7.3 onwards (7.3.0/8.2.1/9.1), so it could be that something was fixed, broken, and fixed again.)

martin-frbg commented 5 years ago

Reproduced in the unicamp.br minicloud. Preprocessed source of isamax.c is here: isamax_preprocessed.txt, in case somebody wants to add it to the Red Hat bugzilla entry (@susilehtola). Also, the problem goes away at -O0 (instead of our default -Ofast), so it can be worked around with a #pragma GCC optimize "O0" in the affected files.
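For reference, a minimal sketch of that workaround, assuming the pragma sits near the top of the affected file (the placeholder function below is illustrative only, not the real kernel code):

/* Sketch: override the build's default -Ofast for this translation unit.
 * A "#pragma GCC optimize" applies to every function defined after it,
 * so placing it near the top of kernel/power/isamax.c effectively
 * disables optimization for the whole file. */
#pragma GCC optimize "O0"

/* Placeholder standing in for isamax_kernel_64 and friends; the real
 * kernel bodies stay unchanged apart from the pragma above. */
static float placeholder_kernel(const float *x, int n)
{
    float m = 0.0f;
    for (int i = 0; i < n; i++)
        if (x[i] > m)
            m = x[i];
    return m;
}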

martin-frbg commented 5 years ago

Also of note is that in my tests (gcc-4.8.5 20150623 Red Hat 4.8.5-36 on CentOS 7) it is only the single precision versions (isamin/isamax.c and icamin/icamax.c) that cause an ICE.

susilehtola commented 5 years ago

@martin-frbg thanks, uploaded. However, in my experience there's very little movement on RHEL bugzilla tickets.

If someone has a RHEL subscription, they should make an issue about the bug with Red Hat; things get moving when a paying customer complains.

grisuthedragon commented 5 years ago

I did some experiments with isamax.c and found that the snippet around line 111,

temp0 += temp1;                                                                                                                                                                                                    
temp0 += temp1; //temp0+32

and the one around line 170,

    quadruple_indices = vec_sel( quadruple_indices,ind2,r3);
    quadruple_values= vec_sel(quadruple_values,vv0,r3);      

    temp0+=temp1;
    temp0+=temp1; //temp0+32

seem to disturb the compiler.

Update: narrowed the range a bit.

quickwritereader commented 5 years ago

Could you comment out or remove some lines to find which parts cause this error?

grisuthedragon commented 5 years ago

@quickwritereader That is exactly what I did. With the above-mentioned lines commented out, the code compiles.

quickwritereader commented 5 years ago

Is it the addition? Could you remove that C++-style comment, or replace it with a C-style /* */ one?

grisuthedragon commented 5 years ago

The C++-style comment is already in OpenBLAS and is not from me. I tried changing "+=" to "temp0 = temp0 + temp1" because I had run into this problem in ancient times with gcc and the vectorization of a piece of code, but that did not change anything.

quickwritereader commented 5 years ago

Maybe rename/refactor temp0 to another name. Feel free to refactor and to change the comment blocks to C style. For example: temp0 = vec_add( temp1 , vec_add(temp0,temp1));

grisuthedragon commented 5 years ago

Renaming and replacing the addition does not help.

quickwritereader commented 5 years ago

From which versions of gcc does it disappear?
Remove all register hints too. Also, rewrite the + operations in this style: result = vec_add(operand1, operand2).

grisuthedragon commented 5 years ago

I tried it with gcc 5.4 and there it works. The problem is that, as long as RHEL 7 with gcc 4.8.5 is the OS recommended by IBM for POWER8, it should work with the old gcc as well.

Rewriting the + to vec_add does not help either.

martin-frbg commented 5 years ago

Strangely, I saw no effect from removing these lines (on the gcc 4.8.5 build, that is), and I have not managed to narrow it down to anything less than the entire siamax_kernel_64 routine. My current impression is that this is some more general issue with the register __vector declarations, or perhaps an interaction with OpenMP (at least I found some resolved gcc/gfortran issues about failures of build_int_cst_wide in conjunction with OpenMP from the gcc 4.x timeframe).

grisuthedragon commented 5 years ago

Since the gcc 4.x series is on its way out, a possible fix would be to deactivate these kernels for GCC < 5.x and bypass the problem that way.

quickwritereader commented 5 years ago

OK, let's replace the double addition with a single one by introducing a new variable before the for loop, right after the temp1 = temp1 << 1; line:

__vector unsigned int add_32 = temp1 << 1;

or

__vector unsigned int add_32 = {32, 32, 32, 32};

then replace the two additions with one:

temp0 = vec_add(temp0, add_32);

AFAIK there should be enough registers to hold the additional value.
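A rough sketch of that suggestion in a hypothetical loop skeleton (not the real kernel code), just to show where the single vec_add would replace the two += statements; compile with -mvsx:

#include <altivec.h>

/* Hypothetical index walk: temp0 holds the element indices of the current
 * block, add_32 is the combined stride of the two former additions. */
static __vector unsigned int walk_indices(__vector unsigned int temp0, int blocks)
{
    __vector unsigned int add_32 = {32, 32, 32, 32};

    for (int i = 0; i < blocks; i++) {
        /* ... min/max comparisons on the current block would go here ... */
        temp0 = vec_add(temp0, add_32);  /* was: temp0 += temp1; temp0 += temp1; */
    }
    return temp0;
}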

martin-frbg commented 5 years ago

As long as we do not suspect anything fundamentally wrong with these files, perhaps the easiest solution is to wrap the #pragma I suggested above in a GCC version check - probably #if defined(__GNUC__) && __GNUC__ < 6 to be on the safe side. There is no need to deactivate these kernels as such, just to deactivate the default -Ofast compiler optimization for them.
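For illustration, the guarded pragma could look roughly like this at the top of each affected file (the cutoff at GCC 6 just follows the suggestion above and is not a verified boundary):

/* Sketch: keep the default optimization on modern compilers, but drop to
 * -O0 for old GCC releases known to ICE on these kernels. */
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ < 6)
#pragma GCC optimize "O0"
#endif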

quickwritereader commented 5 years ago

With the recent issues, I see that inline assembly would have been the best choice, but it doubles the work for every new architecture.

martin-frbg commented 5 years ago

I now see that a gcc 8.3.0 build on power8 (Ubuntu 19.04) returns 200+ errors in the single-precision complex LAPACK tests. Disabling optimization of icamin/icamax.c works around this as well (and incidentally also removes the single failure in CTFSM that we have been seeing across all architectures).

quickwritereader commented 5 years ago

I will try to check those this week.

martin-frbg commented 5 years ago

More importantly (and thanks to isuruf's help with debugging our Travis script), DYNAMIC_ARCH=1 with gcc 5.4 fails in the power9 part with another ICE:

../kernel/power/caxpy.c: In function ‘caxpy_kernel_16’:
../kernel/power/caxpy.c:62:33: internal compiler error: in rs6000_emit_le_vsx_move, at config/rs6000/rs6000.c:9157
         register __vector float vy_0 = vec_vsx_ld( offset_0 ,vptr_y ) ;
                                 ^

and this time the #pragma GCC optimize "O0" does not help. (This is probably why the CI for #2243 had failed - I had merged your PR as I could not reproduce the error in the unicamp minicloud, but now I realize I had only built for the power8 target with this old gcc.) Update: this error does not occur with (at least) gcc 8.3.0.

edelsohn commented 5 years ago

Disabling features to work around issues in GCC 4.8.5 is okay, although IBM should work with RH to backport the fixes. But ICEs in GCC 7/8 should be reported and fixed in GCC. None of this was reported to the GCC community.

quickwritereader commented 5 years ago

@martin-frbg how about using gcc-generated, optimized assembly files for both power8 and power9 directly? I intended to convert them to inline assembly but had to postpone that.

martin-frbg commented 5 years ago

Yes, I guess that would work (assuming the generated assembly is not totally unreadable from a human perspective) - once we are convinced that we have found a working version, that is. 8.3.0 miscompiling icamin/icamax certainly complicates matters, and I suspect creating self-contained test cases for the GCC folks may be a non-trivial task as well.

edelsohn commented 5 years ago

It's not great to create too much hand-written assembly if the compiler produces efficient code. The compiler will adapt to future processors. Hand-written assembly code creates another location to update.

If the testcase cannot be reduced, please open a GCC Bugzilla with the pre-processed source file.

quickwritereader commented 5 years ago

@martin-frbg could you check whether it passes after reverting the caxpy fix? I had to use intrinsics because of the power8 error, if you remember.

edelsohn commented 5 years ago

Also, you can try to use the creduce program to create a minimal testcase. Or upload the entire pre-processed source file and the IBM Toolchain team will reduce it.

quickwritereader commented 5 years ago

@martin-frbg thanks for generating and merging the assembly kernels using conditionals. One could later further optimize the generated assembly.

martin-frbg commented 5 years ago

Closed by the commit, reopening as the underlying compiler issue has not yet been reported to the gcc team (while the related RH bugzilla entry has seen no activity). Might also make sense to review my choice of a minimum gcc version for PPC9 support - perhaps turning this into a choice between the original and intrinsics versions of the affected code(s) would work as well.

edelsohn commented 5 years ago

I believe that the original bug already was fixed. It works in GCC 5.4, 7.3, and 8.x because of backports from GCC 8. For some reason GCC 6.x was overlooked.

IBM will work with RH to backport the patch to GCC 4.8.5 RH in RHEL.

I am concerned about the second bug:

I now see that a gcc 8.3.0 build on power8 (Ubuntu 19.04) returns 200+ errors in the single-precision complex LAPACK tests. Disabling optimization of icamin/icamax.c works around this as well (and incidentally also removes the single failure in CTFSM that we have been seeing across all architectures)

The reduced test corresponds to the first bug. The second bug needs to be opened separately upstream in GCC Bugzilla. It appears to be a separate, undiagnosed, and unfixed issue.

martin-frbg commented 5 years ago

There is also the ICE with GCC 5.4 compiling caxpy.c with -mcpu=power9 (no problem with the exact same source and -mcpu=power8), mentioned above. Preprocessed source: caxpy.txt

edelsohn commented 5 years ago

I don't have access to GCC 5.4.

IBM Advance Toolchain 9.0 includes a patched version of GCC 5.5. I cannot reproduce the failure, although you don't provide the exact invocation. IBM AT9.0 is far out of date and no longer supported. GCC 5.x is out of date and no longer supported.

The comment above about GCC 5.4 (https://github.com/xianyi/OpenBLAS/issues/2254#issuecomment-532110686) reports that it does not fail with GCC 8.3. IBM Advance Toolchain 12.0 includes GCC 8.3, and I similarly cannot elicit an error.

Why does the failure on GCC 5.4 matter? Other than GCC 4.8.5 RH, I don't understand why any of the older, out of date releases of GCC matter, nor why OpenBLAS should make any changes to accommodate them.

Is Unicamp Power Minicloud using GCC 5.4?

martin-frbg commented 5 years ago

GCC 5.4 is or was what Ubuntu 16 came with, which just happens to be what our current Travis CI job uses. One of the twenty or so Unicamp VM choices comes with Ubuntu 16.04, which allowed me to reproduce and debug the initial CI build failure. Other than that, the GCC 5.4 ICE probably does not matter, as most affected users should be able to update their compiler. And an obvious compile failure is certainly better in this regard than a silent miscompilation.

edelsohn commented 5 years ago

Can you please produce pre-processed source code for the icamax.c error with a specific compiler invocation for GCC 8.3? You can open a GCC Bugzilla issue or I can open one.

martin-frbg commented 5 years ago

Preprocessed source: icamax.txt
Compiler options: -Ofast -mcpu=power8 -mtune=power8 -mvsx -malign-power -fno-fast-math -fopenmp -m64
"Bad" intermediate assembly from this: icamax_badS.txt
"Good" intermediate assembly from overriding the -Ofast with a pragma GCC optimize "O0": icamax_goodS.txt

GCC info: gcc version 8.3.0 (Ubuntu 8.3.0-6ubuntu1)

edelsohn commented 5 years ago

Do you have a driver program that demonstrates the wrong result? Have you tried anything other than O0? The differences are vast between highly optimized and no optimization, so it is difficult to ascertain what is miscompiled.

martin-frbg commented 5 years ago

No simple driver program so far; this problem was originally exposed only by running the LAPACK testsuite included in the netlib reference implementation (which is why I wrote "silent miscompilation" of the library earlier). I do not think I experimented with other optimization levels for this particular case - the O0 workaround was already in place for the older gcc versions at the time (and there O1 was already sufficient to cause the original ICE that started all this). I really should not be spending time on this right now, but I'll see if I can do a quick test at -O1.

edelsohn commented 5 years ago

I can try to run through the variants if you can provide the incantation to run the LAPACK testsuite in OpenBLAS.

It really would be helpful to try -O3 and -O2 -- something closer to -Ofast.

martin-frbg commented 5 years ago

It is simply make; make lapack-test - "good" shows no numeric errors for single-precision complex, "bad" has 275 of them. And it already goes wrong at -O1; I'll see if I can narrow it down to one of the 40-odd options enabled by that level.

edelsohn commented 5 years ago

It fails with -O1? Okay. The individual flags do not control all of the -O1 options, so you may not be able to find the exact optimization.

martin-frbg commented 5 years ago

Indeed none of the options documented in https://gcc.gnu.org/onlinedocs/gcc-8.3.0/gcc/Optimize-Options.html as being enabled by -O1 triggers the test failures. Trying now with the options flagged in a diff of the gcc -Q --help=optimize output at -O0 and -O1. (Update: No hit there either)

edelsohn commented 5 years ago

First, Abdel always recommended that I build OpenBLAS with NO_LAPACK=1.

I built OpenBLAS on CentOS 7 running on a Power8 system in the GNU Compile Farm (gcc112) using IBM AT12.0, which is GCC 8.3.1 with additional patches backported.

The initial results are

            -->   LAPACK TESTING SUMMARY  <--
SUMMARY             nb test run   numerical error     other error
================    ===========   ===============     ===============
REAL                    1284869       1   (0.000%)      5   (0.000%)
DOUBLE PRECISION        1039157       1   (0.000%)      3   (0.000%)
COMPLEX                   14868   67620 (454.802%)    111   (0.747%)
COMPLEX16                424579       0   (0.000%)      2   (0.000%)

--> ALL PRECISIONS      2763473   67622   (2.447%)    121   (0.004%)

I see no excessive failures for REAL. The failures for COMPLEX seem to be related to memory, and I noticed the comments in the documentation about ulimit -s unlimited. I am re-running the tests with that setting. I will report the results when the test completes.

At the moment I cannot reproduce the failures with GCC 8.3.1 to which I have access. My initial assessment is this failure already was fixed in a more recent release of GCC (not necessarily GCC 8.3.1).

edelsohn commented 5 years ago

I had misread the report as saying the problem was with REAL rather than single-precision COMPLEX. There does indeed seem to be a problem with COMPLEX: re-running with ulimit -s unlimited shows numerous errors.

            -->   LAPACK TESTING SUMMARY  <--
SUMMARY             nb test run   numerical error     other error
================    ===========   ===============     ===============
REAL                    1284869       1   (0.000%)      5   (0.000%)
DOUBLE PRECISION        1293457       0   (0.000%)      5   (0.000%)
COMPLEX                   20868   99053 (474.665%)   1129   (5.410%)
COMPLEX16                753628       0   (0.000%)     10   (0.001%)

--> ALL PRECISIONS      3352822   99054   (2.954%)   1149   (0.034%)

quickwritereader commented 5 years ago

If it compiles fine, I will rewrite them all explicitly using vec_vsx_ld / vec_vsx_st.

quickwritereader commented 5 years ago

#2233 - I feel it is another gcc-7 vs gcc-8 problem.

edelsohn commented 5 years ago

GCC for Power originally implemented builtins in a very old and crude manner that treated them like a black box. The PPC Toolchain team has been converting them to expand to the internal representation, when there is a correspondence like loads and stores, so that the compiler can reason about them and optimize them. This transition is occurring across the compilers that you are testing (GCC 4.8 to GCC 9).

There also were some early bugs in the swapping of vector elements in registers for little endian. It's possible that hand written assembly written with an earlier compiler assumed an incorrect element ordering and now is producing incorrect results when presented with elements in the correct (Little Endian) ordering.

Reading #2233 this seems like the same issue and it never fully was diagnosed.

With more recent releases of GCC, ideally one should be using normal assignment to load and store instead of explicit vec_vsx_ld / vec_vsx_st to allow the compiler to choose the best load and store flavors.
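Purely as an illustration of the two styles being contrasted here (simplified, hypothetical function names, not OpenBLAS code; compile with -mvsx):

#include <altivec.h>

/* Explicit intrinsic style: the load/store flavor is fixed in the code. */
static void scale4_explicit(float *x, float alpha)
{
    __vector float va = {alpha, alpha, alpha, alpha};
    __vector float vx = vec_vsx_ld(0, x);      /* explicit VSX load  */
    vec_vsx_st(vec_mul(vx, va), 0, x);         /* explicit VSX store */
}

/* Plain-assignment style: the compiler chooses the load/store instructions
 * (and handles little-endian element ordering) itself; assumes x points to
 * one full, suitably aligned vector of 4 floats. */
static void scale4_assign(float *x, float alpha)
{
    __vector float va = {alpha, alpha, alpha, alpha};
    __vector float *vp = (__vector float *)x;
    *vp = vec_mul(*vp, va);
}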

quickwritereader commented 5 years ago

I will reconsider that this time. I will try to use macros so that we can later switch between subscript ([index]) access and vec_vsx_ld, and I will also try to use vec_step when converting my loops.
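One possible shape for such macros, as a sketch only (the names VLOAD, VSTORE, and USE_VSX_INTRINSICS are made up here; vec_step is shown in its type-argument form); compile with -mvsx:

#include <altivec.h>

/* Hypothetical indirection: defining USE_VSX_INTRINSICS switches every user
 * of these macros between explicit vec_vsx_ld/vec_vsx_st and plain vector
 * subscripting without touching the loop bodies. Offsets count whole vectors. */
#ifdef USE_VSX_INTRINSICS
#define VLOAD(i, p)      vec_vsx_ld((i) * 16, (p))
#define VSTORE(v, i, p)  vec_vsx_st((v), (i) * 16, (p))
#else
#define VLOAD(i, p)      (((__vector float *)(p))[(i)])
#define VSTORE(v, i, p)  (((__vector float *)(p))[(i)] = (v))
#endif

static void vscale(float *x, int n, __vector float valpha)
{
    /* vec_step(__vector float) is 4, the number of float lanes per vector */
    for (int i = 0; i < n / vec_step(__vector float); i++)
        VSTORE(vec_mul(VLOAD(i, x), valpha), i, x);
}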

martin-frbg commented 5 years ago

Your test results for COMPLEX are much worse than what the Ubuntu gcc 8.3.0 produced (for a pre-#2260 snapshot of the develop branch) - which revision did you build? If we are up against an immature compiler, providing workarounds for known problems seems all the more appropriate to me.