grisuthedragon closed this issue 4 years ago
Interesting - this file is indeed new in 0.3.6, but (from debugging #2233) the ICE in question appears to be fixed in more recent compilers. (Unfortunately the source line 288 from the error message tells us nothing as it is simply the closing brace on the last line of the file).
If I find some time I will take a closer look inside the routines and why they are crashing.
I see now that this has already been reported (against gcc 7.1.0) by susilehtlola as https://bugzilla.redhat.com/show_bug.cgi?id=1740539 (From the tests for #2233, the code compiles with gcc 5.4 and again from 7.3 onwards (7.3.0/8.2.1/9.1), so it could be that something was fixed, broken, and fixed again)
Reproduced in the unicamp.br minicloud. Preprocessed source of isamax.c is here:
isamax_preprocessed.txt if somebody wants to add it to the redhat bugzilla entry (@susilehtola)
Also, the problem goes away at -O0 (instead of our default -Ofast), so it can be worked around with a #pragma GCC optimize "O0" in the affected files.
Also of note is that in my tests (gcc-4.8.5 20150623 Red Hat 4.8.5-36 on CentOS 7) it is only the single precision versions (isamin/isamax.c and icamin/icamax.c) that cause an ICE.
@martin-frbg thanks, uploaded. However, in my experience there's very little movement on RHEL bugzilla tickets.
If someone has a RHEL subscription, they should make an issue about the bug with Red Hat; things get moving when a paying customer complains.
I did some experiments with isamax.c and found that the snippet around line 111,
temp0 += temp1;
temp0 += temp1; //temp0+32
and around line 170
quadruple_indices = vec_sel( quadruple_indices,ind2,r3);
quadruple_values= vec_sel(quadruple_values,vv0,r3);
temp0+=temp1;
temp0+=temp1; //temp0+32
seem to disturb the compiler.
Update: narrowed the range a bit.
Could you comment/remove some lines to find which parts cause this error?
@quickwritereader That was exactly what I did. Commenting out the above-mentioned lines makes the code compile.
In addition, could you remove that C++-style comment or replace it with a C-style /* */ one?
The C++-style comment is already in OpenBLAS and not from me. I tried to change "+=" to "temp0 = temp0 + temp1" because I had this problem in ancient times with gcc and the vectorization of a piece of code, but that did not change anything.
Maybe rename/refactor temp0 to another name; feel free to refactor and to change the comment blocks to C style. For example: temp0 = vec_add( temp1 , vec_add(temp0,temp1));
Renaming and replacing the addition does not help.
From which versions of gcc does it disappear?
Remove all register hints too.
Also, rewrite the + operations in this style: result = vec_add(operand1, operand2)
I tried it with gcc 5.4 and there it works. The problem is that, as long as RHEL7 with gcc 4.8.5 is the recommended OS from IBM for the POWER8, it should work with the old gcc as well.
Rewriting the + to vec_add does not help either.
Strangely I saw no effect from removing these lines (on the gcc 4.8.5 build, that is), and I have not managed to narrow it down to less than the entire isamax_kernel_64 routine. My current impression is that this is some more general issue with the register __vector declarations, or perhaps an interaction with OpenMP (at least I found some resolved gcc/gfortran issues about failures of build_int_cst_wide in conjunction with OpenMP from the gcc 4.x timeframe).
Since the gcc 4.x series will eventually go away, a possible fix would be to deactivate these kernels for GCC < 5.x and bypass the problem that way.
OK, let's replace the double addition with a single one, introducing a new variable before the for loop, after the temp1 = temp1 << 1; line:
__vector unsigned int add_32= temp1<<1;
or
__vector unsigned int add_32= {32,32,32,32};
then replace the two additions with one:
temp0=vec_add(temp0,add_32);
AFAIK there should be enough registers to hold the additional constant.
As long as we do not suspect anything fundamentally wrong with these files, perhaps the easiest solution is to wrap the #pragma I suggested above in a GCC version check - probably
#if defined(__GNUC__) && __GNUC__ < 6
to be on the safe side. There is no need to deactivate these kernels as such, just the default -Ofast compiler optimization for them.
Given the recent issues, I see inline assembly would have been the better choice, but it doubles the work for each new architecture.
I now see that a gcc 8.3.0 build on power8 (Ubuntu 19.04) returns 200+ errors in the single-precision complex LAPACK tests. Disabling optimization of icamin/icamax.c works around this as well (and incidentally also removes the single failure in CTFSM that we have been seeing across all architectures)
I will try to check those this week.
More importantly (and thanks to isuruf's help with debugging our Travis script), DYNAMIC_ARCH=1 with gcc 5.4 fails in the power9 part with another ICE:
./kernel/power/caxpy.c: In function ‘caxpy_kernel_16’:
../kernel/power/caxpy.c:62:33: internal compiler error: in rs6000_emit_le_vsx_move, at config/rs6000/rs6000.c:9157
register __vector float vy_0 = vec_vsx_ld( offset_0 ,vptr_y ) ;
^
and this time the #pragma GCC optimize "O0" does not help. (This is probably why the CI for #2243 had failed - I had merged your PR as I could not reproduce the error in the unicamp minicloud, but now I realize I had only built for the power8 target with this old gcc.)
Update: this error does not occur with (at least) gcc 8.3.0.
Disabling features to work around issues in GCC 4.8.5 is okay, although IBM should work with RH to backport the fixes. But ICEs in GCC 7/8 should be reported and fixed in GCC. None of this was reported to the GCC community.
@martin-frbg how about using gcc-generated optimized assembly files for both power8 and power9 directly? I intended to convert them to inline assembly but had to postpone that.
Yes, I guess that would work (assuming the generated assembly is not totally unreadable from a human perspective) - once we are convinced that we have found a working version, that is. 8.3.0 miscompiling icamin/icamax certainly complicates matters, and I suspect creating self-contained test cases for the GCC folks may be a non-trivial task as well.
It's not great to create too much hand-written assembly if the compiler produces efficient code. The compiler will adapt to future processors. Hand-written assembly code creates another location to update.
If the testcase cannot be reduced, please open a GCC Bugzilla with the pre-processed source file.
@martin-frbg could you check if it passes after reverting the caxpy fix? I had to use intrinsics because of the power8 error, if you remember.
Also, you can try to use creduce program to create a minimal testcase. Or upload the entire pre-processed source file and the IBM Toolchain team will reduce it.
@martin-frbg thanks for generating and merging the assembly kernels using conditionals. One could further optimize the generated assemblies later.
Closed by the commit; reopening, as the underlying compiler issue has not yet been reported to the gcc team (while the related RH bugzilla entry has seen no activity). It might also make sense to review my choice of a minimum gcc version for PPC9 support - perhaps turning this into a choice between the original and intrinsics versions of the affected code(s) would work as well.
I believe that the original bug already was fixed. It works in GCC 5.4, 7.3, and 8.x because of backports from GCC 8. For some reason GCC 6.x was overlooked.
IBM will work with RH to backport the patch to GCC 4.8.5 RH in RHEL.
I am concerned about the second bug:
I now see that a gcc 8.3.0 build on power8 (Ubuntu 19.04) returns 200+ errors in the single-precision complex LAPACK tests. Disabling optimization of icamin/icamax.c works around this as well (and incidentally also removes the single failure in CTFSM that we have been seeing across all architectures)
The reduced test corresponds to the first bug. The second bug separately needs to be opened upstream in GCC Bugzilla. It appears to be a separate, undiagnosed and unfixed issue.
There is also the ICE with GCC 5.4 compiling caxpy.c with -mcpu=power9 (no problem with the exact same source and -mcpu=power8), mentioned above. Preprocessed source: caxpy.txt
I don't have access to GCC 5.4.
IBM Advance Toolchain 9.0 includes a patched version of GCC 5.5. I cannot reproduce the failure, although you don't provide the exact invocation. IBM AT9.0 is far out of date and no longer supported. GCC 5.x is out of date and no longer supported.
The comment above about GCC 5.4 (https://github.com/xianyi/OpenBLAS/issues/2254#issuecomment-532110686) reports that it does not fail with GCC 8.3. IBM Advance Toolchain 12.0 includes GCC 8.3, and I similarly cannot elicit an error.
Why does the failure on GCC 5.4 matter? Other than GCC 4.8.5 RH, I don't understand why any of the older, out of date releases of GCC matter, nor why OpenBLAS should make any changes to accommodate them.
Is Unicamp Power Minicloud using GCC 5.4?
GCC 5.4 is or was what Ubuntu 16 came with, which just happens to be what our current Travis CI job uses. One of the twenty or so Unicamp VM choices comes with Ubuntu 16.04, which allowed me to reproduce and debug the initial CI build failure. Other than that, the GCC 5.4 ICE probably does not matter, as most affected users should be able to update their compiler. And an obvious compile failure is certainly better in this regard than a silent miscompilation.
Can you please produce pre-processed source code for the icamax.c error with a specific compiler invocation for GCC 8.3? You can open a GCC Bugzilla issue or I can open one.
Preprocessed source : icamax.txt
Compiler options are -Ofast -mcpu=power8 -mtune=power8 -mvsx -malign-power -fno-fast-math -fopenmp -m64
"Bad" intermediate assembly from this: icamax_badS.txt
"Good" intermediate assembly from overriding the -Ofast with a #pragma GCC optimize "O0": icamax_goodS.txt
GCC info: gcc version 8.3.0 (Ubuntu 8.3.0-6ubuntu1)
Do you have a driver program that demonstrates the wrong result? Have you tried anything other than O0? The differences are vast between highly optimized and no optimization, so it is difficult to ascertain what is miscompiled.
No simple driver program so far; this problem was originally exposed only by running the LAPACK testsuite included in the netlib reference implementation. (Which is why I wrote "silent miscompilation" of the library earlier.) I do not think I experimented with other optimization levels for this particular case - the -O0 workaround was already in place for the older gcc versions at the time (and there -O1 was already sufficient to cause the original ICE that started all this). I really should not be spending time on this right now, but I'll see if I can do a quick test at -O1.
I can try to run through the variants if you can provide the incantation to run the LAPACK testsuite in OpenBLAS.
It really would be helpful to try -O3 and -O2 - something closer to -Ofast.
It is simply make; make lapack-test - "good" shows no numeric errors for single-precision complex, "bad" has 275 of them. And it already goes wrong at -O1; I'll see if I can narrow it down to one of the 40-odd options enabled by that.
It fails with -O1? Okay. The individual flags do not control all of the -O1 options, so you may not be able to find the exact optimization.
Indeed, none of the options documented in https://gcc.gnu.org/onlinedocs/gcc-8.3.0/gcc/Optimize-Options.html as being enabled by -O1 triggers the test failures. Trying now with the options flagged in a diff of the gcc -Q --help=optimize output at -O0 and -O1. (Update: no hit there either.)
First, Abdel always recommended that I build OpenBLAS with NO_LAPACK=1.
I built OpenBLAS on CentOS 7 running on a Power8 system in the GNU Compile Farm (gcc112) using IBM AT12.0, which is GCC 8.3.1 with additional patches backported.
The initial results are
--> LAPACK TESTING SUMMARY <--
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 1284869 1 (0.000%) 5 (0.000%)
DOUBLE PRECISION 1039157 1 (0.000%) 3 (0.000%)
COMPLEX 14868 67620 (454.802%) 111 (0.747%)
COMPLEX16 424579 0 (0.000%) 2 (0.000%)
--> ALL PRECISIONS 2763473 67622 (2.447%) 121 (0.004%)
I see no excessive failures for REAL. The failures for COMPLEX seem to be related to memory, and I noticed the comments in the documentation about ulimit -s unlimited. I am re-running the tests with that setting and will report the results when the test completes.
At the moment I cannot reproduce the failures with GCC 8.3.1 to which I have access. My initial assessment is this failure already was fixed in a more recent release of GCC (not necessarily GCC 8.3.1).
I misread that the problem was with REAL and not single precision COMPLEX. There does seem to be a problem with COMPLEX. Re-running with ulimit -s unlimited shows numerous errors.
--> LAPACK TESTING SUMMARY <--
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 1284869 1 (0.000%) 5 (0.000%)
DOUBLE PRECISION 1293457 0 (0.000%) 5 (0.000%)
COMPLEX 20868 99053 (474.665%) 1129 (5.410%)
COMPLEX16 753628 0 (0.000%) 10 (0.001%)
--> ALL PRECISIONS 3352822 99054 (2.954%) 1149 (0.034%)
If it compiles fine, I will rewrite them all explicitly using vec_vsx_ld / vec_vsx_st.
GCC for Power originally implemented builtins in a very old and crude manner that treated them like a black box. The PPC Toolchain team has been converting them to expand to the internal representation, when there is a correspondence like loads and stores, so that the compiler can reason about them and optimize them. This transition is occurring across the compilers that you are testing (GCC 4.8 to GCC 9).
There also were some early bugs in the swapping of vector elements in registers for little endian. It's possible that hand written assembly written with an earlier compiler assumed an incorrect element ordering and now is producing incorrect results when presented with elements in the correct (Little Endian) ordering.
Reading #2233 this seems like the same issue and it never fully was diagnosed.
With more recent releases of GCC, ideally one should be using normal assignment to load and store instead of explicit vec_vsx_ld / vec_vsx_st to allow the compiler to choose the best load and store flavors.
I will reconsider that this time. I will try to use macros so that we could change it to subscripting ([index]) or vec_vsx_ld in the future, and I will also try to use vec_step for converting my loops.
Your test results for COMPLEX are much worse than what the Ubuntu gcc 8.3.0 produced (for a pre-#2260 snapshot of the develop branch); which revision did you build? If we are up against an immature compiler, providing workarounds for known problems seems all the more appropriate to me.
I updated my installations on my POWER8 machine, and from version 0.3.6 up to the current development branch, the build process fails with:
OS Details: CentOS 7.6 ppc64el, gcc 4.8.5, gfortran 4.8.5, IBM POWER8 LC822