OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License
6.39k stars 1.5k forks source link

elevated lapack test failures on Arm Neoverse v1 #4335

Closed juntangc closed 11 months ago

juntangc commented 11 months ago

Easybuild maintainer reported elevated lapack-test numerical errors on Arm Neoverse-V1 (344 in total for GCC-12).

https://github.com/EESSI/software-layer/blob/77c88452405cc92ddd7bdc1a37af5f8aeb972c2f/eessi-2023.06-known-issues.yml#L16

The number of numerical errors can reduced to 14 by turn off tree-vectorize with flag -fno-tree-vectorize. Since there are also concerns that this might affect BLAS/LAPACK library performance, we did a closer look at the actual numerical differences to confirm the correctness of the tests when compiling with the flag on.

Initial result -

                        -->   LAPACK TESTING SUMMARY  <--
SUMMARY                 nb test run     numerical error         other error  
================        ===========     =================       ================  
REAL                    1315683         107     (0.008%)        0       (0.000%)
DOUBLE PRECISION        1314777         66      (0.005%)        0       (0.000%)
COMPLEX                 773609          97      (0.013%)        0       (0.000%)
COMPLEX16               776246          74      (0.010%)        0       (0.000%)
--> ALL PRECISIONS      4180315         344     (0.008%)        0       (0.000%)

234 out of 344 tests are due to small difference between two runs - one to solve the full matrix (line 738) and the other one to solve partial (upper or lower portion) of the matrix (line 770). The test program expects the results to be exact match (line 780 of https://netlib.org/lapack/explore-html/d5/d99/group__single__eig_gae07927c6321c12cd0d92450eaa21ea9c.html) or the return code is set to be the inverse of ulp (a large number) to indicate a mismatch. The functions under tests are SGGEV/DGGEV/CGGEV/ZGGEV and SGGEV3/DGGEV3/CGGEV3/ZGGEV3. I have validated the differences are indeed very small for these tests (see below for one of the example); however they aren't exact matching because the total amount of work is not the same.

alphar           1 -0.352845252     -0.352844894    
alphai           1   1.09313405       1.09313369    
beta             1  0.269765615      0.269765466    
alphar           2 -0.427688032     -0.427687764    
alphai           2  -1.32500112      -1.32500112    
beta             2  0.326986194      0.326986134    
alphar           3  -3.60646695E-02  -3.60646695E-02
alphai           3   8.01697224E-02   8.01697075E-02
beta             3  0.422834128      0.422834158    
alphar           4  -7.11821616E-02  -7.11821690E-02
alphai           4 -0.158233926     -0.158233911    
beta             4  0.834563255      0.834563375    
alphar           5  -1.16443419      -1.16443443    
alphai           5   0.00000000       0.00000000    
beta             5   1.54320097       1.54320109    

96 out of remaining 110 tests are due to a problem with test driver clavhe.f. I found by accident that by adding a statement to print out N on line 408 https://netlib.org/lapack/explore-html/d3/d9a/group__complex__lin_ga25d4e26307cae0c5c897051ce64e2e91.html, these 96 errors can be fixed. This confirms that there is no issue with the library itself and there may be something wrong with the test driver.

martin-frbg commented 11 months ago

This is a known problem with the testsuite, which is increasingly worthless for anything other than testing the reference implementation of LAPACK when built against the Reference BLAS - which, to be fair, is/was its original purpose long before cpus became capable of FMA and compilers learned to perform aggressive optimization. See for example https://github.com/Reference-LAPACK/lapack/issues/732 , also the earlier easybuild issue https://github.com/easybuilders/easybuild-easyconfigs/issues/16380

brada4 commented 11 months ago

tree-vectorize emits wildly bloated code and often overflows and underflows. the tests need fuzzy comparison, current expects exact match with reference implementation, which matches for smaller than simd kernel samples.

juntangc commented 11 months ago

@martin-frbg @brada4 Thanks for sharing the information, I assume tree-vectorize may help improve and performance. Is there any data to quantize how much performance impact disabling this option?

brada4 commented 11 months ago

It miscompiles, like over/under reading/writing the fixed size arrays BLAS has everywhere.

martin-frbg commented 11 months ago

The tree-vectorizer pass is already disabled for all of LAPACK (and any tests that use Fortran code), on the C (BLAS) side this has been handled on an individual file basis where miscompilations were found (I do not think it is helpful to claim that it miscompiles "everywhere" in general). However I was under the impression easybuild already disabled it in general, or is this only on x86_64 ? It should be noted that this option became a default only in recent versions of gcc as its developers assumed it was now stable enough.

brada4 commented 11 months ago

If you paste eg. daxpy into godbolt.org you see no-tree.... generates more or less logical code, like vector loads muls adds, if you enable tree vectorizer, it generates 10 screens of jump-laden code, around 2nd screen it does lea arg % 8 for example.

juntangc commented 11 months ago

It's good to know tree vectorize generate weird code. Strange enough, I only notice tree-vectroize cause numerical errors for test #5 for various flavors of GGEV and GGEV3. They are unlikely the only functions that require exact match of output results.

If we have individual control of compiler flag for fortran source code similar to c functions, we can selectively disable tree-vectorize for just a few subroutines.

martin-frbg commented 11 months ago

if you enable tree vectorizer, it generates 10 screens of jump-laden code, around 2nd screen it does lea arg % 8 for example.

I agree that some of its choices are a bit counter-intuitive, but I do not remember seeing anything actually wrong except in the (few) files that earned a #pragma optimize GCC "no-tree-vectorize ? However I realized in the meantime that a makefile build will still have the tree-vectorizer enabled for building the LAPACK testsuite, as the FFLAGS_DRV variable for the test drivers is not cleaned (unlike the FFLAGS for building LAPACK itself). Will fix this later...

martin-frbg commented 11 months ago

If we have individual control of compiler flag for fortran source code similar to c functions, we can selectively disable tree-vectorize for just a few subroutines.

Unfortunately I am not aware of there being any gfortran directive matching the gcc "optimize" pragma (the only remotely similar would be !GCC$ NOVECTOR preceding individual loops). One would have to modify the makefile to split off individual compilation units with different compiler flags

juntangc commented 11 months ago

If we have individual control of compiler flag for fortran source code similar to c functions, we can selectively disable tree-vectorize for just a few subroutines.

Unfortunately I am not aware of there being any gfortran directive matching the gcc "optimize" pragma (the only remotely similar would be !GCC$ NOVECTOR preceding individual loops). One would have to modify the makefile to split off individual compilation units with different compiler flags

There are concerns that some performance may be lost due to -fno-tree-vectorize. It's probably best to apply the flag to only the LAPACK functions (GGEV and GGEV3) that runs into numerical issues to limit the impact.

martin-frbg commented 11 months ago

Do I read that correctly that you are still at 0.3.21 from a year ago ? I just did a test run on Graviton3 with the current develop branch and got only two failures (one in SGGES3 and one somewhere in DST) no matter the tree-vectorizer option used with GCC11.4 .

juntangc commented 11 months ago

Do I read that correctly that you are still at 0.3.21 from a year ago ? I just did a test run on Graviton3 with the current develop branch and got only two failures (one in SGGES3 and one somewhere in DST) no matter the tree-vectorizer option used with GCC11.4 .

Yes. The ask (https://github.com/EESSI/software-layer/blob/77c88452405cc92ddd7bdc1a37af5f8aeb972c2f/eessi-2023.06-known-issues.yml#L16) from EasyBuild is to figure out why there are elevated fails with 0.3.21 with GCC 12. For 0.3.21, tree-vectorize is on. In the latest version, tree-vectorize is turned off.

brada4 commented 11 months ago

Because it is selectively disabled now, for old versions of OpenBLAS you need to disable tree-vectorize if using fresh GCC. eg. https://github.com/OpenMathLib/OpenBLAS/commit/739c3c44a77d87d1b08de59bf868250683c0f755

martin-frbg commented 11 months ago

tree-vectorize was actually turned off in response to an easybuild issue, but a lot has changed since 0.3.21 (including some of the LAPACK tests)