OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Test failure on PPC: Nonsymmetric-Generalized-Eigenvalue-Problem-driver-EIG/xeigtsts #4415

Closed Flamefire closed 8 months ago

Flamefire commented 8 months ago

Since 0.3.23, the LAPACK test suite reports 1 "other error" for me, in addition to ~41 numerical errors.

The summary (in this case using 0.3.26) looks like this:

                        -->   LAPACK TESTING SUMMARY  <--
SUMMARY                 nb test run     numerical error         other error  
================        ===========     =================       ================  
REAL                    1559622         29      (0.002%)        1       (0.000%)        
DOUBLE PRECISION        1570470         0       (0.000%)        0       (0.000%)        
COMPLEX                 1028638         12      (0.001%)        0       (0.000%)        
COMPLEX16               1030797         0       (0.000%)        0       (0.000%)        

--> ALL PRECISIONS      5189527         41      (0.001%)        1       (0.000%) 

Searching the test output for that "other error", I found:

Testing REAL              Nonsymmetric-Generalized-Eigenvalue-Problem-driver-EIG/xeigtsts < sgd.in > sgd.out  SDRGES: SGGES returned INFO=     9.
  SGS drivers:      1 out of   1555 tests failed to pass the threshold
  SGV drivers:     20 out of   1092 tests failed to pass the threshold
 passed: 7830
failing to pass the threshold: 21
Info Error: 1

I was using make lapack-test BINARY='64' CC='gcc' FC='gfortran' MAKE_NB_JOBS='-1' USE_OPENMP='1' USE_THREAD='1' for this.

Any idea what this could be and how to fix it?
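The search described above can be sketched in a few lines of Python. This is only an illustration, not part of OpenBLAS or the LAPACK test scripts: the function name `find_info_errors` and the regular expression are assumptions based solely on the one message format quoted in this report.

```python
import re

# Matches messages such as "SDRGES: SGGES returned INFO=     9."
# (pattern inferred from the single log line quoted above; an assumption)
INFO_LINE = re.compile(r"(\w+):\s+(\w+) returned INFO=\s*(-?\d+)")


def find_info_errors(log_text):
    """Return (test_driver, routine, info) for every 'returned INFO=' message."""
    return [
        (m.group(1), m.group(2), int(m.group(3)))
        for m in INFO_LINE.finditer(log_text)
    ]


sample = (
    "Testing REAL              Nonsymmetric-Generalized-Eigenvalue-Problem-"
    "driver-EIG/xeigtsts < sgd.in > sgd.out  "
    "SDRGES: SGGES returned INFO=     9.\n"
    "  SGS drivers:      1 out of   1555 tests failed to pass the threshold\n"
)

print(find_info_errors(sample))  # [('SDRGES', 'SGGES', 9)]
```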

martin-frbg commented 8 months ago

INFO=9 from SGGES appears to be "QZ algorithm failed to converge". Unfortunately, some of the tests in the LAPACK testsuite are quite fragile against small numerical differences compared to the unoptimized Reference BLAS. Which flavor of PPC is this?

Flamefire commented 8 months ago

> INFO=9 from SGGES appears to be "QZ algorithm failed to converge". Unfortunately, some of the tests in the LAPACK testsuite are quite fragile against small numerical differences compared to the unoptimized Reference BLAS. Which flavor of PPC is this?

This is a Power9 CPU (PowerNV 8335-GTX); cpuinfo shows "POWER9, altivec supported".

Flamefire commented 8 months ago

Ouch, this is tricky. We have the following patch:

--- OpenBLAS-0.3.23/lapack-netlib/TESTING/sgd.in.orig   2023-06-06 11:01:50.512947527 +0000
+++ OpenBLAS-0.3.23/lapack-netlib/TESTING/sgd.in        2023-06-06 11:02:05.318078733 +0000
@@ -1,6 +1,6 @@
 SGS               Data for the Real Nonsymmetric Schur Form Driver
 5                 Number of matrix dimensions
-2 6 10 12 20 30   Matrix dimensions
+6 2 10 12 20 30   Matrix dimensions
 1 1 1 2 1         Parameters NB, NBMIN, NXOVER, NS, NBCOL
 10                Threshold for test ratios
 .TRUE.            Put T to test the error exits

With this patch the mentioned issue happens on PPC. Without it, it happens on AArch.

martin-frbg commented 8 months ago

I remember playing with varying sequences in some test inputs before, but it does not really make sense if these are subsequent runs with varied matrix dimensions - unless there are uninitialized variables involved (never zeroed, or not zeroed between runs)...

Flamefire commented 8 months ago

It turns out the actual AArch issue was in the double precision variant and was fixed by removing the "6", as you suggested in https://github.com/OpenMathLib/OpenBLAS/issues/4032#issuecomment-1739984814

I tried the same for sgd and the test runs successfully on PPC. I.e. I now have this patch which I'll test on a larger number of CPU architectures:

--- OpenBLAS-0.3.23/lapack-netlib/TESTING/dgd.in.orig   2023-09-29 08:05:53.089031858 +0200
+++ OpenBLAS-0.3.23/lapack-netlib/TESTING/dgd.in    2023-09-29 08:08:32.234680735 +0200
@@ -1,6 +1,6 @@
 DGS               Data for the Real Nonsymmetric Schur Form Driver
 5                 Number of matrix dimensions
-2 6 10 12 20 30   Matrix dimensions
+2 10 12 20 30     Matrix dimensions
 1 1 1 2 1         Parameters NB, NBMIN, NXOVER, NS, NBCOL
 10                Threshold for test ratios
 .TRUE.            Put T to test the error exits
--- OpenBLAS-0.3.23/lapack-netlib/TESTING/sgd.in.orig   2023-06-06 11:01:50.512947527 +0000
+++ OpenBLAS-0.3.23/lapack-netlib/TESTING/sgd.in        2023-06-06 11:02:05.318078733 +0000
@@ -1,6 +1,6 @@
 SGS               Data for the Real Nonsymmetric Schur Form Driver
 5                 Number of matrix dimensions
-2 6 10 12 20 30   Matrix dimensions
+2 10 12 20 30     Matrix dimensions
 1 1 1 2 1         Parameters NB, NBMIN, NXOVER, NS, NBCOL
 10                Threshold for test ratios
 .TRUE.            Put T to test the error exits

martin-frbg commented 8 months ago

OK, that makes it a little better. One could claim that the matrix of size 6 is somehow ill-conditioned (I seem to recall there actually were a few issues in Reference-LAPACK with "broken" pencils for GGEV/GGES).

Flamefire commented 8 months ago

Yes, that matches the observation of @bartoldeman, who did an earlier patch with the comment:

> Avoid a nearly singular matrix in lapack testing, which can trigger an error depending on FMA.

How can one understand that change/line exactly? You wrote

> remove the "6" from the first list of matrix dimensions in line 3 of that file (6 eigenvalues + error code 3 => INFO=9)

Also, "Number of matrix dimensions" is 5, but "Matrix dimensions" lists 6 values (only 5 after the new patch above). So I'm curious what exactly is specified there.
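The "6 eigenvalues + error code 3 => INFO=9" arithmetic quoted above can be spelled out as follows. This is a sketch based on my reading of the INFO description in the SGGES header comments in Reference-LAPACK; the helper name `describe_sgges_info` is hypothetical and the wording of the messages is paraphrased, so check the routine's own documentation before relying on it.

```python
def describe_sgges_info(info, n):
    """Interpret the INFO value returned by SGGES for an N-by-N problem.

    Paraphrased from the SGGES header comments (an assumption, not an
    official API): values 1..N and values above N mean different things,
    so INFO cannot be decoded without knowing N.
    """
    if info == 0:
        return "successful exit"
    if info < 0:
        return f"argument {-info} had an illegal value"
    if info <= n:
        return "QZ iteration failed"
    above_n = {
        1: "N+1: failure in SHGEQZ other than QZ iteration",
        2: "N+2: after reordering, roundoff changed some complex eigenvalues",
        3: "N+3: reordering failed in STGSEN",
    }
    return above_n.get(info - n, "undocumented value")


# With the 6x6 test matrix, INFO=9 decodes as N+3.
print(describe_sgges_info(9, 6))   # N+3: reordering failed in STGSEN
# The same INFO for a larger matrix would mean a QZ failure instead.
print(describe_sgges_info(9, 10))  # QZ iteration failed
```

Under this reading, INFO=9 only points at a QZ convergence failure when N is at least 9; for the size-6 matrix it points at the reordering step.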

bartoldeman commented 8 months ago

The "5" could be seen as an error in the original test file; what happens if you use:

5                 Number of matrix dimensions
2 6 10 12 20 30   Matrix dimensions

is that it'll use 2 6 10 12 20 but ignore 30. It's just Fortran code that reads 5 numbers, then discards the rest of the line.
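A rough Python emulation of that read behavior, for illustration only. The real test driver is Fortran and, as far as I understand, uses a list-directed read for this line; the function name `read_dimensions` and the way the header is passed in are assumptions made for this sketch.

```python
def read_dimensions(header_lines):
    """Emulate how the test driver picks up the matrix dimensions.

    header_lines holds the first three lines of an input file such as
    sgd.in: title, number of dimensions, and the dimension values.
    Only the first `count` values are consumed; the remainder of the
    line (extra numbers and the trailing comment) is ignored.
    """
    count = int(header_lines[1].split()[0])
    tokens = header_lines[2].split()[:count]
    return [int(tok) for tok in tokens]


original = [
    "SGS               Data for the Real Nonsymmetric Schur Form Driver",
    "5                 Number of matrix dimensions",
    "2 6 10 12 20 30   Matrix dimensions",
]
patched = original[:2] + ["2 10 12 20 30     Matrix dimensions"]

print(read_dimensions(original))  # [2, 6, 10, 12, 20]
print(read_dimensions(patched))   # [2, 10, 12, 20, 30]
```

So removing the "6" does not shorten the test: it swaps the size-6 matrix for the size-30 one that was previously ignored.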

martin-frbg commented 8 months ago

Yes, six values but only the first five are actually used. Maybe matrix size 30 was simply too big for routine testing on whatever ancient hardware was in use when that test was conceived. There are a few testsuite-related issues open in Reference-LAPACK, including at least one by bartoldeman. But beyond agreeing that the testsuite seems fragile and pointing to the netlib FAQ, which already discusses "minor" failures caused by the use of optimized libraries, there tends to be little activity (though I could blame myself here as well).

(oops - I simply took longer to type my response than bartoldeman :) )