Flamefire closed this issue 8 months ago
INFO= 9 from SGGES appears to be "QZ algorithm failed to converge". Unfortunately, some of the tests in the LAPACK testsuite are quite fragile against small numerical differences compared to the unoptimized Reference BLAS. Which flavor of PPC is this?
This is a Power9 CPU (PowerNV 8335-GTX), cpuinfo shows "POWER9, altivec supported"
Ouch, this is tricky. We have the following patch:
--- OpenBLAS-0.3.23/lapack-netlib/TESTING/sgd.in.orig 2023-06-06 11:01:50.512947527 +0000
+++ OpenBLAS-0.3.23/lapack-netlib/TESTING/sgd.in 2023-06-06 11:02:05.318078733 +0000
@@ -1,6 +1,6 @@
SGS Data for the Real Nonsymmetric Schur Form Driver
5 Number of matrix dimensions
-2 6 10 12 20 30 Matrix dimensions
+6 2 10 12 20 30 Matrix dimensions
1 1 1 2 1 Parameters NB, NBMIN, NXOVER, NS, NBCOL
10 Threshold for test ratios
.TRUE. Put T to test the error exits
With this patch the mentioned issue happens on PPC. Without it, it happens on AArch.
I remember playing with varying sequences in some test inputs before, but it does not really make sense if these are subsequent runs with varied matrix dimensions - unless there are uninitialized variables involved (never zeroed, or not zeroed between runs)...
It turns out the actual AArch issue was in the double-precision variant and was fixed by removing the "6", as you suggested in https://github.com/OpenMathLib/OpenBLAS/issues/4032#issuecomment-1739984814
I tried the same for sgd, and the test runs successfully on PPC. So I now have this patch, which I'll test on a larger number of CPU architectures:
--- OpenBLAS-0.3.23/lapack-netlib/TESTING/dgd.in.orig 2023-09-29 08:05:53.089031858 +0200
+++ OpenBLAS-0.3.23/lapack-netlib/TESTING/dgd.in 2023-09-29 08:08:32.234680735 +0200
@@ -1,6 +1,6 @@
DGS Data for the Real Nonsymmetric Schur Form Driver
5 Number of matrix dimensions
-2 6 10 12 20 30 Matrix dimensions
+2 10 12 20 30 Matrix dimensions
1 1 1 2 1 Parameters NB, NBMIN, NXOVER, NS, NBCOL
10 Threshold for test ratios
.TRUE. Put T to test the error exits
--- OpenBLAS-0.3.23/lapack-netlib/TESTING/sgd.in.orig 2023-06-06 11:01:50.512947527 +0000
+++ OpenBLAS-0.3.23/lapack-netlib/TESTING/sgd.in 2023-06-06 11:02:05.318078733 +0000
@@ -1,6 +1,6 @@
SGS Data for the Real Nonsymmetric Schur Form Driver
5 Number of matrix dimensions
-2 6 10 12 20 30 Matrix dimensions
+2 10 12 20 30 Matrix dimensions
1 1 1 2 1 Parameters NB, NBMIN, NXOVER, NS, NBCOL
10 Threshold for test ratios
.TRUE. Put T to test the error exits
OK, that makes it a little better: one could claim that the matrix of size 6 is somehow ill-conditioned (I seem to recall there actually were a few issues in Reference-LAPACK with "broken" pencils for GGEV/GGES).
Yes, that matches the observation of @bartoldeman, who did an earlier patch with the comment:
Avoid a nearly singular matrix in lapack testing, which can trigger an error depending on FMA.
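To illustrate how FMA can change the outcome for a nearly singular matrix, here is a minimal Python sketch. It is not taken from LAPACK or OpenBLAS: the 2x2 matrix and the helper names `det2_unfused` and `det2_fused` are made up for illustration, it uses double rather than single precision, and the fused multiply-add is emulated with exact rational arithmetic instead of a hardware instruction.

```python
from fractions import Fraction


def det2_unfused(a, b, c, d):
    """a*d - b*c where each product is rounded to double before subtracting."""
    return a * d - b * c


def det2_fused(a, b, c, d):
    """Emulate fma(a, d, -(b*c)): a*d stays exact, one rounding at the end."""
    bc = b * c  # this product is still rounded, as in the unfused case
    return float(Fraction(a) * Fraction(d) - Fraction(bc))


eps = 2.0 ** -30
# Hypothetical nearly singular matrix [[1+eps, 1], [1, 1-eps]];
# its exact determinant is -eps**2 = -2**-60.
a, b, c, d = 1.0 + eps, 1.0, 1.0, 1.0 - eps

print(det2_unfused(a, b, c, d))  # 0.0: the matrix looks exactly singular
print(det2_fused(a, b, c, d))    # -2**-60: the matrix looks nonsingular
```

Without fusion the product `a*d` rounds to exactly 1.0 and the determinant collapses to zero; with a single final rounding the tiny nonzero value survives. A test matrix that sits this close to singular can therefore behave differently depending on whether the optimized kernels use FMA.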
How can one understand that change/line exactly? You wrote
remove the "6" from the first list of matrix dimensions in line 3 of that file (6 eigenvalues + error code 3 => INFO=9)
Then there is "Number of matrix dimensions" = 5, but there are 6 values for "Matrix dimensions" (after the new patch above, only 5). So I'm curious what exactly is specified there.
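The "6 eigenvalues + error code 3 => INFO=9" arithmetic quoted above can be sketched as a small decoder. This is only an illustration based on my reading of the INFO description in the xGGES header comments (values 1..N versus N+1, N+2, N+3); the function name `describe_gges_info` is hypothetical, and the wording of the messages is paraphrased, so check `sgges.f` before relying on it.

```python
def describe_gges_info(info: int, n: int) -> str:
    """Paraphrase the meaning of an xGGES INFO value for an n-by-n pencil."""
    if info == 0:
        return "successful exit"
    if info < 0:
        return f"argument {-info} had an illegal value"
    if info <= n:
        return "QZ iteration failed"
    # Values above n encode n plus a small error code (assumed 1..3).
    extra = {
        1: "failure other than QZ iteration in xHGEQZ",
        2: "eigenvalues no longer satisfy the selection after reordering",
        3: "reordering failed in xTGSEN",
    }
    return extra.get(info - n, "unknown INFO value")


# The same INFO=9 reads differently depending on the matrix size.
for n in (6, 10):
    print(n, describe_gges_info(9, n))
```

Under this reading, INFO=9 only means "QZ iteration failed" when the matrix dimension is at least 9; for the size-6 matrix it decodes as 6 + 3.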
The "5" could be seen as an error in the original test file. With
5 Number of matrix dimensions
2 6 10 12 20 30 Matrix dimensions
the test uses 2 6 10 12 20 but ignores 30. It's just Fortran code that reads 5 numbers, then discards the rest of the line.
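The reading behaviour described above can be emulated in a few lines of Python. This is a sketch of the idea only, not the actual Fortran test driver; the function name `read_dimensions` is made up, and the input lines are copied from the patches in this thread.

```python
def read_dimensions(lines):
    """Read a count from the first line, then that many integers from the next."""
    count = int(lines[0].split()[0])
    tokens = lines[1].split()
    used = [int(t) for t in tokens[:count]]
    # Everything after the first `count` tokens is never parsed,
    # so a sixth number and the trailing description are both discarded.
    ignored = tokens[count:]
    return used, ignored


original = ["5 Number of matrix dimensions",
            "2 6 10 12 20 30 Matrix dimensions"]
patched = ["5 Number of matrix dimensions",
           "2 10 12 20 30 Matrix dimensions"]

print(read_dimensions(original)[0])  # [2, 6, 10, 12, 20]
print(read_dimensions(patched)[0])   # [2, 10, 12, 20, 30]
```

So removing the "6" does not shorten the test: the count stays at 5, and size 30 moves into the set of dimensions that actually get tested.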
Yes, six values, but only the first five are actually used. Maybe matrix size 30 was simply too big for routine testing on whatever ancient hardware was in use when that test was conceived. There are a few testsuite-related issues open in Reference-LAPACK, including at least one by bartoldeman. But beyond agreeing that the testsuite seems fragile and pointing to the netlib FAQ, which already discusses "minor" failures caused by the use of optimized libraries, there tends to be little activity (I could blame myself here as well, though).
(oops - I simply took longer to type my response than bartoldeman :) )
Since 0.3.23 I have (besides ~41 numerical errors) also 1 "other error".
The summary (in this case using 0.3.26) looks like this:
And searching for that "other error" I found
I was using
make lapack-test BINARY='64' CC='gcc' FC='gfortran' MAKE_NB_JOBS='-1' USE_OPENMP='1' USE_THREAD='1'
for this. Any idea what this could be and how to fix it?