Reference-LAPACK / lapack

LAPACK development repository

Less accuracy due to FMAs #1031

Closed: ACSimon33 closed this issue 2 months ago

ACSimon33 commented 3 months ago

Hi,

I think the topic of FMA instructions has come up a couple of times before, but my question is: what kind of accuracy loss do we consider acceptable, given that this is a reference implementation? The point where I stumbled across it was the eigenvalue decomposition of the 2x2 matrix

   A = [  4 1 ]
       [ -4 0 ]

The eigenvalues should be [2.0, 2.0]. When the compiler uses FMAs we get [1.999999988777289 2.000000011222711] instead. With gfortran on Linux one has to enable the native instruction set (-march=native) to trigger this, but on arm64 we get the slightly wrong results even with the default compiler flags. To get the correct values one has to use -ffp-contract=off, which disables FMAs.

Should we maybe give users a build option which disables FMAs?

For everyone interested, here is a quick reproducer of the example above:

#include "lapacke.h"
#include <stdio.h>

int main() {

  int matrix_layout = LAPACK_COL_MAJOR;
  char balanc = 'P';
  char jobvl = 'N';
  char jobvr = 'N';
  char sense = 'N';
  lapack_int n = 2;
  lapack_int lda = 2;
  lapack_int ldvl = 2;
  lapack_int ldvr = 2;

  // Column-major storage of A = [ 4 1; -4 0 ]
  double A[n*lda];
  A[0] = 4;
  A[1] = -4;
  A[2] = 1;
  A[3] = 0;

  double WR[n];
  double WI[n];
  double VL[n*ldvl];
  double VR[n*ldvr];

  lapack_int ilo[1];
  lapack_int ihi[1];
  double scale[n];
  double abnrm[1];
  double rconde[1];
  double rcondv[1];

  // Eigenvalues only (JOBVL = JOBVR = 'N'); BALANC = 'P' permutes without scaling
  LAPACKE_dgeevx(matrix_layout, balanc, jobvl, jobvr, sense,
                 n, A, lda, WR, WI,
                 VL, ldvl, VR, ldvr,
                 ilo, ihi, scale, abnrm, rconde, rcondv);

  printf("%1.15f %1.15f", WR[0], WR[1]);

  return 0;
}
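(To build the reproducer, assuming LAPACKE is installed, something like gcc repro.c -llapacke -llapack -lm should work; exact link lines vary by installation. Comparing a build with -O2 -march=native against one that adds -ffp-contract=off makes the FMA effect visible on x86-64, as described above.)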

The problematic line that causes the accuracy loss is line 253 in dlanv2.f:

B = BB*CS + DD*SN

This expression should evaluate to zero but instead evaluates to -2.5189846806723163E-017, because the fused multiply-add rounds the two products differently. This value is later compared against zero and used inside a SQRT, which amplifies the O(EPS) error to ~10^-9:

  IF( C.NE.ZERO ) THEN
     IF( B.NE.ZERO ) THEN
        IF( SIGN( ONE, B ).EQ.SIGN( ONE, C ) ) THEN
*
*                    Real eigenvalues: reduce to upper triangular form
*
           SAB = SQRT( ABS( B ) )
           SAC = SQRT( ABS( C ) )
           P = SIGN( SAB*SAC, C )
           TAU = ONE / SQRT( ABS( B+C ) )
           A = TEMP + P
           D = TEMP - P
           .
           .
           .
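
To make the amplification concrete: since P = SIGN( SQRT(|B|)*SQRT(|C|), C ) and the eigenvalues become TEMP +/- P (the A and D assignments above), an O(EPS) spurious B shows up as an O(sqrt(EPS)) shift. A minimal C sketch, using the reported value of B and an assumed magnitude for the intermediate C:

#include <math.h>
#include <stdio.h>

int main(void) {
  double b = -2.5189846806723163e-17; // spurious value reported above; should be exactly 0
  double c = -4.0;                    // assumed magnitude of the intermediate C

  // P = SIGN( SQRT( ABS( B ) ) * SQRT( ABS( C ) ), C ) as in dlanv2.f
  double p = copysign(sqrt(fabs(b)) * sqrt(fabs(c)), c);

  printf("|P| = %g\n", fabs(p)); // ~1.0e-8, the order of the observed eigenvalue error
  return 0;
}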

Maybe an alternative approach would be to compare against machine accuracy instead of zero, but that's not my field of expertise.

martin-frbg commented 3 months ago

This is typically seen with the testsuite (#732, #744, and other issues referenced therein), and it only gets worse when one uses something other than the reference BLAS, since optimized BLAS libraries leverage FMA for performance. But I am not sure it is still possible to "give a build option that disables FMA" across all platforms and compilers (unless it is an archived copy of gcc 3 or similar).

angsch commented 3 months ago

In many cases FMA improves the accuracy because two operations are executed with only a single rounding. However, there are examples where FMA harms accuracy. The example that you give actually coincides with one that Kahan used to illustrate that FMA can lead to a loss of accuracy if not used with care.

Since there are plenty of examples where FMA is beneficial (notably dot products/GEMM), forbidding FMA does not sound like a viable choice. If there are occurrences of $x^2 - y^2$ that are known to be critical, maybe the best approach is to add parentheses and prevent the compiler from using FMA in this particular expression?
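
For anyone who wants to see Kahan's point in isolation, here is a minimal C sketch that forces the contraction explicitly with the C99 fma() function (the value of x is arbitrary, chosen so that its square is inexact):

#include <math.h>
#include <stdio.h>

int main(void) {
  double x = 1.0 + 0x1p-29; // any x whose square is not exactly representable
  double y = x;

  // Without FMA both products round identically, so the difference is exactly 0.
  double separate = (x * x) - (y * y);

  // With FMA only y*y is rounded; the exact x*x then exposes that rounding error.
  double fused = fma(x, x, -(y * y));

  printf("separate: %g\n", separate); // 0
  printf("fused:    %g\n", fused);    // ~3.5e-18: the rounding error of y*y
  return 0;
}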

Maybe an alternative approach would be to compare against machine accuracy instead of zero

I second this.

ACSimon33 commented 2 months ago

If there are occurrences that are known to be critical, maybe the best approach is to add parentheses and prevent the compiler from using FMA in this particular expression?

I like that approach even more. I tested it for my example, and it works. @angsch Should I create an MR with the changed parentheses, or do you think the comparison against machine epsilon is better?

angsch commented 2 months ago

I support the fix with parentheses. As far as I know, compilers respect parentheses at all optimization levels, except when non-IEEE-compliant/unsafe math optimizations are enabled (-ffast-math). So the fix will also work in optimized builds that use FMA and -O3. That leaves the question of how to find the occurrences that should receive parentheses. In the long term, I think that a numerical test is the right way forward. Before spending time on either solution, maybe @langou can share his take on the issue.

langou commented 2 months ago

Thanks a lot for all the details about this issue! Very interesting!

Short version (TL;DR), my opinion is:

I am not sure we should consider it a problem if the computed eigenvalues in single precision are '[1.999999988777289 2.000000011222711]' (instead of 2 and 2). The output is not "wrong" in a single-precision sense.

In general, using FMA tends to give better accuracy than not using it. I understand that, in this example, FMA deteriorates the accuracy, but I am not convinced that FMA always deteriorates the accuracy in this situation: sometimes it might deteriorate the accuracy (for example, @ACSimon33's example), and sometimes (maybe) it might improve it. For this example, with these numerical values, it seems that FMA deteriorates the accuracy because it destroys the "symmetry" that we have in the expression BB*CS + DD*SN and in the values of our matrix. But if the values in our matrix are not that symmetric, I would expect FMA to help. (I do not know, this is an assumption.)

So, all in all,

(*) I do not think we should disable FMA with a compiler flag, certainly not for the whole library, and even for a specific routine I am leery about it. (Compilation flags change over time, vary across compilers, etc.)

(*) I have mixed feelings about using parentheses. I agree with @angsch's point: “If there are occurrences that are known to be critical, maybe the best approach is to add parentheses and prevent the compiler from using FMA in this particular expression?” It seems that there are interesting specific cases where a symmetry occurs naturally in the input data, so not using FMA would help these instances. (Which in practice do not seem that far-fetched.) The potential loss of accuracy from not using FMA in other cases seems acceptable. So yes, I am in favor of using parentheses to (attempt to) prevent FMA.

(*) As for using a numerical test against machine precision: I read the code a few times, and, as far as I can see, I am not in favor of it. This does not sound like a good idea to me.

ACSimon33 commented 2 months ago

I am not sure we should consider it a problem if the computed eigenvalues in single precision are '[1.999999988777289 2.000000011222711]' (instead of 2 and 2). The output is not "wrong" in a single-precision sense.

To clarify, [1.999999988777289 2.000000011222711] is the result in double precision. The result in single precision is [1.999622344970703 2.000377655029297]. The error is always around sqrt(EPS).
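
A quick sanity check of that sqrt(EPS) claim against the float.h machine epsilons:

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
  printf("sqrt(DBL_EPSILON) = %g\n", sqrt(DBL_EPSILON)); // ~1.5e-8, observed error ~1.1e-8
  printf("sqrt(FLT_EPSILON) = %g\n", sqrt(FLT_EPSILON)); // ~3.5e-4, observed error ~3.8e-4
  return 0;
}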

More insights:

When using diagonal scaling (BALANC='S' or BALANC='B'), the single-precision result is not exact even without FMAs, but the error is around EPS ([1.999999761581421 1.999999761581421]), which is fine. With FMAs and diagonal scaling, the error actually manifests in the imaginary parts of the eigenvalues, [1.999999761581421+0.000318597536534i 1.999999761581421-0.000318597536534i], which might be even more problematic since the eigenvalues should be real with no imaginary part. This is also fixable via the parentheses.

I'll go ahead and create a draft for the MR.

martin-frbg commented 2 months ago

In light of this, may I ask about the current opinion on the similar patches for ?LAHQR proposed in #732 two years ago?

ACSimon33 commented 2 months ago

I tried to reproduce #732 on the current master, and it seems that the double-precision complex errors are already resolved, and only some of the single-precision errors remain:

                        -->   LAPACK TESTING SUMMARY  <--
                Processing LAPACK Testing output found in the TESTING directory
SUMMARY                 nb test run     numerical error         other error  
================        ===========     =================       ================  
REAL                    1561872         1       (0.000%)        0       (0.000%)
DOUBLE PRECISION        1570470         0       (0.000%)        0       (0.000%)
COMPLEX                 1021282         164     (0.016%)        0       (0.000%)
COMPLEX16               1030797         0       (0.000%)        0       (0.000%)

--> ALL PRECISIONS      5184421         165     (0.003%)        0       (0.000%)

Unfortunately, applying the proposed patches only reduces the number of COMPLEX errors from 164 to 150 rather than eliminating them. Mind that I'm not testing under the same conditions as in the issue: I'm using GFortran 14.1 on an AMD Epyc CPU (Zen) instead of GFortran 11.3 on Haswell.