OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License
6.43k stars 1.51k forks source link

Kaldi's NAN assert on Graviton3 #4750

Closed happyalu closed 5 months ago

happyalu commented 5 months ago

I recently ran into this issue with Kaldi's NAN assert check failing after I updated openblas: https://github.com/kaldi-asr/kaldi/blob/67548a31c45f93d8b25ee553c5969d6a6d5d9408/src/feat/mel-computations.cc#L239-L242

CPU causing the issue: AWS Graviton3 (r7g instance)

Seems to be similar to #189

Version known not affected: 0.3.21 Version known to be affected: 0.3.27

With 0.3.27, on an r7g instance, setting the coretype has these results:

I think there are two things I should do here:

I'll try to get these and report here, but just in case there might be a usual suspect, I thought I'll report this first.

martin-frbg commented 5 months ago

nothing that would be already known,I think. If similar characteristics as #189 - SVE kernels for DOT were added in 0.3.22 and last modified in 0.3.26, but that would not explain NeoverseN1&2. (In fact the only difference in kernel inventory between the two is in the NRM2 kernels - dznrm2_thunderx2t99.c has seen repeat changes in 0.3.24 to avoid the generation of NANs when the input data contained +/-Infinity (though I believe most of the "critical" work in that regard happened before 0.3.21) Is it possible to get an isolated reproducer, or would one have to build kaldi and run its testsuite ? A simple reproducer would indeed be great

martin-frbg commented 5 months ago

You could try changing the NRM2 kernels in KERNEL.NEOVERSEN2 and KERNEL.ARMV8SVE to

SNRM2KERNEL    = nrm2.S
DNRM2KERNEL    = nrm2.S
CNRM2KERNEL    = znrm2.S
ZNRM2KERNEL    = znrm2.S

as used in KERNEL.NEOVERSEN1 to see if my guess is correct... in particular the CNRM2KERNEL and ZNRM2KERNEL lines

happyalu commented 5 months ago

Thanks, Martin.

I ran some tests overnight. It seems v0.3.22 and v0.3.25 also work without the NaN issue. Whereas v0.3.26 has the NaN issue.

I'll run the test you suggested today and report on it.

happyalu commented 5 months ago

Just patching the KERNEL.* files and rebuilding should be enough, right? For some reason that doesn't seem to be fixing the NaN.

Still trying for a small case, but it's being a bit hard to construct it.

martin-frbg commented 5 months ago

Yes, this should be sufficient. (But maybe the issue is not in NRM2 - although that is the only relevant difference between the Neoverse N1 and N2). Or we are chasing an uninitialized variable or register, and the NAN comes and goes depending on previous content :(

happyalu commented 5 months ago

I built openblas on the host directly (i was cross compiling before), and it seems the DDOT test is failing, which might explain the kaldi error (looking at where it happens in the code, it's definitely in DOT).

./ctest/xdcblat1
Real CBLAS Test Program Results
Test of subprogram number   1            CBLAS_DDOT                                         FAIL
CASE  N INCX INCY MODE  I                           COMP(I)                             TRUE(I)  DIFFERENCE     SIZE(I)
   1   4     2    -2  9999   1                           0.48000000                           0.85000000      -0.3700       3.2000
   1   4    -2     1  9999   1                          -0.68000000                          -0.74000000       0.0600       3.2000
   1   4    -1    -2  9999   1                           0.64000000                           1.27000000      -0.6300       3.2000
Test of subprogram number   2            CBLAS_DAXPY                                     ----- PASS -----
Test of subprogram number   3            CBLAS_DROTG                                     ----- PASS -----
Test of subprogram number   4            CBLAS_DROT                                      ----- PASS -----
Test of subprogram number   5            CBLAS_DCOPY                                     ----- PASS -----
Test of subprogram number   6            CBLAS_DSWAP                                     ----- PASS -----
Test of subprogram number   7            CBLAS_DNRM2                                     ----- PASS -----
Test of subprogram number   8            CBLAS_DASUM                                     ----- PASS -----
Test of subprogram number   9            CBLAS_DSCAL                                     ----- PASS -----
Test of subprogram number  10            CBLAS_IDAMAX                                    ----- PASS -----
Test of subprogram number  11            CBLAS_DROTM                                     ----- PASS -----

I had built it as follows: TARGET=NEOVERSEN2 USE_THREAD=0 USE_LOCKING=1 USE_OPENMP=0 NO_FORTRAN=1 NUM_THREADS=256 DEBUG=1 make

This is with gcc 13.2 on a nixos machine.

martin-frbg commented 5 months ago

Oh, ok. The SVE kernel for DDOT was indeed modified in 0.3.26. Strange how this did not come up in CI.

martin-frbg commented 5 months ago

Our CI still uses gcc 11.4 and runs on V1 (c7g instance), but the ddot kernel is pure inline assembly...

happyalu commented 5 months ago

I'm using r7g, but looking at the aws docs it doesn't seem it should be different than c7g. (?)

happyalu commented 5 months ago

I tested a build with GCC 11.4, and get the same failure on that r7g host.

Could this perhaps be a case of misrecognized CPU at runtime with DYNAMIC_ARCH on?

> lscpu
Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Byte Order:             Little Endian
CPU(s):                   4
  On-line CPU(s) list:    0-3
Vendor ID:                ARM
  BIOS Vendor ID:         AWS
  Model name:             Neoverse-V1
    BIOS Model name:      AWS Graviton3 AWS Graviton3 CPU @ 2.6GHz
    BIOS CPU family:      257
    Model:                1
    Thread(s) per core:   1
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             r1p1
    BogoMIPS:             2100.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpu
                          id asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve 
                          asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf
                          16 i8mm bf16 dgh rng
Caches (sum of all):      
  L1d:                    256 KiB (4 instances)
  L1i:                    256 KiB (4 instances)
  L2:                     4 MiB (4 instances)
  L3:                     32 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-3
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Mitigation; CSV2, BHB
  Srbds:                  Not affected
  Tsx async abort:        Not affected
martin-frbg commented 5 months ago

Misrecognized cpu should end up in NeoverseN1, or worst case the "generic ARMV8" target, so unlikely. (And so far I cannot reproduce your DDOT error with gcc12)

happyalu commented 5 months ago

i see. ok, perhaps a silly question since it's in pure assembly, but Could it be related to my use of NO_FORTRAN=1?

martin-frbg commented 5 months ago

that is about the only option I have not tried yet, if true it would suggest a miscompilation of the ddot test rather than the actual BLAS kernel or library. (As in the NO_FORTRAN case, f2c-generated C files would be substituted for the Fortran sources - but these are limited to the imported Reference-LAPACK and the test/ctest folders that stem from the Reference BLAS).

happyalu commented 5 months ago

Tried running it on a c7g instance too. DDOT failed there too.

I'm not sure why 0.3.25 would work with NO_FORTRAN though.

martin-frbg commented 5 months ago

Hmm... DDOT works when I leave the target name as NeoverseV1, but it fails when I compile for TARGET=NEOVERSEN2 on exactly the same hardware. Both targets use the same dot.c kernel that employs dot_kernel_sve.c (the inline assembly) when HAVE_SVE is defined (which it is for both targets). About the only difference that I can see is in the -march/-mtune flags given to the compiler - but I do not think these should matter ??? One strangeness is that the inline assembly lacks a clobber list, should this be leading the compiler astray ?

happyalu commented 5 months ago

Git bisect tells me that this is the first bad commit: 7a4fef4f604db2a5e4e0c4ffaebea220a0646ab1

martin-frbg commented 5 months ago

Thanks - sort-of makes sense as it is about the only recent change, and making something simpler&faster is always a bit dangerous... still do not understand how this leads to failure on one platform and not another closely related...

martin-frbg commented 5 months ago

Oh good,with the added clobbers the test passes now. Need to double-check that I did not change anything else...

happyalu commented 5 months ago

Thank you!! I'll test with that commit and build the whole application including kaldi to see what happens.

happyalu commented 5 months ago

Built kaldi (and the rest of the app) and the bug appears to be fixed with this change. I tried running it NEOVERSE{N,V}{1,2} as the OPENBLAS_CORETYPE and the app was stable.

Thank you so much for such a quick fix! I'll look forward to the next release.

martin-frbg commented 5 months ago

Well, thank you for your help,and for confirming that the fix actually works