Nothing that would be already known, I think. If it has similar characteristics to #189, the SVE kernels for DOT would be suspect - those were added in 0.3.22 and last modified in 0.3.26, but that would not explain the NeoverseN1/N2 difference. (In fact the only difference in kernel inventory between the two is in the NRM2 kernels - dznrm2_thunderx2t99.c saw repeated changes up to 0.3.24 to avoid generating NaNs when the input data contained +/-Infinity, though I believe most of the "critical" work in that regard happened before 0.3.21.)
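To illustrate the failure class those changes address (a hypothetical sketch, not the actual thunderx2t99 kernel code): a two-pass "scaled" norm first finds the maximum magnitude and then sums the squares of the scaled elements. If the input contains +/-Infinity, the scale itself becomes Inf, Inf/Inf evaluates to NaN, and the NaN propagates into the result even though the mathematically expected norm is Inf.

#include <math.h>
#include <stdio.h>

/* Hypothetical two-pass "scaled" nrm2, for illustration only -
 * this is not the OpenBLAS kernel code. */
static double scaled_nrm2(int n, const double *x) {
    double scale = 0.0, ssq = 0.0;
    for (int i = 0; i < n; i++)
        if (fabs(x[i]) > scale) scale = fabs(x[i]);   /* scale becomes Inf here */
    if (scale == 0.0) return 0.0;
    for (int i = 0; i < n; i++) {
        double t = x[i] / scale;                      /* Inf/Inf -> NaN */
        ssq += t * t;
    }
    return scale * sqrt(ssq);                         /* Inf * NaN -> NaN */
}

int main(void) {
    double x[3] = { 1.0, INFINITY, 2.0 };
    printf("%f\n", scaled_nrm2(3, x));                /* prints nan, expected inf */
    return 0;
}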
Is it possible to get an isolated reproducer, or would one have to build kaldi and run its test suite?
A simple reproducer would indeed be great
You could try changing the NRM2 kernels in KERNEL.NEOVERSEN2 and KERNEL.ARMV8SVE to
SNRM2KERNEL = nrm2.S
DNRM2KERNEL = nrm2.S
CNRM2KERNEL = znrm2.S
ZNRM2KERNEL = znrm2.S
as used in KERNEL.NEOVERSEN1 to see if my guess is correct... in particular the CNRM2KERNEL and ZNRM2KERNEL lines
Thanks, Martin.
I ran some tests overnight. It seems v0.3.22 and v0.3.25 also work without the NaN issue, whereas v0.3.26 has it.
I'll run the test you suggested today and report on it.
Just patching the KERNEL.* files and rebuilding should be enough, right? For some reason that doesn't seem to be fixing the NaN.
Still trying for a small case, but it's proving a bit hard to construct.
Yes, this should be sufficient. (But maybe the issue is not in NRM2 after all - although that is the only relevant difference between the Neoverse N1 and N2 targets.) Or we are chasing an uninitialized variable or register, and the NaN comes and goes depending on previous content :(
I built OpenBLAS on the host directly (I was cross-compiling before), and it seems the DDOT test is failing, which might explain the kaldi error (looking at where it happens in the code, it's definitely in DOT).
./ctest/xdcblat1
Real CBLAS Test Program Results
Test of subprogram number 1 CBLAS_DDOT FAIL
CASE   N  INCX  INCY  MODE   I       COMP(I)       TRUE(I)  DIFFERENCE  SIZE(I)
   1   4     2    -2  9999   1    0.48000000    0.85000000     -0.3700   3.2000
   1   4    -2     1  9999   1   -0.68000000   -0.74000000      0.0600   3.2000
   1   4    -1    -2  9999   1    0.64000000    1.27000000     -0.6300   3.2000
Test of subprogram number 2 CBLAS_DAXPY ----- PASS -----
Test of subprogram number 3 CBLAS_DROTG ----- PASS -----
Test of subprogram number 4 CBLAS_DROT ----- PASS -----
Test of subprogram number 5 CBLAS_DCOPY ----- PASS -----
Test of subprogram number 6 CBLAS_DSWAP ----- PASS -----
Test of subprogram number 7 CBLAS_DNRM2 ----- PASS -----
Test of subprogram number 8 CBLAS_DASUM ----- PASS -----
Test of subprogram number 9 CBLAS_DSCAL ----- PASS -----
Test of subprogram number 10 CBLAS_IDAMAX ----- PASS -----
Test of subprogram number 11 CBLAS_DROTM ----- PASS -----
I had built it as follows:
TARGET=NEOVERSEN2 USE_THREAD=0 USE_LOCKING=1 USE_OPENMP=0 NO_FORTRAN=1 NUM_THREADS=256 DEBUG=1 make
This is with gcc 13.2 on a NixOS machine.
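In case it helps, here is a minimal standalone sketch of the failing cases (untested, and assuming the OpenBLAS cblas.h header is available): it calls cblas_ddot with the same non-unit/negative strides the ctest reports and compares against a plain reference loop.

#include <math.h>
#include <stdio.h>
#include <cblas.h>

/* Reference dot product following the BLAS convention for negative strides
 * (traversal starts from the far end of the vector). */
static double ref_ddot(int n, const double *x, int incx, const double *y, int incy) {
    double s = 0.0;
    int ix = (incx >= 0) ? 0 : (1 - n) * incx;
    int iy = (incy >= 0) ? 0 : (1 - n) * incy;
    for (int i = 0; i < n; i++, ix += incx, iy += incy)
        s += x[ix] * y[iy];
    return s;
}

int main(void) {
    double x[8], y[8];
    for (int i = 0; i < 8; i++) { x[i] = (i + 1) / 8.0; y[i] = (8 - i) / 8.0; }
    /* the stride combinations from the failing ctest cases (N = 4) */
    int cases[3][2] = { { 2, -2 }, { -2, 1 }, { -1, -2 } };
    for (int c = 0; c < 3; c++) {
        int incx = cases[c][0], incy = cases[c][1];
        double got  = cblas_ddot(4, x, incx, y, incy);
        double want = ref_ddot(4, x, incx, y, incy);
        printf("incx=%2d incy=%2d cblas=%g ref=%g %s\n", incx, incy, got, want,
               fabs(got - want) < 1e-12 ? "OK" : "MISMATCH");
    }
    return 0;
}

Something like "gcc repro.c -lopenblas" should be enough to build it, assuming the library is on the default search path.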
Oh, OK. The SVE kernel for DDOT was indeed modified in 0.3.26. Strange that this did not come up in CI.
Our CI still uses gcc 11.4 and runs on V1 (c7g instance), but the ddot kernel is pure inline assembly...
I'm using r7g, but looking at the AWS docs it doesn't seem it should be different from c7g. (?)
I tested a build with GCC 11.4, and get the same failure on that r7g host.
Could this perhaps be a case of misrecognized CPU at runtime with DYNAMIC_ARCH on?
> lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: ARM
BIOS Vendor ID: AWS
Model name: Neoverse-V1
BIOS Model name: AWS Graviton3 AWS Graviton3 CPU @ 2.6GHz
BIOS CPU family: 257
Model: 1
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: r1p1
BogoMIPS: 2100.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
Caches (sum of all):
L1d: 256 KiB (4 instances)
L1i: 256 KiB (4 instances)
L2: 4 MiB (4 instances)
L3: 32 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Mitigation; CSV2, BHB
Srbds: Not affected
Tsx async abort: Not affected
A misrecognized CPU should end up in NeoverseN1, or worst case the "generic ARMV8" target, so that seems unlikely. (And so far I cannot reproduce your DDOT error with gcc 12.)
I see. OK, perhaps a silly question since it's in pure assembly, but could it be related to my use of NO_FORTRAN=1?
That is about the only option I have not tried yet. If true, it would suggest a miscompilation of the ddot test rather than the actual BLAS kernel or library. (In the NO_FORTRAN case, f2c-generated C files are substituted for the Fortran sources - but these are limited to the imported Reference-LAPACK and the test/ctest folders that stem from the Reference BLAS.)
Tried running it on a c7g instance too. DDOT failed there too.
I'm not sure why 0.3.25 would work with NO_FORTRAN though.
Hmm... DDOT works when I leave the target name as NeoverseV1, but it fails when I compile for TARGET=NEOVERSEN2 on exactly the same hardware. Both targets use the same dot.c kernel, which employs dot_kernel_sve.c (the inline assembly) when HAVE_SVE is defined (which it is for both targets). About the only difference that I can see is in the -march/-mtune flags given to the compiler - but I do not think these should matter???
One oddity is that the inline assembly lacks a clobber list - could this be leading the compiler astray?
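For what it's worth, here is a minimal illustration of that hazard (generic GCC extended asm, not taken from dot_kernel_sve.c): every register the template writes must appear in the clobber list, otherwise the compiler may keep a live value in that register across the statement - and whether that actually bites depends on register allocation, which can shift with -march/-mtune and the surrounding code.

#include <stdio.h>

static long times_three(long v) {
    long out;
#if defined(__aarch64__)
    __asm__ volatile(
        "mov x5, #3       \n\t"   /* the template writes x5 ...               */
        "mul %0, %1, x5   \n\t"
        : "=r"(out)
        : "r"(v)
        : "x5");                  /* ... so x5 must be declared clobbered;
                                     omitting it is silently "fine" until the
                                     register allocator happens to use x5     */
#else
    out = v * 3;                  /* keep the sketch buildable elsewhere */
#endif
    return out;
}

int main(void) {
    printf("%ld\n", times_three(14));  /* 42 */
    return 0;
}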
Git bisect tells me that this is the first bad commit: 7a4fef4f604db2a5e4e0c4ffaebea220a0646ab1
Thanks - that sort of makes sense, as it is about the only recent change, and making something simpler and faster is always a bit dangerous... I still do not understand how this leads to failure on one platform and not on another closely related one...
Oh good, with the added clobbers the test passes now. Need to double-check that I did not change anything else...
Thank you!! I'll test with that commit and build the whole application including kaldi to see what happens.
Built kaldi (and the rest of the app), and the bug appears to be fixed with this change. I tried running it with each of NEOVERSE{N,V}{1,2} as the OPENBLAS_CORETYPE and the app was stable.
Thank you so much for such a quick fix! I'll look forward to the next release.
Well, thank you for your help, and for confirming that the fix actually works.
I recently ran into this issue with Kaldi's NaN assert check failing after I updated OpenBLAS: https://github.com/kaldi-asr/kaldi/blob/67548a31c45f93d8b25ee553c5969d6a6d5d9408/src/feat/mel-computations.cc#L239-L242
CPU causing the issue: AWS Graviton3 (r7g instance)
Seems to be similar to #189
Version known not affected: 0.3.21
Version known to be affected: 0.3.27
With 0.3.27, on an r7g instance, setting the coretype has these results:
I think there are two things I should do here:
I'll try to get these and report here, but in case there might be a usual suspect, I thought I'd report this first.