flame / blis

BLAS-like Library Instantiation Software Framework

LAPACK test segfault on zen/zen2/zen3 at bli_sgemmsup_rd_haswell_asm_1x16n #821

Closed j-bm closed 3 months ago

j-bm commented 4 months ago

I am building BLIS on OpenBSD (-current, that is, the most recent development version).

Configuration argument: x86_64
compiler: clang version 16.0.6
cpu: cpu0: AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics, 3307.99 MHz, 19-74-01

cpu0: cpuid 1 edx=78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2> ecx=f6f83203<SSE3,PCLMUL,SSSE3,FMA3,CX16,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,HV>

I built LAPACK version 3.8.0 and ran the test code:

$ LIN/xlintsts < stest.in
 Tests of the REAL LAPACK routines
 LAPACK VERSION 3.8.0

 The following parameter values will be used:
    M   :       0     1     2     3     5    10    50
    N   :       0     1     2     3     5    10    50
    NRHS:       1     2    15
    NB  :       1     3     3     3    20
    NX  :       1     0     5     9     1
    RANK:      30    50    90
...[stuff omitted]...

 All tests for SQT routines passed the threshold (    510 tests run)

 SXQ routines passed the tests of the error exits

Program received signal SIGSEGV, Segmentation fault.
0x000008dcc789a771 in bli_sgemmsup_rd_haswell_asm_1x16n ()
(gdb) bt
#0  0x000008dcc789a771 in bli_sgemmsup_rd_haswell_asm_1x16n ()
#1  0x000008dcc789a21a in bli_sgemmsup_rd_haswell_asm_6x16n ()
#2  0x000008dcc7731b57 in bli_gemmsup_ref_var1n ()
#3  0x000008dcc772df15 in bli_gemmsup_int ()
#4  0x000008dcc7735bba in bli_l3_sup_thread_decorator_entry ()
#5  0x000008dcc7735ae7 in bli_l3_sup_thread_decorator ()
#6  0x000008dcc772c00b in bli_gemmsup_ref ()
#7  0x000008dcc7d341fb in bli_gemmsup ()
#8  0x000008dcc7d32f8b in bli_gemm_ex ()
#9  0x000008dcc7d317c3 in sgemm_ ()
#10 0x000008dcc75fe1ca in slqt05_ ()
#11 0x000008dcc75f073f in schklqtp_ ()
#12 0x000008dcc7574b6f in MAIN__ ()
#13 0x000008dcc7574ddb in main ()

(gdb) x/i $pc
=> 0x186c5369771 <bli_sgemmsup_rd_haswell_asm_1x16n+721>:       vmovss (%rax,%r8,1),%xmm1
(gdb)
   0x186c5369777 <bli_sgemmsup_rd_haswell_asm_1x16n+727>:       add    $0x4,%rax
(gdb)
   0x186c536977b <bli_sgemmsup_rd_haswell_asm_1x16n+731>:       vmovss (%rbx),%xmm3
(gdb)
   0x186c536977f <bli_sgemmsup_rd_haswell_asm_1x16n+735>:       vfmadd231ps %ymm0,%ymm3,%ymm4
(gdb)
   0x186c5369784 <bli_sgemmsup_rd_haswell_asm_1x16n+740>:       vmovss (%rbx,%r11,1),%xmm3
(gdb)
   0x186c536978a <bli_sgemmsup_rd_haswell_asm_1x16n+746>:       vfmadd231ps %ymm0,%ymm3,%ymm7
(gdb)
   0x186c536978f <bli_sgemmsup_rd_haswell_asm_1x16n+751>:       vmovss (%rbx,%r11,2),%xmm3
(gdb)
   0x186c5369795 <bli_sgemmsup_rd_haswell_asm_1x16n+757>:       vfmadd231ps %ymm0,%ymm3,%ymm10

Experimenting with export BLIS_ARCH_TYPE= shows that zen, zen2, and zen3 all fail exactly as above. BLIS_ARCH_TYPE=4 (sandybridge) succeeds, as does penryn.

It seems to be the SQZ and STQ tests that fail.

devinamatthews commented 4 months ago

Would it be possible to extract the specific sgemm parameters leading to this in order to create a MWE?

j-bm commented 4 months ago

Just a remark -- I deleted my last two comments as the test code in them was incorrect.

Better code to come!

j-bm commented 4 months ago

Here is a test code with some assertions included.

$ cat sgemmtest.f90
program sgemmtest
   IMPLICIT NONE

   REAL, ALLOCATABLE ::  Q(:, :), A(:, :), R(:, :)

   REAL ONE, ZERO
   PARAMETER(ONE=1.0, ZERO=0.0)

   INTEGER L, M, N

   INTRINSIC MAX, MIN

   M = 50
   N = 10

   L = MAX(M, N, 1)

   ALLOCATE (Q(L, L), A(M, N), R(M, L))

   CALL SLASET('A', M, N, ONE, ONE, A, M)
   CALL SLASET('A', L, L, ONE, ONE, Q, L)

   print *,'sgemmtest:'
   print *,'   M = ',M,' N = ',N,' L = ',L
   print *,'   A(1,1) is ',A(1,1),' Q(1,1) is ',Q(1,1)
   print *,' '
   print *,' R = Q**T * A, except Q is square but we use MxN of it'
   print *,' assert sum(A)==M*N*ONE is ', M*N*ONE == SUM(A)
   print *,' assert sum(Q)==L*L*ONE is ', L*L*ONE == SUM(Q)
   CALL SGEMM('T', 'N', M, N, M, ONE, Q, M, A, M, ZERO, R, M)
   print *,' r11 ',r(1,1), ' r211', r(2,1), ' rml ',r(M,L)
   print *,' matrix of MxN filled with M:'
   print *,' assert sum(R)==M*N*M is ', SUM(R) - M*N*M*ONE == ZERO

   print *,' done'
end
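
As a cross-check on what the driver above expects, the same product can be computed without BLAS. This is a minimal Python sketch, not part of the original report; the helper name gemm_tn_ones is hypothetical. With Q and A filled with ones, every entry of the M-by-N block of R = Q**T * A should equal M, so the entries written by SGEMM should sum to M*N*M.

```python
def gemm_tn_ones(m, n, k):
    """Reference R = Q**T * A for all-ones Q (k x m) and A (k x n)."""
    q = [[1.0] * m for _ in range(k)]   # q[row][col]: k rows, m columns
    a = [[1.0] * n for _ in range(k)]   # a[row][col]: k rows, n columns
    # R(i, j) = sum over p of Q(p, i) * A(p, j)
    return [[sum(q[p][i] * a[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

M, N = 50, 10
R = gemm_tn_ones(M, N, M)               # the driver passes K = M
total = sum(sum(row) for row in R)
print(R[0][0], R[1][0], total)
```

This only covers the M-by-N block that SGEMM writes; the driver allocates R as M-by-L, and the remaining columns are not touched by the call.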

Here is a successful run:

$ export GFORTRAN_UNBUFFERED_ALL=1
$ export MALLOC_OPTIONS=CFG
$ export BLIS_ARCH_DEBUG=1
$ ./tblis.x
 sgemmtest:
    M =           50  N =           10  L =           50
    A(1,1) is    1.00000000      Q(1,1) is    1.00000000

  R = Q**T * A, except Q is square but we use MxN of it
  assert sum(A)==M*N*ONE is  T
  assert sum(Q)==L*L*ONE is  T
libblis: selecting sub-configuration 'zen3'.
  r11    50.0000000      r211   50.0000000      rml    0.00000000
  matrix of MxN filled with M:
  assert sum(R)==M*N*M is  T
  done

Here is an unsuccessful run:

$ egdb -q tblis.x
Reading symbols from tblis.x...
(gdb) run
Starting program: /home/jal/checkblis/tblis.x
 sgemmtest:
    M =           50  N =           10  L =           50
    A(1,1) is    1.00000000      Q(1,1) is    1.00000000

  R = Q**T * A, except Q is square but we use MxN of it
  assert sum(A)==M*N*ONE is  T
  assert sum(Q)==L*L*ONE is  T
libblis: selecting sub-configuration 'zen3'.

Program received signal SIGSEGV, Segmentation fault.
0x00000aad13098881 in bli_sgemmsup_rd_haswell_asm_1x16n ()
   from /usr/local/lib/libblis.so.0.0

(gdb) bt
#0  0x00000aad13098881 in bli_sgemmsup_rd_haswell_asm_1x16n ()
   from /usr/local/lib/libblis.so.0.0
#1  0x00000aad1309832a in bli_sgemmsup_rd_haswell_asm_6x16n ()
   from /usr/local/lib/libblis.so.0.0
#2  0x00000aad136fac87 in bli_gemmsup_ref_var1n ()
   from /usr/local/lib/libblis.so.0.0
#3  0x00000aad136f8a55 in bli_gemmsup_int ()
   from /usr/local/lib/libblis.so.0.0
#4  0x00000aad136f86aa in bli_l3_sup_thread_decorator_entry ()
   from /usr/local/lib/libblis.so.0.0
#5  0x00000aad136f85d7 in bli_l3_sup_thread_decorator ()
   from /usr/local/lib/libblis.so.0.0
#6  0x00000aad136f9f7b in bli_gemmsup_ref ()
   from /usr/local/lib/libblis.so.0.0
#7  0x00000aad136f82db in bli_gemmsup () from /usr/local/lib/libblis.so.0.0
#8  0x00000aad136f6a1b in bli_gemm_ex () from /usr/local/lib/libblis.so.0.0
#9  0x00000aad137464e3 in sgemm_ () from /usr/local/lib/libblis.so.0.0
#10 0x00000aab12ff8ca5 in sgemmtest () at sgemmtest.f90:30
#11 0x00000aab12ff9059 in main (argc=1, argv=0x7571737f3dd0)
    at sgemmtest.f90:36
#12 0x00000aab12ff7e7b in _start ()

devinamatthews commented 4 months ago

@fgvanzee since you're the most familiar with this code, could you take a look?

fgvanzee commented 4 months ago

@devinamatthews Sure, I'll see what I can figure out.

fgvanzee commented 4 months ago

@j-bm I used your Fortran driver (thanks for providing that!), but was unable to reproduce your issue. :-\

  1. What version/commit of BLIS are you using?
  2. Is it vanilla or from AMD?
  3. Assuming you built from source, how did you configure your copy of BLIS?

j-bm commented 4 months ago

This is running flame/blis version 1.0.0 .tar.gz file.

Built on OpenBSD-current (as of a few weeks ago) using

o75snap$ egfortran --version
GNU Fortran (GCC) 8.4.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

o75snap$ cc --version
OpenBSD clang version 16.0.6
Target: amd64-unknown-openbsd7.5
Thread model: posix
InstalledDir: /usr/bin

$ gmake showconfig
configuration family:       x86_64
sub-configurations:         skx knl haswell sandybridge penryn zen3 zen2 zen excavator steamroller piledriver bulldozer generic
requisite kernels sets:     skx knl sandybridge penryn zen3 zen2 haswell zen piledriver bulldozer generic
kernel-to-config map:       bulldozer:bulldozer generic:generic haswell:haswell knl:knl penryn:penryn piledriver:piledriver sandybridge:sandybridge skx:skx zen:zen zen2:zen2 zen3:zen3
-------------------------
BLIS version string:        1.0
.so major version:          0
.so minor.build vers:       0
install libdir:             /usr/local/lib
install includedir:         /usr/local/include
install sharedir:           /usr/local/share
debugging status:           off
enable AddressSanitizer?    no
enabled threading model(s): single
enable BLAS API?            yes
enable CBLAS API?           yes
build static library?       yes
build shared library?       yes
ARG_MAX hack enabled?       no

I rebuilt on another CPU (Intel i3-series instead of Ryzen) but had the same issue in the same routine. Both machines are Windows 10/11 hosts running VMware, so all my tests are on a VM guest, not actual hardware.

Some further notes and trials:

  1. Only happens on OpenBSD. Could not reproduce on Linux (MX 23.3, which is Debian-based).

  2. On this test program, it seems to happen only when running with the debugging/testing malloc (enabled by MALLOC_OPTIONS=CFG).

  3. The original test with the LAPACK program LIN/xlintsts resulted in segfaults with or without this memory allocation check.

  4. I tried the Microsoft mimalloc on MX Linux, thinking it was some kind of malloc issue not blis/fortran/os. No segfault found. Mimalloc does not have the same testing/debugging features as OpenBSD malloc.

  5. Segfault occurs about one third of the time on the test program. The LAPACK program LIN/xlintsts failures occur at different subtests and the program does not ever run to completion, because it has so many subtests.

  6. Some speculation on what could be happening:

j-bm commented 4 months ago

Here is a debugging run with a modified test program which prints the address of the allocated arrays.

code fragment:

   print *,'malloctest:'
   write (*,'(A,Z16)') '  location A is ', loc(A)
   write (*,'(A,Z16)') '  location Q is ', loc(Q)
   write (*,'(A,Z16)') '  location R is ', loc(R)
   CALL SGEMM('T', 'N', M, N, M, ONE, Q, M, A, M, ZERO, R, M)

debugger output:

(gdb) break *bli_sgemmsup_rd_haswell_asm_1x16n+721
Breakpoint 3 at 0xd826ecae9a1
(gdb) c
Continuing.
 malloctest:
  location A is      D82E68897D0
  location Q is      D8214FBA000
  location R is      D822BE43000
libblis: selecting sub-configuration 'zen3'.

Breakpoint 3, 0x00000d826ecae9a1 in bli_sgemmsup_rd_haswell_asm_1x16n ()
   from /usr/local/lib/libblis.so.0.0
(gdb) x/i $pc
=> 0xd826ecae9a1 <bli_sgemmsup_rd_haswell_asm_1x16n+721>:
    vmovss (%rax,%r8,1),%xmm1
(gdb) info reg r8 rax rsp
r8             0xc8                200
rax            0xd82e6889f98       14855864623000
rsp            0x734c53fa1830      0x734c53fa1830
(gdb) p 0xd82e6889f98 - 0xD82E68897D0
$1 = 1992

This shows that $rax is pointing at the very end of array A: byte offset 1992 of the 2000 bytes that A(50,10) occupies, i.e. the second-to-last element.

(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00000d826ecae9a1 in bli_sgemmsup_rd_haswell_asm_1x16n ()
   from /usr/local/lib/libblis.so.0.0
(gdb) p/x *(0xD82E68897D0)
$2 = 0x3f800000
(gdb) p/x *($rax)
$3 = 0x3f800000

The dereference fails:

(gdb) p/x *($rax+$r8)
Cannot access memory at address 0xd82e688a060

This suggests a bug: the kernel is reading beyond the end of array A.
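
The address arithmetic from the session above can be replayed as a small Python sketch. The addresses and register values are copied from the debugger output; the 4096-byte page size is an assumption about this system, and the variable names are only illustrative.

```python
ELEM_BYTES = 4            # single-precision REAL
M, N = 50, 10             # A is allocated as A(M, N)
PAGE = 4096               # assumed page size, not stated in the report

a_base = 0xD82E68897D0    # "location A is" from the program output
rax = 0xD82E6889F98       # from "info reg"
r8 = 0xC8                 # from "info reg" (200 = M * 4 bytes)

a_bytes = M * N * ELEM_BYTES      # bytes occupied by A
rax_off = rax - a_base            # where rax points, relative to A
fault = rax + r8                  # address read by vmovss (%rax,%r8,1)
fault_off = fault - a_base        # the same address, relative to A

# Does the faulting read land on the page that holds the last byte of A?
same_page = (fault // PAGE) == ((a_base + a_bytes - 1) // PAGE)

print(f"A occupies {a_bytes} bytes; rax is at offset {rax_off}")
print(f"read at {fault:#x}, offset {fault_off}, same page as A: {same_page}")
```

Under these assumptions the read lands 192 bytes past the end of A and on the following page, which is consistent with it faulting only when the allocator places A right at the end of a mapped page.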

fgvanzee commented 3 months ago

@j-bm Thank you for those additional details, they were quite helpful! I think you helped us narrow it down to the last phase of the edge case handling code in the offending s1x16n kernel.

In kernels/haswell/3/sup/bli_gemmsup_rd_haswell_asm_s6x16n.c, the line at 2214 or 2215 (marked in the snippet below) appears not to belong there and should be deleted. (You can see it on line 1708 in the s2x16n version of the kernel, so this is very likely a copy-paste bug.)

Please try deleting this line and let us know if it fixes the bug.

    label(.SLOOPKLEFT1)                // EDGE LOOP (scalar)
                                       // NOTE: We must use ymm registers here bc
                                       // using the xmm registers would zero out the
                                       // high bits of the destination registers,
                                       // which would destroy intermediate results.

    vmovss(mem(rax       ), xmm0)
    vmovss(mem(rax, r8, 1), xmm1)     // ***TRY DELETING THIS LINE
    add(imm(1*4), rax)                 // a += 1*cs_a = 1*4;

    vmovss(mem(rbx        ), xmm3)
    vfmadd231ps(ymm0, ymm3, ymm4)

    vmovss(mem(rbx, r11, 1), xmm3)
    vfmadd231ps(ymm0, ymm3, ymm7)

    vmovss(mem(rbx, r11, 2), xmm3)
    vfmadd231ps(ymm0, ymm3, ymm10)

    vmovss(mem(rbx, r13, 1), xmm3)
    add(imm(1*4), rbx)                 // b += 1*rs_b = 1*4;
    vfmadd231ps(ymm0, ymm3, ymm13)

    dec(rsi)                           // i -= 1;
    jne(.SLOOPKLEFT1)                  // iterate again if i != 0.

j-bm commented 3 months ago

Yes, that fixes the issue.

Did that extra instruction do anything important (other than segfaulting)?

fgvanzee commented 3 months ago

> Yes, that fixes the issue.

Great news. Thanks for your help!

> Did that extra instruction do anything important (other than segfaulting)?

No, it was 100% a copy-paste bug. I probably started with the 2x16 case and deleted instructions until it became a 1x16, but then forgot to delete that instruction (which would have loaded the second of the two elements of A).

I'll open a PR with the fix and credit you. I really appreciate your feedback!
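
To illustrate why the leftover load is harmless on most allocators but fatal next to a guard page, here is a simplified Python model of the scalar edge loop. It is a sketch, not the BLIS kernel: the function edge_loop_reads is hypothetical, the start offset and stride are taken from the debugger session earlier in this thread (1992 and 200 bytes, converted to elements), and k_left=2 is an assumption chosen so the loop ends at the last element of A.

```python
def edge_loop_reads(a_off, stride, k_left, stray_load):
    """Return the element indices of A read by a scalar k-edge loop.

    Simplified model: a 1x16 micro-tile has one live row of A, so only
    a[a_off] is needed per iteration. The stray load additionally reads
    one stride further, as the 2x16 variant would for its second row.
    """
    reads = []
    for _ in range(k_left):
        reads.append(a_off)                # vmovss(mem(rax), xmm0)
        if stray_load:
            reads.append(a_off + stride)   # vmovss(mem(rax, r8, 1), xmm1)
        a_off += 1                         # add(imm(1*4), rax)
    return reads

A_ELEMS = 50 * 10      # A(50, 10) from the test program
START = 1992 // 4      # rax offset from the debugger session, in elements
STRIDE = 200 // 4      # r8 from the debugger session, in elements

for stray in (True, False):
    reads = edge_loop_reads(START, STRIDE, k_left=2, stray_load=stray)
    oob = [i for i in reads if i >= A_ELEMS]
    print(f"stray_load={stray}: reads={reads}, out of bounds={oob}")
```

In this model the loaded value is never used, so the result is unaffected; the only observable effect is the extra read one stride past the data that is actually needed.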

fgvanzee commented 3 months ago

@j-bm Sorry for getting the line numbers a little wrong btw. I had forgotten that I had inserted printf() calls to signal the entry into and exit from that function (as a sanity check to make sure the right code was being called).

j-bm commented 3 months ago

Thanks for the quick fix!

fgvanzee commented 3 months ago

I'm going to close this issue now. If you encounter any further problems or concerns, please let us know.

BhaskarNallani commented 3 months ago

Hi @fgvanzee, for functionality tests, allocating the input matrices with a plain malloc() of the exact size, without any alignment padding, helps to find these out-of-bounds memory accesses. In addition, ASAN testing helps further.

fgvanzee commented 3 months ago

I completely agree. Thanks for that reminder, @BhaskarNallani!