Closed j-bm closed 3 months ago
Would it be possible to extract the specific sgemm
parameters leading to this in order to create an MWE?
Just a remark -- I deleted my last two comments as the test code in them was incorrect.
Better code to come!
Here is a test code with some assertions included.
$ cat sgemmtest.f90
program sgemmtest
IMPLICIT NONE
REAL, ALLOCATABLE :: Q(:, :), A(:, :), R(:, :)
REAL ONE, ZERO
PARAMETER(ONE=1.0, ZERO=0.0)
INTEGER L, M, N
INTRINSIC MAX, MIN
M = 50
N = 10
L = MAX(M, N, 1)
ALLOCATE (Q(L, L), A(M, N), R(M, L))
CALL SLASET('A', M, N, ONE, ONE, A, M)
CALL SLASET('A', L, L, ONE, ONE, Q, L)
print *,'sgemmtest:'
print *,' M = ',M,' N = ',N,' L = ',L
print *,' A(1,1) is ',A(1,1),' Q(1,1) is ',Q(1,1)
print *,' '
print *,' R = Q**T * A, except Q is square but we use MxN of it'
print *,' assert sum(A)==M*N*ONE is ', M*N*ONE == SUM(A)
print *,' assert sum(Q)==L*L*ONE is ', L*L*ONE == SUM(Q)
CALL SGEMM('T', 'N', M, N, M, ONE, Q, M, A, M, ZERO, R, M)
print *,' r11 ',r(1,1), ' r211', r(2,1), ' rml ',r(M,L)
print *,' matrix of MxN filled with M:'
print *,' assert sum(R)==M*N*M is ', SUM(R) - M*N*M*ONE == ZERO
print *,' done'
end
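For reference, the result the assertions expect can be worked out with a naive triple loop. The following is a minimal C sketch, not BLIS or reference-BLAS code: ref_sgemm_tn is a hypothetical helper that mirrors only the 'T','N' case of SGEMM on column-major data. With Q and A filled with ones and K = M = 50, every entry of the leading M x N block of R is 50, so that block sums to M*N*M = 25000. Note that R is allocated as M x L while only its leading M x N block is written, so SUM(R) in the test also relies on the remaining entries being zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference for C := alpha * A**T * B + beta * C on
   column-major data (the 'T','N' case used by the test program).
   A is k x m with leading dimension lda, B is k x n with leading
   dimension ldb, C is m x n with leading dimension ldc. */
void ref_sgemm_tn(size_t m, size_t n, size_t k, float alpha,
                  const float *a, size_t lda,
                  const float *b, size_t ldb,
                  float beta, float *c, size_t ldc)
{
    assert(lda >= k && ldb >= k && ldc >= m);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            float dot = 0.0f;
            /* Row i of A**T is column i of A. */
            for (size_t p = 0; p < k; ++p)
                dot += a[p + i * lda] * b[p + j * ldb];
            /* As in BLAS, C is not read when beta is zero. */
            c[i + j * ldc] = (beta == 0.0f)
                                 ? alpha * dot
                                 : alpha * dot + beta * c[i + j * ldc];
        }
    }
}
```

Calling ref_sgemm_tn(50, 10, 50, 1.0f, q, 50, a, 50, 0.0f, r, 50) on all-ones inputs fills the 50 x 10 result with 50.0.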
Here is a successful run:
$ export GFORTRAN_UNBUFFERED_ALL=1
$ export MALLOC_OPTIONS=CFG
$ export BLIS_ARCH_DEBUG=1
$ ./tblis.x
sgemmtest:
M = 50 N = 10 L = 50
A(1,1) is 1.00000000 Q(1,1) is 1.00000000
R = Q**T * A, except Q is square but we use MxN of it
assert sum(A)==M*N*ONE is T
assert sum(Q)==L*L*ONE is T
libblis: selecting sub-configuration 'zen3'.
r11 50.0000000 r211 50.0000000 rml 0.00000000
matrix of MxN filled with M:
assert sum(R)==M*N*M is T
done
Here is an unsuccessful run:
$ egdb -q tblis.x
Reading symbols from tblis.x...
(gdb) run
Starting program: /home/jal/checkblis/tblis.x
sgemmtest:
M = 50 N = 10 L = 50
A(1,1) is 1.00000000 Q(1,1) is 1.00000000
R = Q**T * A, except Q is square but we use MxN of it
assert sum(A)==M*N*ONE is T
assert sum(Q)==L*L*ONE is T
libblis: selecting sub-configuration 'zen3'.
Program received signal SIGSEGV, Segmentation fault.
0x00000aad13098881 in bli_sgemmsup_rd_haswell_asm_1x16n ()
from /usr/local/lib/libblis.so.0.0
(gdb) bt
#0 0x00000aad13098881 in bli_sgemmsup_rd_haswell_asm_1x16n ()
from /usr/local/lib/libblis.so.0.0
#1 0x00000aad1309832a in bli_sgemmsup_rd_haswell_asm_6x16n ()
from /usr/local/lib/libblis.so.0.0
#2 0x00000aad136fac87 in bli_gemmsup_ref_var1n ()
from /usr/local/lib/libblis.so.0.0
#3 0x00000aad136f8a55 in bli_gemmsup_int ()
from /usr/local/lib/libblis.so.0.0
#4 0x00000aad136f86aa in bli_l3_sup_thread_decorator_entry ()
from /usr/local/lib/libblis.so.0.0
#5 0x00000aad136f85d7 in bli_l3_sup_thread_decorator ()
from /usr/local/lib/libblis.so.0.0
#6 0x00000aad136f9f7b in bli_gemmsup_ref ()
from /usr/local/lib/libblis.so.0.0
#7 0x00000aad136f82db in bli_gemmsup () from /usr/local/lib/libblis.so.0.0
#8 0x00000aad136f6a1b in bli_gemm_ex () from /usr/local/lib/libblis.so.0.0
#9 0x00000aad137464e3 in sgemm_ () from /usr/local/lib/libblis.so.0.0
#10 0x00000aab12ff8ca5 in sgemmtest () at sgemmtest.f90:30
#11 0x00000aab12ff9059 in main (argc=1, argv=0x7571737f3dd0)
at sgemmtest.f90:36
#12 0x00000aab12ff7e7b in _start ()
@fgvanzee since you're the most familiar with this code, could you take a look?
@devinamatthews Sure, I'll see what I can figure out.
@j-bm I used your Fortran driver (thanks for providing that!), but was unable to reproduce your issue. :-\
This is running flame/blis version 1.0.0, built from the .tar.gz file.
Built on OpenBSD-current (as of a few weeks ago) using
o75snap$ egfortran --version
GNU Fortran (GCC) 8.4.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
o75snap$ cc --version
OpenBSD clang version 16.0.6
Target: amd64-unknown-openbsd7.5
Thread model: posix
InstalledDir: /usr/bin
$ gmake showconfig
configuration family: x86_64
sub-configurations: skx knl haswell sandybridge penryn zen3 zen2 zen excavator steamroller piledriver bulldozer generic
requisite kernels sets: skx knl sandybridge penryn zen3 zen2 haswell zen piledriver bulldozer generic
kernel-to-config map: bulldozer:bulldozer generic:generic haswell:haswell knl:knl penryn:penryn piledriver:piledriver sandybridge:sandybridge skx:skx zen:zen zen2:zen2 zen3:zen3
-------------------------
BLIS version string: 1.0
.so major version: 0
.so minor.build vers: 0
install libdir: /usr/local/lib
install includedir: /usr/local/include
install sharedir: /usr/local/share
debugging status: off
enable AddressSanitizer? no
enabled threading model(s): single
enable BLAS API? yes
enable CBLAS API? yes
build static library? yes
build shared library? yes
ARG_MAX hack enabled? no
I rebuilt on another CPU (Intel i3-series instead of Ryzen) but hit the same issue in the same routine. Both machines run Windows 10/11 with VMware, so all my tests are on a VM guest, not actual hardware.
Some further notes and trials:
Only happens on OpenBSD. Could not reproduce on Linux (MX 23.3 which is debian based).
On this test program, it seems to happen only when running with the debugging/testing malloc (enabled by MALLOC_OPTIONS=CFG).
The original test with the LAPACK program LIN/xlintsts resulted in segfaults with or without this memory allocation check.
I tried the Microsoft mimalloc on MX Linux, thinking it was some kind of malloc issue not blis/fortran/os. No segfault found. Mimalloc does not have the same testing/debugging features as OpenBSD malloc.
Segfault occurs about one third of the time on the test program. The LAPACK program LIN/xlintsts failures occur at different subtests and the program does not ever run to completion, because it has so many subtests.
Some speculation on what could be happening:
Here is a debugging run with a modified test program which prints the address of the allocated arrays.
code fragment:
print *,'malloctest:'
write (*,'(A,Z16)') ' location A is ', loc(A)
write (*,'(A,Z16)') ' location Q is ', loc(Q)
write (*,'(A,Z16)') ' location R is ', loc(R)
CALL SGEMM('T', 'N', M, N, M, ONE, Q, M, A, M, ZERO, R, M)
debugger output:
(gdb) break *bli_sgemmsup_rd_haswell_asm_1x16n+721
Breakpoint 3 at 0xd826ecae9a1
(gdb) c
Continuing.
malloctest:
location A is D82E68897D0
location Q is D8214FBA000
location R is D822BE43000
libblis: selecting sub-configuration 'zen3'.
Breakpoint 3, 0x00000d826ecae9a1 in bli_sgemmsup_rd_haswell_asm_1x16n ()
from /usr/local/lib/libblis.so.0.0
(gdb) x/i $pc
=> 0xd826ecae9a1 <bli_sgemmsup_rd_haswell_asm_1x16n+721>:
vmovss (%rax,%r8,1),%xmm1
(gdb) info reg r8 rax rsp
r8 0xc8 200
rax 0xd82e6889f98 14855864623000
rsp 0x734c53fa1830 0x734c53fa1830
(gdb) p 0xd82e6889f98 - 0xD82E68897D0
$1 = 1992
This shows that $rax is pointing 1992 bytes into array A. A holds 50*10 REALs of 4 bytes each (2000 bytes), so this is its second-to-last element.
(gdb) c
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x00000d826ecae9a1 in bli_sgemmsup_rd_haswell_asm_1x16n ()
from /usr/local/lib/libblis.so.0.0
(gdb) p/x *(0xD82E68897D0)
$2 = 0x3f800000
(gdb) p/x *($rax)
$3 = 0x3f800000
The dereference fails:
(gdb) p/x *($rax+$r8)
Cannot access memory at address 0xd82e688a060
This suggests a bug: the kernel is accessing memory beyond the end of array A.
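The address arithmetic above can be double-checked with a few lines of C. This is only a sketch: the helper names are made up for illustration, the addresses are the ones printed in the gdb session, and the array size assumes A is the 50 x 10 array of 4-byte REALs from the test program.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of address `addr` from the start of an array at `base`. */
int64_t byte_offset(uint64_t base, uint64_t addr)
{
    assert(addr >= base);
    return (int64_t)(addr - base);
}

/* Returns 1 if a 4-byte load at byte offset `off` stays inside an array
   of `elems` 4-byte elements, and 0 otherwise. */
int load_in_bounds(int64_t off, int64_t elems)
{
    return off >= 0 && off + 4 <= elems * 4;
}
```

With base = 0xD82E68897D0 and rax = 0xd82e6889f98, byte_offset gives 1992, which lies inside the 2000-byte array. Adding r8 = 200 gives offset 2192, which is 192 bytes past the end and corresponds to the faulting address 0xd82e688a060.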
@j-bm Thank you for those additional details, they were quite helpful! I think you helped us narrow it down to the last phase of the edge case handling code in the offending s1x16n kernel.
In kernels/haswell/3/sup/bli_gemmsup_rd_haswell_asm_s6x16n.c, line 2214 (corrected from 2215) appears to not belong there and should be deleted. (You can see it on line 1708 in the s2x16n version of the kernel; so this is very likely a copy-paste bug.)
Please try deleting this line and let us know if it fixes the bug.
label(.SLOOPKLEFT1) // EDGE LOOP (scalar)
// NOTE: We must use ymm registers here bc
// using the xmm registers would zero out the
// high bits of the destination registers,
// which would destroy intermediate results.
vmovss(mem(rax ), xmm0)
vmovss(mem(rax, r8, 1), xmm1) // ***TRY DELETING THIS LINE
add(imm(1*4), rax) // a += 1*cs_a = 1*4;
vmovss(mem(rbx ), xmm3)
vfmadd231ps(ymm0, ymm3, ymm4)
vmovss(mem(rbx, r11, 1), xmm3)
vfmadd231ps(ymm0, ymm3, ymm7)
vmovss(mem(rbx, r11, 2), xmm3)
vfmadd231ps(ymm0, ymm3, ymm10)
vmovss(mem(rbx, r13, 1), xmm3)
add(imm(1*4), rbx) // b += 1*rs_b = 1*4;
vfmadd231ps(ymm0, ymm3, ymm13)
dec(rsi) // i -= 1;
jne(.SLOOPKLEFT1) // iterate again if i != 0.
Yes, that fixes the issue.
Did that extra instruction do anything important (other than segfaulting)?
Yes, that fixes the issue.
Great news. Thanks for your help!
Did that extra instruction do anything important (other than segfaulting)?
No, it was 100% a copy-paste bug. I probably started with the 2x16 case and deleted instructions until it became a 1x16, but then forgot to delete that instruction (which would have loaded the second of the two elements of A).
I'll open a PR with the fix and credit you. I really appreciate your feedback!
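To illustrate why the stray load was harmless to the result but not to memory safety, here is a scalar C model of the edge loop. It is a sketch under assumptions, not the BLIS kernel: edge_loop_1x4 is a hypothetical function, strides are in elements rather than bytes, and the stray load is only recorded as an index instead of being performed, so the model itself never reads out of bounds.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of the k-remainder loop for ONE row of A against four
   columns of B. Returns the highest index of `a` that the loop touches
   (or would touch, if stray_load is nonzero). */
ptrdiff_t edge_loop_1x4(const float *a, ptrdiff_t rs_a,
                        const float *b, ptrdiff_t cs_b,
                        size_t k_left, int stray_load, float acc[4])
{
    ptrdiff_t max_a = -1;
    assert(a != NULL && b != NULL && acc != NULL);
    for (size_t k = 0; k < k_left; ++k) {
        ptrdiff_t kk = (ptrdiff_t)k;
        float a0 = a[kk];              /* models vmovss(mem(rax), xmm0) */
        if (kk > max_a) max_a = kk;
        if (stray_load) {
            /* Models vmovss(mem(rax, r8, 1), xmm1): it addresses the NEXT
               row of A, which a 1-row kernel does not own. The value is
               never used, so only the index is recorded here. */
            if (kk + rs_a > max_a) max_a = kk + rs_a;
        }
        for (int j = 0; j < 4; ++j)    /* four dot-product accumulators */
            acc[j] += a0 * b[kk + j * cs_b];
    }
    return max_a;
}
```

For a single row of A with two leftover k iterations and a row stride of 50 elements, the model reports index 51 as the highest element of A touched when the stray load is present and index 1 without it, while the four accumulators are identical in both cases.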
@j-bm Sorry for getting the line numbers a little wrong btw. I had forgotten that I had inserted printf() calls to signal the entry into and exit from that function (as a sanity check to make sure the right code was being called).
Thanks for the quick fix!
I'm going to close this issue now. If you encounter any further problems or concerns, please let us know.
Hi @fgvanzee, for functionality tests, allocating the input matrices with plain malloc(), which creates exactly the requested size without any alignment, helps to find these out-of-bounds memory accesses. In addition to that, ASAN testing helps further.
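One way to make such over-reads fail deterministically in a test driver is to place each matrix directly in front of an inaccessible page. The sketch below assumes a POSIX system with mmap() and mprotect(); alloc_floats_before_guard and free_floats_before_guard are hypothetical helper names, not part of BLIS or its test suite.

```c
#define _DEFAULT_SOURCE /* expose MAP_ANONYMOUS on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Number of whole pages needed to hold `bytes` bytes. */
static size_t pages_for(size_t bytes, size_t page)
{
    return (bytes + page - 1) / page;
}

/* Returns `count` floats whose last byte sits directly before a PROT_NONE
   guard page, or NULL on failure. */
float *alloc_floats_before_guard(size_t count)
{
    assert(count > 0);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t bytes = count * sizeof(float);
    size_t data_pages = pages_for(bytes, page);
    size_t total = (data_pages + 1) * page; /* data pages + one guard page */

    unsigned char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    unsigned char *guard = base + data_pages * page;
    if (mprotect(guard, page, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }
    /* Push the buffer up against the guard page. */
    return (float *)(guard - bytes);
}

/* Releases a buffer obtained from alloc_floats_before_guard(count). */
void free_floats_before_guard(float *p, size_t count)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t bytes = count * sizeof(float);
    size_t data_pages = pages_for(bytes, page);
    unsigned char *guard = (unsigned char *)p + bytes;
    munmap(guard - data_pages * page, (data_pages + 1) * page);
}
```

With buffers allocated this way, a load just past the last element, such as the stray vmovss discussed in this issue, lands in the guard page and raises SIGSEGV on every run.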
I completely agree. Thanks for that reminder, @BhaskarNallani!
Building blis on OpenBSD (-current, that is to say most recent development version).
Built LAPACK version 3.8.0, run the test code:
Experimenting with different values of BLIS_ARCH_TYPE (set via $ export BLIS_ARCH_TYPE=) shows that the zen, zen2 and zen3 settings fail exactly as above. BLIS_ARCH_TYPE=4 (sandybridge) succeeds, as does penryn.
It seems to be the SQZ and STQ tests that fail.