OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

segfault in dblat2 in dgemv_HASWELL in OpenBLAS 0.3.5 #2009

Closed susilehtola closed 5 years ago

susilehtola commented 5 years ago

Rebuilding OpenBLAS 0.3.5 on Fedora rawhide (to-be Fedora 30) results in a segfault in the tests:

OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./dblat2 < ./dblat2.dat
BUILDSTDERR: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
BUILDSTDERR: Backtrace for this error:
BUILDSTDERR: #0  0x7f21b3836ca1 in ???
BUILDSTDERR: #1  0x7f21b3835e65 in ???
BUILDSTDERR: #2  0x7f21b368717f in ???
BUILDSTDERR: #3  0x5578496125f2 in dgemv_kernel_4x1
BUILDSTDERR:    at ../kernel/x86_64/dgemv_n_4.c:143
BUILDSTDERR: #4  0x557849612958 in dgemv_n_HASWELL
BUILDSTDERR:    at ../kernel/x86_64/dgemv_n_4.c:318
BUILDSTDERR: #5  0x5578487bc333 in dgemv_
BUILDSTDERR:    at /builddir/build/BUILD/openblas-0.3.5/Rblas/interface/gemv.c:231
BUILDSTDERR: #6  0x5578487b6e8e in dchk1_
BUILDSTDERR:    at /builddir/build/BUILD/openblas-0.3.5/Rblas/test/dblat2.f:582
BUILDSTDERR: #7  0x5578487bbb02 in dblat2
BUILDSTDERR:    at /builddir/build/BUILD/openblas-0.3.5/Rblas/test/dblat2.f:305
BUILDSTDERR: #8  0x5578487ad052 in main
BUILDSTDERR:    at /builddir/build/BUILD/openblas-0.3.5/Rblas/test/dblat2.f:390
BUILDSTDERR: /bin/sh: line 1: 20348 Segmentation fault      (core dumped) OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./dblat2 < ./dblat2.dat
BUILDSTDERR: make[1]: *** [Makefile:32: level2] Error 139
BUILDSTDERR: make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/builddir/build/BUILD/openblas-0.3.5/Rblas/test'
make: Leaving directory '/builddir/build/BUILD/openblas-0.3.5/Rblas'
BUILDSTDERR: make: *** [Makefile:124: tests] Error 2

The compiler is

gcc                     x86_64 9.0.1-0.4.fc30                      build  22 M
gcc-gfortran            x86_64 9.0.1-0.4.fc30                      build  10 M

and the Rblas version of the library is compiled with

make -C Rblas TARGET=CORE2 DYNAMIC_ARCH=1 DYNAMIC_OLDER=1 USE_THREAD=0 USEOPENMP=0 FC=gfortran CC=gcc 'COMMON_OPT=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fPIC' 'FCOMMON_OPT=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fPIC -frecursive' NUM_THREADS=128 LIBPREFIX=libRblas LIBSONAME=libRblas.so INTERFACE64=0

Full build log at https://kojipkgs.fedoraproject.org//work/tasks/5620/32755620/build.log

martin-frbg commented 5 years ago

Looks like yet another instance of #1964: input operands being declared as read-only despite getting (ab)used as loop counters in the assembly.
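For readers unfamiliar with this class of bug, here is a minimal sketch of it in GCC extended inline assembly. It is not taken from the OpenBLAS kernels; the function name `sum_counted` and the loop itself are hypothetical and only illustrate the constraint problem. An operand listed as a plain input (`"r"`) is a promise to the compiler that the asm leaves the register unchanged, so a loop that decrements it can corrupt whatever the compiler keeps there afterwards. Declaring the operand as read-write (`"+r"`) states the modification honestly.

```c
#include <stddef.h>

/* Hypothetical example, not OpenBLAS code: sum n doubles with an asm loop
 * that advances the pointer and counts n down to zero. */
static double sum_counted(const double *x, size_t n)
{
    double acc = 0.0;
    if (n == 0)
        return acc;
#if defined(__x86_64__) && defined(__GNUC__)
    /* Broken variant (do not use): listing the pointer and counter as inputs,
     *     : [acc] "+x"(acc) : [p] "r"(x), [n] "r"(n) : "cc", "memory"
     * tells the compiler both registers still hold their old values after
     * the asm, although the loop below changes them. */
    __asm__ volatile(
        "1:\n\t"
        "addsd (%[p]), %[acc]\n\t"  /* acc += *p            */
        "addq  $8, %[p]\n\t"        /* advance to next item */
        "decq  %[n]\n\t"            /* count down           */
        "jnz   1b\n\t"
        : [acc] "+x"(acc), [p] "+r"(x), [n] "+r"(n) /* all three are modified */
        :
        : "cc", "memory");
#else
    /* Portable fallback so the sketch builds on other targets. */
    for (size_t i = 0; i < n; i++)
        acc += x[i];
#endif
    return acc;
}
```

Whether the broken variant actually crashes depends on how the compiler allocates registers around the asm, which would be consistent with the problem only surfacing after a compiler upgrade.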

martin-frbg commented 5 years ago

Original problem should be fixed now, but I suspect a similar issue with the trsm kernels for Bulldozer and the dtrsm_kernel_RN_haswell.c. (Not corrected yet as I wonder about the correct handling of %8 and %9 in the latter, and %6 and %7 in the Bulldozer kernels)

martin-frbg commented 5 years ago

I believe all bugs of this type should be fixed now (on the develop branch that is planned to become 0.3.6 in about two weeks).

susilehtola commented 5 years ago

Awesome. I've pulled in the patches #2010, #2018, #2019, #2021, #2023, and #2024; hopefully 0.3.5 builds now with gcc 9.

martin-frbg commented 5 years ago

It did in my tests on Kaby Lake (Haswell kernels). I am building the most recent snapshot of gcc 9 on a Ryzen 2700 right now to confirm correct behaviour with -march=znver1 (the environment of #2018).

susilehtola commented 5 years ago

Built and tested fine at my end.

opoplawski commented 5 years ago

I think we still have some issues - koschei reports that a number of packages including octave and arpack are now failing on x86_64 with these patches applied:

https://apps.fedoraproject.org/koschei/affected-by/openblas-devel?epoch1=0&version1=0.3.5&release1=1.fc30&epoch2=0&version2=0.3.5&release2=3.fc30&collection=f30

octave:

  liboctave/array/CMatrix.cc-tst .............................. PASS     10/11  
                                                                  FAIL    1
  liboctave/array/CSparse.cc-tst .............................. PASS     10/10  
  liboctave/array/Sparse.cc-tst ............................... PASS    107/107 
BUILDSTDERR:   liboctave/array/dMatrix.cc-tst ..............................fatal: caught signal Segmentation fault -- stopping myself...
BUILDSTDERR: /bin/sh: line 1: 13322 Segmentation fault      (core dumped) /bin/sh ../run-octave --norc --silent --no-history /builddir/build/BUILD/octave-4.4.1/test/fntests.m /builddir/build/BUILD/octave-4.4.1/test

martin-frbg commented 5 years ago

Are you sure that the failures are actually related to OpenBLAS? I cannot reproduce the arpack one at least (first on your list). Can you get a backtrace from the octave and/or numpy segfault? Edit: I cannot reproduce the Octave test failures either. What was the hardware environment for your Fedora builds?

QuLogic commented 5 years ago

Some hardware info can be found from the hw_info.log in the builds (e.g., this one for octave):

CPU info:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              6
On-line CPU(s) list: 0-5
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           6
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               60
Model name:          Intel Core Processor (Haswell, no TSX, IBRS)
Stepping:            1
CPU MHz:             2299.998
BogoMIPS:            4599.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-5
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti ibrs ibpb fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat

(All of arpack, octave and numpy builds are the same, except for total memory, which I didn't bother copying.)

QuLogic commented 5 years ago

@susilehtola looks like we should backport #2028 as well (not sure if it'll fix everything, since it's Piledriver, not Haswell).

martin-frbg commented 5 years ago

My tests were done with Kaby Lake (Haswell) and Ryzen 2700 (Zen, which for the most part is Haswell), so your build failures are likely the result of a change in some other dependency. (Or you are missing some other PR on top of 0.3.5; #2028 is not going to affect anything, as it appears that the file in question is not even used for the AMD Piledriver target itself.) If you still think that OpenBLAS is to blame, please provide more information. Unfortunately, there are far too few developers on this project to expect us to trawl through your build logs or install any and all packages just to see if and where they break.

martin-frbg commented 5 years ago

If you are actually using 0.3.5 with only the patches from #2010 onwards, you are missing #1965 to #1967 (the patches for issue #1964 mentioned above). I believe the other changes since the 0.3.5 release on December 31 are unlikely to have serious impact on Haswell (most do not affect x86_64 at all). 0.3.6 is still planned to be released next weekend - unless serious regressions get reported against the develop branch in the meantime.

QuLogic commented 5 years ago

Unfortunately, the trouble with getting a backtrace is that it's a bit of a heisenbug. I was able to reproduce the same issue from the builder locally, but whenever I installed the debug symbols, it stopped crashing in gdb. Using a different failing package, I was able to get valgrind to point to dscal_k_HASWELL (dscal.c:226).

However, today I've now figured out how to build against openblas master, and it appears this crash is now fixed. I think you are correct that there are simply not enough patches backported (e.g., dscal appears changed by #1966). I will try out the other failing packages with this to see if there are any other real issues.

martin-frbg commented 5 years ago

Wait - is it master or develop that you built against? master is stuck at 0.2.20; releases since then have been created from the 0.3.0 branch (which gets updated from develop about every two months).

QuLogic commented 5 years ago

Sorry, I meant develop; specifically, fd34820b99bd302ed2b31ca0e5fedeb492a179c7.

opoplawski commented 5 years ago

@susilehtola would you be opposed to building the head of develop in Fedora rawhide? That should give some useful testing here I think.

susilehtola commented 5 years ago

@opoplawski I've included #1965-#1967 in the latest build. Originally, I just wanted to patch out the stuff that isn't working, since there might be unstable stuff in the develop branch.

If @martin-frbg agrees, I (or @opoplawski) can switch to using the development branch until 0.3.6 is released.

martin-frbg commented 5 years ago

There is not supposed to be anything unstable in the develop branch right now (except perhaps the AVX512 DGEMM, but that is no recent regression). My not yet merged #2026 poses a small risk as it re-enables a multithreaded codepath that had been disabled in an earlier search for the source of a loss of precision.

susilehtola commented 5 years ago

@opoplawski I don't really have time to work on this; you're free to bump openblas to the develop branch.

opoplawski commented 5 years ago

Looks like octave is building fine now with the latest openblas in Rawhide. I did prepare an updated package with the latest git, but I think I'll hold off on that for now.