Looks like yet another instance of #1964: input operands being declared as read-only despite getting (ab)used as loop counters in the assembly.
The original problem should be fixed now, but I suspect a similar issue with the trsm kernels for Bulldozer and with dtrsm_kernel_RN_haswell.c. (Not corrected yet, as I am unsure about the correct handling of %8 and %9 in the latter, and of %6 and %7 in the Bulldozer kernels.)
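To illustrate the pattern for anyone following along, here is a minimal, hypothetical sketch (not actual OpenBLAS code, assuming GCC extended inline asm on x86_64): a pointer and a loop counter are passed to the asm block as read-only inputs ("r") but are modified inside it, which is undefined behaviour that GCC 9's more aggressive register reuse happens to expose. The referenced patches essentially amount to declaring such operands as read-write ("+r") instead.

```c
/* Minimal sketch of the constraint bug discussed above -- not OpenBLAS code.
 * Compile with: gcc -O2 example.c */
#include <stdio.h>

/* BROKEN: x and n are declared as read-only inputs ("r") but the asm
 * modifies both, so GCC is free to assume they still hold their original
 * values afterwards -- undefined behaviour that newer GCC can miscompile. */
static double dsum_broken(const double *x, long n)
{
    double s = 0.0;
    __asm__ __volatile__ (
        "1: addsd (%[x]), %[s]  \n\t"
        "   addq  $8, %[x]      \n\t"   /* clobbers an "input" register */
        "   decq  %[n]          \n\t"   /* clobbers another one         */
        "   jnz   1b            \n\t"
        : [s] "+x" (s)
        : [x] "r" (x), [n] "r" (n)
        : "cc", "memory");
    return s;
}

/* FIXED: the clobbered operands are moved to the output list as
 * read-write ("+r"), which is the spirit of the fixes in the PRs above. */
static double dsum_fixed(const double *x, long n)
{
    double s = 0.0;
    __asm__ __volatile__ (
        "1: addsd (%[x]), %[s]  \n\t"
        "   addq  $8, %[x]      \n\t"
        "   decq  %[n]          \n\t"
        "   jnz   1b            \n\t"
        : [s] "+x" (s), [x] "+r" (x), [n] "+r" (n)
        :
        : "cc", "memory");
    return s;
}

int main(void)
{
    double v[4] = {1.0, 2.0, 3.0, 4.0};
    printf("%f %f\n", dsum_broken(v, 4), dsum_fixed(v, 4));
    return 0;
}
```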
I believe all bugs of this type should be fixed now (on the develop branch that is planned to become 0.3.6 in about two weeks).
Awesome. I've pulled in the patches #2010, #2018, #2019, #2021, #2023, and #2024; hopefully 0.3.5 builds now with gcc 9.
It did in my tests on Kaby Lake (Haswell kernels); I am building the most recent snapshot of gcc 9 on a Ryzen 2700 right now to confirm correct behaviour with -march=znver1 (the environment of #2018).
Built and tested fine at my end.
I think we still have some issues - koschei reports that a number of packages, including octave and arpack, are now failing on x86_64 with these patches applied:
octave:
liboctave/array/CMatrix.cc-tst .............................. PASS 10/11
FAIL 1
liboctave/array/CSparse.cc-tst .............................. PASS 10/10
liboctave/array/Sparse.cc-tst ............................... PASS 107/107
BUILDSTDERR: liboctave/array/dMatrix.cc-tst ..............................fatal: caught signal Segmentation fault -- stopping myself...
BUILDSTDERR: /bin/sh: line 1: 13322 Segmentation fault (core dumped) /bin/sh ../run-octave --norc --silent --no-history /builddir/build/BUILD/octave-4.4.1/test/fntests.m /builddir/build/BUILD/octave-4.4.1/test
Are you sure that the failures are actually related to OpenBLAS? I cannot reproduce the arpack one, at least (first on your list). Can you get a backtrace from the octave and/or numpy segfault? Edit: I cannot reproduce the Octave test failures either; what was the hardware environment for your Fedora builds?
Some hardware info can be found in the hw_info.log in the builds (e.g., this one for octave):
CPU info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 6
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel Core Processor (Haswell, no TSX, IBRS)
Stepping: 1
CPU MHz: 2299.998
BogoMIPS: 4599.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0-5
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti ibrs ibpb fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat
(The hardware info is the same for the arpack, octave, and numpy builds, except for total memory, which I didn't bother copying.)
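As a side note for pinning down the environment: a small check like the following (a sketch, assuming OpenBLAS's cblas.h with the openblas_get_config()/openblas_get_corename() extensions is available) prints which kernel set the DYNAMIC_ARCH dispatcher selected at runtime, e.g. to confirm that these KVM guests really do run the Haswell kernels.

```c
/* Hypothetical helper to confirm which OpenBLAS kernel set is selected.
 * Build with: gcc which_core.c -lopenblas */
#include <stdio.h>
#include <cblas.h>   /* OpenBLAS's cblas.h declares the openblas_* extensions */

int main(void)
{
    printf("config:   %s\n", openblas_get_config());    /* version and build options */
    printf("corename: %s\n", openblas_get_corename());  /* e.g. "Haswell" */
    return 0;
}
```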
@susilehtola looks like we should backport #2028 as well (not sure if it'll fix everything, since it targets Piledriver, not Haswell).
My tests were done with Kaby Lake (Haswell) and Ryzen 2700 (Zen, which for the most part uses the Haswell kernels), so your build failures are likely the result of a change in some other dependency. (Or you are missing some other PR on top of 0.3.5 - #2028 is not going to affect anything, as it appears that the file in question is not even used for the AMD Piledriver target itself.) If you still think that OpenBLAS is to blame, please provide more information - unfortunately, there are far too few developers on this project to expect us to trawl through your build logs or install any and all packages just to see if and where they break.
If you are actually using 0.3.5 with only the patches from #2010 onwards, you are missing #1965 to #1967 (the patches for issue #1964 mentioned above). I believe the other changes since the 0.3.5 release on December 31 are unlikely to have a serious impact on Haswell (most do not affect x86_64 at all). 0.3.6 is still planned to be released next weekend, unless serious regressions get reported against the develop branch in the meantime.
Unfortunately, getting a backtrace is tricky because it's a bit of a heisenbug. I was able to reproduce the same issue from the builder locally, but whenever I installed the debug symbols, it stopped crashing in gdb. Using a different failing package, I was able to get valgrind to point to dscal_k_HASWELL (dscal.c:226).
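In case it helps anyone else chasing this, a minimal standalone exerciser for dscal that can be run under valgrind might look roughly like the sketch below (assuming OpenBLAS with CBLAS headers is installed; the vector length is arbitrary and not taken from the failing packages).

```c
/* Rough standalone exerciser for dscal, intended to be run under valgrind:
 *   gcc -O2 dscal_check.c -lopenblas && valgrind ./a.out
 * An odd length is used so the scalar tail of the kernel is also exercised. */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
    const int n = 1000003;               /* deliberately not a multiple of the SIMD width */
    double *x = malloc((size_t)n * sizeof(double));
    if (!x) return 1;

    for (int i = 0; i < n; i++)
        x[i] = 1.0;

    cblas_dscal(n, 0.5, x, 1);           /* x := 0.5 * x */

    /* Spot-check the result so the call cannot be optimised away. */
    printf("x[0]=%g x[n-1]=%g\n", x[0], x[n - 1]);
    free(x);
    return 0;
}
```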
However, today I figured out how to build against openblas master, and it appears this crash is now fixed. I think you are correct that there are simply not enough patches backported (e.g., dscal appears to have been changed by #1966). I will try out the other failing packages with this build to see if there are any other real issues.
Wait - is it master or develop that you built against? master is stuck at 0.2.20; releases since then have been created from the 0.3.0 branch (which gets updated from develop about every two months).
Sorry, I meant develop - specifically, fd34820b99bd302ed2b31ca0e5fedeb492a179c7.
@susilehtola would you be opposed to building the head of develop in Fedora rawhide? That should give some useful testing here I think.
@opoplawski I've included #1965-#1967 in the latest build. Originally, I just wanted to patch out the stuff that isn't working, since there might be unstable stuff in the develop branch.
If @martin-frbg agrees, I (or @opoplawski) can switch to using the development branch until 0.3.6 is released.
There is not supposed to be anything unstable in the develop branch right now (except perhaps the AVX512 DGEMM, but that is no recent regression). My not yet merged #2026 poses a small risk as it re-enables a multithreaded codepath that had been disabled in an earlier search for the source of a loss of precision.
@opoplawski I don't really have time to work on this; you're free to bump openblas to the develop branch.
Looks like octave is building fine now with the latest openblas in Rawhide. I did prepare an updated package with the latest git, but I think I'll hold off on that for now.
Rebuilding OpenBLAS 0.3.5 on Fedora rawhide (to-be Fedora 30) results in a segfault in the tests
The compiler is
and the Rblas version of the library is compiled with
Full build log at https://kojipkgs.fedoraproject.org//work/tasks/5620/32755620/build.log