OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License
6.38k stars 1.5k forks source link

Test failure with -fbounds-check #2657

Closed orlitzky closed 4 years ago

orlitzky commented 4 years ago

The -fbounds-check or -fcheck=bounds flags to gfortran

Enable generation of run-time checks for array subscripts and against the declared minimum and maximum values. It also checks array indices for assumed and deferred shape arrays against the actual allocated bounds and ensures that all string lengths are equal for character array constructors without an explicit typespec.

When I build OpenBLAS with FFLAGS="-fbounds-check" FCFLAGS="-fbounds-check", I get a reproducible test failure (on amd64):

TEST 27/27 kernel_regress:skx_avx At line 211 of file dgesvd.f
Fortran runtime error: Actual string length is shorter than the declared one for dummy argument 'jobvt' (-4618395555133915136/1)
make[1]: *** [Makefile:45: run_test] Error 2
make[1]: Leaving directory '/var/tmp/portage/sci-libs/openblas-0.3.9/work/openblas-0.3.9/utest'
make: *** [Makefile:126: tests] Error 2

I'm not sure if it's a problem with the test suite or the file itself. A user reported this to us at https://bugs.gentoo.org/726474

martin-frbg commented 4 years ago

Not sure I understand this (unless it is a side effect of tne general Fortran/C interfacing problem from #2154), in particular why would it flag jobvt but not the identically declared jobu here ? FWIW valgrind does not complain. Which version of gfortran and what other build options do you use ? (One possibility could be that your FFLAGS override Makefile defaults, e.g. dropping the -fno-optimize-sibling-calls used to work around the interface issue)

martin-frbg commented 4 years ago

FFLAGS were overridden as intended by adding required flags in Makefile.system, but the make.inc produced for the lapack-netlib (Reference-LAPACK) subtree got these without the buildtime override. So LAPACK got built with your "-fbounds-check" alone (dropping both the mandatory -frecursive and the -fno-optimize-sibling-calls). With that fixed however I now get a similar - and similarly spurious - bounds check failure for a zpotrs call in test 33 (after similar calls to potrf in the same file completed without issues), which I can only get around by adding the hidden argument for the size of the character argument. Clearly this requires changing the signature of zpotrs in common_interface.h , which would break the established API. (Which is why #2154 is still open, as Reference-LAPACK has not adopted a solution so far and GCC has backtracked on requiring hidden length arguments for character(1))

orlitzky commented 4 years ago

Did I mess up my testing somehow? If I understand you correctly, you're saying there are two different problems:

  1. By passing FFLAGS to the build, the important -frecursive and -fno-optimize-sibling-calls flags are accidentally clobbered in some places.
  2. There's another unrelated test failure due to the bounds checking.

While testing -- to ensure that it was really the bounds-checking flag that was causing the problem -- I ran the same exact command with and without the bounds-checking flags. This command (inspired by the user's report) failed before https://github.com/gentoo/gentoo/commit/57e2be19d17640729372235c4c73171e069d2cca:

root # FFLAGS="-O2 -pipe -march=native -fstack-protector-strong -fstack-check -fbounds-check" FCFLAGS="-O2 -pipe -march=native -fstack-protector-strong -fstack-check -fbounds-check" USE=test FEATURES=test emerge -v1 =sci-libs/openblas-0.3.9

After that commit, which only filters out the -fbounds-check flag, the same command works. If passing any FFLAGS clobbered the -frecursive and -fno-optimize-sibling-calls flags, wouldn't they still get clobbered in my example above? After filtering the bounds-checking flag, we still wind up with FFLAGS="-O2 -pipe -march=native -fstack-protector-strong -fstack-check".

martin-frbg commented 4 years ago

Yes. For (1) the problem lies with the way the Reference-LAPACK source is docked into the OpenBLAS build system - we still utilize their approach of creating a definitions file (make.inc) with compiler settings and its FFLAGS get overriden by yours. The effects of missing -frecursive and -fno-optimize-sibling-calls will not be immediately obvious, but both can lead to wrong results or strange crashes. For item (2) I am now playing with a new #ifdef HIDDENARGS that adds a set of alternate definitions for the LAPACK functions where the additional length arguments are appended. Defining this in the affected tests should solve the-fbounds-check` problem while user code would still be able to use the conventional interface (that only gfortran 8 & 9 took issue with)

orlitzky commented 4 years ago

Thanks for clarifying. What's bothering me is, why does https://github.com/gentoo/gentoo/commit/57e2be19d17640729372235c4c73171e069d2cca appear to fix the problem? All I've done is remove the -fbounds-check flag from the user's FFLAGS. That still leaves all the other FFLAGS, so I would expect -frecursive and -fno-optimize-sibling-calls to still be overridden. I can't say for sure that that's not the case (I'm not 100% sure which lines to look at in the build log), but the build failures do disappear after stripping only -fbounds-check from FFLAGS="-O2 -pipe -march=native -fstack-protector-strong -fstack-check -fbounds-check".

martin-frbg commented 4 years ago

It fixes the problem as the only visible problem leading to your build failure was in the bounds check. You would notice the missing -frecursive only in some specific LAPACK functions when called in user code, and the missing -fno-optimize-sibling-calls could "only" lead to rare and hard to reproduce crashes (stack corruptions) in user code when compiling with gfortran version 8.x or 9.x The fix for the unwanted override would be to change the "FFLAGS=" to "override FFLAGS=" in the "lapack_prebuild" section of the toplevel Makefile in OpenBLAS, or to add the two gfortran options to your own FFLAGS. (Proper fix for the bounds-check fault - instead of just stripping the option - is harder than I hoped as I cannot find a solution that does not affect most of the upstream LAPACKE code)

orlitzky commented 4 years ago

Thanks again. I took your advice and added "override" to that line. I also forced a rebuild of OpenBLAS for all of our users, in case any had compiled it with FFLAGS set in the past.

martin-frbg commented 4 years ago

Added an entry to the FAQ in the wiki

thesamesam commented 3 years ago

Note that this seems to have been done in https://github.com/xianyi/OpenBLAS/commit/2d33e12a119f0cf97e5c41ff4f6499e9229d9bd5 for https://github.com/xianyi/OpenBLAS/pull/3390.

Thanks!