OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Segfault with large NUM_THREADS #2839

Closed Enchufa2 closed 4 years ago

Enchufa2 commented 4 years ago

In Fedora, we set NUM_THREADS=128 for the openmp and threaded versions (see spec file for reference; cc @susilehtola). Recently, we switched to openblas-openmp as the system-wide default BLAS/LAPACK implementation. Then, we found out that a test in the octave-statistics package (canoncorr.m) is segfaulting (octave was previously using openblas-serial), and we have managed to narrow down the issue to this point so far. Here's a reproducible example with the current master branch:

$ docker run --rm -it fedora:rawhide
$ dnf install -y octave-statistics make git perl-devel
$ CMD='octave -H -q --no-window-system --no-site-file --eval pkg("load","statistics");test("/usr/share/octave/packages/statistics-1.4.1/canoncorr.m");'
$ git clone https://github.com/xianyi/OpenBLAS && cd OpenBLAS
$ make USE_THREAD=1 USE_OPENMP=1 NUM_THREADS=128
$ LD_PRELOAD=$PWD/libopenblas.so.0 $CMD
Segmentation fault (core dumped)

but

$ make clean
$ make USE_THREAD=1 USE_OPENMP=1 NUM_THREADS=64
$ LD_PRELOAD=$PWD/libopenblas.so.0 $CMD
PASSES 7 out of 7 tests

Any idea what could be happening here?

martin-frbg commented 4 years ago

The LD_PRELOAD reminds me of #1936 (and #2030), stacksize requirements of OpenBLAS "somehow" interfering with later thread creation in the program.

Enchufa2 commented 4 years ago

In this case, it segfaults too without LD_PRELOAD. It was a matter of simplifying the example.

martin-frbg commented 4 years ago

Even less of an idea then; I would need to see with gdb where it blows up. Is this with 0.3.10 or develop?

Enchufa2 commented 4 years ago

Both. We detected this with the current release, but in the example above I checked out the current develop branch.

Enchufa2 commented 4 years ago

I almost forgot... We collected this:

Thread 1 "octave-cli-5.2." received signal SIGSEGV, Segmentation fault.
0x00007ffff0f1eb19 in dsyrk_thread_UT (args=0x7fffffff8db0, range_m=0x0, range_n=0x0, sa=0x7fffb4d31000, sb=0x7fffb4e31000, mypos=0) at level3_syrk_threaded.c:508
508     int CNAME(blas_arg_t *args, BLASLONG *range_m, BLASLONG *range_n, FLOAT *sa, FLOAT *sb, BLASLONG mypos){
(gdb) bt
#0  0x00007ffff0f1eb19 in dsyrk_thread_UT (args=0x7fffffff8db0, range_m=0x0, range_n=0x0, sa=0x7fffb4d31000, sb=0x7fffb4e31000, mypos=0) at level3_syrk_threaded.c:508
#1  0x00007ffff0e2152f in dsyrk_ (UPLO=<optimized out>, TRANS=<optimized out>, N=<optimized out>, K=<optimized out>, alpha=<optimized out>, a=<optimized out>, ldA=0x7fffffff8ef0, beta=0x7fffffff8f20, c=0x555555f1faa0,
     ldC=0x7fffffff8ed8) at syrk.c:370
#2  0x00007ffff67ac436 in _Z5xgemmRK6MatrixS1_15blas_trans_typeS2_ (a=..., b=..., transa=blas_trans, transb=blas_no_trans) at liboctave/array/Array.h:582
#3  0x00007ffff75336ee in oct_binop_trans_mul (a1=..., a2=...) at libinterp/octave-value/ov-re-mat.cc:145
#4  0x00007ffff781ab44 in do_binary_op (ti=..., op=octave_value::op_herm_mul, v1=..., v2=...) at libinterp/octave-value/ov.cc:2407

(gdb) list
503     #endif
504
505       return 0;
506     }
507
508     int CNAME(blas_arg_t *args, BLASLONG *range_m, BLASLONG *range_n, FLOAT *sa, FLOAT *sb, BLASLONG mypos){
509
510       blas_arg_t newarg;
511
512     #ifndef USE_ALLOC_HEAP
(gdb) up
#1  0x00007ffff0e2152f in dsyrk_ (UPLO=<optimized out>, TRANS=<optimized out>, N=<optimized out>, K=<optimized out>, alpha=<optimized out>, a=<optimized out>, ldA=0x7fffffff8ef0, beta=0x7fffffff8f20, c=0x555555f1faa0,
     ldC=0x7fffffff8ed8) at syrk.c:370
370         (syrk[4 | (uplo << 1) | trans ])(&args, NULL, NULL, sa, sb, 0);
(gdb) up
#2  0x00007ffff67ac436 in _Z5xgemmRK6MatrixS1_15blas_trans_typeS2_ (a=..., b=..., transa=blas_trans, transb=blas_no_trans) at liboctave/array/Array.h:582
582       const T * data (void) const { return slice_data; }
(gdb) print a
$1 = (const class Matrix &) @0x7fffffff9020: {<NDArray> = {<MArray<double>> = {<Array<double>> = {_vptr.Array = 0x7ffff7f95750 <vtable for Matrix+16>, dimensions = {rep = 0x555555f04180}, rep = 0x555555f45dc0,
         slice_data = 0x555555f07200, slice_len = 20}, <No data fields>}, <No data fields>}, <No data fields>}
(gdb) print b
$2 = (const class Matrix &) @0x7fffffff8ff0: {<NDArray> = {<MArray<double>> = {<Array<double>> = {_vptr.Array = 0x7ffff7f95750 <vtable for Matrix+16>, dimensions = {rep = 0x555555f04180}, rep = 0x555555f45dc0,
         slice_data = 0x555555f07200, slice_len = 20}, <No data fields>}, <No data fields>}, <No data fields>}

martin-frbg commented 4 years ago

Hmm. dsyrk_thread_UT again (issue #2821). Not sure yet what that is, as nothing changed in syrk recently.

Enchufa2 commented 4 years ago

In this process of changing the system-wide default to openblas-openmp, I had to rebuild every single BLAS/LAPACK consumer in Fedora, and I noticed a couple of random segfaults in tests also in scipy, and another in clblast. I didn't dig further because they didn't trigger consistently and I had a ton of work to do, but in retrospect, they may be related to this and #2821. This segfault with octave is consistent though.

martin-frbg commented 4 years ago

Reproduced with the docker setup and built OpenBLAS with DEBUG=1, but cannot get gdb to print a meaningful backtrace (it shows ?? for anything except libjvm.so).

Enchufa2 commented 4 years ago

Mmmh... the backtrace above was collected without docker. I'm also unable to get the same inside docker.

Enchufa2 commented 4 years ago

And I'm unable outside docker too. I get "Backtrace stopped: previous frame inner to this frame (corrupt stack?)" and useless garbage. CC'ing @opoplawski, who provided the backtrace shown in my previous comment. We could use some help. Is there any octave magic we should invoke?

martin-frbg commented 4 years ago

bisecting to see if/when/why this ever used to work

martin-frbg commented 4 years ago

No luck going back as far as 0.2.18 - gcc10 is hitting several long-resolved bugs, but after working around them the symptoms stay the same and the backtrace stays unusable. Trying to use valgrind instead of gdb only results in an immediate segfault on startup. I am a bit suspicious of the role of the Java VM lib - it is the only thing that is visible in the traces, and there have been various collisions with Java stack and memory size limits in the past. The corrupted backtraces that I am getting all claim to start from "VM_Version::get_processor_features() () from /usr/lib/jvm/jre/lib/server/libjvm.so", so maybe there is an incompatibility between Java and Docker, or between Java and OpenMP.

Enchufa2 commented 4 years ago

Note that the segfault also happens with openblas-threads, not only with OpenMP (and there are no issues with blis-openmp or blis-threads). Also, the segfault triggers in Fedora 31 (Java 8, gcc 9), Fedora 32 (Java 8, gcc 10) and 33/rawhide (Java 11, gcc 10). The fact that we've seen random segfaults with clblast, which doesn't involve Java, makes me think that there is an underlying issue that Java, for some reason, makes blow up consistently.

Enchufa2 commented 4 years ago

Mystery solved. It turns out we are hitting this: https://bugzilla.redhat.com/show_bug.cgi?id=1572811. TL;DR: JVM uses SIGSEGV for stack overflow detection (in other words, JVM is giving us a hint that this is a stack overflow), and it is masking whatever happens next under gdb. So, just hit "continue" and you'll get a nice clean backtrace. :)

martin-frbg commented 4 years ago

No chance to break out of the Java-induced SIGSEGV, gdb just gets another SIGSEGV in the same location after "continue" - and so ad infinitum (or at least for significantly more than the hypothetical 128 threads that should not even get instantiated on a less well endowed machine).

martin-frbg commented 4 years ago

Seems the docker reproducer segfaults on my system with or without libopenblas preloaded, and never gets far enough to even enter level3_syrk_threaded.c under gdb with "handle 11 nostop noprint nopass"

Enchufa2 commented 4 years ago

Try this:

$ docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined fedora:32
$ dnf update -y && dnf install -y octave-statistics gdb openblas-*
$ CMD='octave -H -q --no-window-system --no-site-file --eval pkg("load","statistics");test("/usr/share/octave/packages/statistics-1.4.1/canoncorr.m");'
$ gdb --args env LD_PRELOAD=/usr/lib64/libopenblaso.so.0 $CMD
(gdb) r
...
(gdb) c
Continuing.
[New Thread 0x7fffbe122700 (LWP 1339)]
...

Thread 1 "octave-cli-5.2." received signal SIGSEGV, Segmentation fault.
0x00007ffff5ddb38d in dsyrk_thread_UT () from /usr/lib64/libopenblaso.so.0
(gdb) bt
#0  0x00007ffff5ddb38d in dsyrk_thread_UT () from /usr/lib64/libopenblaso.so.0
#1  0x00007ffff5cdc79f in dsyrk_ () from /usr/lib64/libopenblaso.so.0
#2  0x00007ffff41a677e in xgemm(Matrix const&, Matrix const&, blas_trans_type, blas_trans_type) ()
   from /usr/lib64/octave/5.2.0/liboctave.so.7
...

Enchufa2 commented 4 years ago

And if you add dnf debuginfo-install openblas-openmp at the beginning, the backtrace contains more detail, obviously.

martin-frbg commented 4 years ago

Thanks, that's better. Unfortunately it still does not explain what happens - arguments go from totally sane in interface/syrk.c to unreadable garbage in level3_syrk_threaded.c as far as gdb is concerned, and valgrind is effectively neutralised by the java segfault. Retrying the bisect in the hope that it uncovers something this time.

martin-frbg commented 4 years ago

Just managed to catch this from a NUM_THREADS=88 build now (which does not make much sense either, as the m,n,k are 2,2,10 in the caller, i.e. dsyrk_UT). Tried building with gcc's asan and ubsan, but they did not find anything (except a known problem in ctest's way of collating the openblas_utest modules).

Thread 1 "octave-cli-5.2." received signal SIGSEGV, Segmentation fault.
0x00007fffebeba154 in dsyrk_kernel_U (m=0, n=0, k=0, alpha_r=0, a=0x0, b=0x0, c=0x0, ldc=2, offset=0) at syrk_kernel.c:60
60                 FLOAT *a, FLOAT *b, FLOAT *c, BLASLONG ldc, BLASLONG offset){
(gdb) bt
#0  0x00007fffebeba154 in dsyrk_kernel_U (m=0, n=0, k=0, alpha_r=0, a=0x0, b=0x0, c=0x0, ldc=2, offset=0) at syrk_kernel.c:60
#1  0x00007fffebea8c75 in dsyrk_UT (args=0x7fffffff9910, range_m=0x0, range_n=0x0, sa=0x7ffe5d222000, sb=0x7ffe5d322000, dummy=0) at level3_syrk.c:231
#2  0x00007fffebf03b04 in dsyrk_thread_UT (args=0x7fffffff9910, range_m=0x0, range_n=0x0, sa=0x7ffe5d222000, sb=0x7ffe5d322000, mypos=0) at level3_syrk_threaded.c:532
#3  0x00007fffeb89f2ce in dsyrk_ (UPLO=0x7fffe4424fed "U", TRANS=0x7fffffff9a6c "T\177", N=0x7fffffff9a58, K=0x7fffffff9a54, alpha=0x7fffffff9ad0, a=0x60e000155600, 
    ldA=0x7fffffff9a70, beta=0x7fffffff9aa0, c=0x603000755dd0, ldC=0x7fffffff9a58) at syrk.c:370

Enchufa2 commented 4 years ago

Makes sense if another thread is overwriting them. The question is how this happens and why with this NUM_THREADS and not with lower values.

martin-frbg commented 4 years ago

It would be easier to understand if there were actually that many threads running (thinking low-probability race condition ) but as the number of threads is capped at the hardware capability the NUM_THREADS should only size the GEMM buffer here.

Enchufa2 commented 4 years ago

A race condition should trigger random failures. The consistency of the segfault suggests that threads are somehow being assigned a wrong position in the buffer, maybe? I don't know the internals of this buffer, so I'm just speculating here.

martin-frbg commented 4 years ago

The buffer "only" contains the results of individual submatrix computations, any conflict there should not lead to overwriting of function arguments on the stack (I think).

Enchufa2 commented 4 years ago

I found that there's a BLAS3_MEM_ALLOC_THRESHOLD of 160 here. Then USE_ALLOC_HEAP is used if MAX_CPU_NUMBER > BLAS3_MEM_ALLOC_THRESHOLD here. ~~This happens (because MAX_CPU_NUMBER = NUM_THREADS * 2) for NUM_THREADS = 128 and also your case, NUM_THREADS = 88, but not for NUM_THREADS = 64 (which doesn't cause a segfault).~~ Maybe this rings a bell?

martin-frbg commented 4 years ago

NUM_THREADS=85 gave no segfault but I'll try playing with that threshold just in case. Maybe this is more related to gcc10 optimizing some already fragile code and the bigger buffer from a high NUM_THREADS just changes memory layout to make the consequences more interesting.

Enchufa2 commented 4 years ago

Note that my example above is reproducible too if you switch from fedora:32 to fedora:31, which ships gcc9.

martin-frbg commented 4 years ago

Same crash after a tenfold increase of BLAS3_MEM_ALLOC_THRESHOLD. thread sanitizer found some unfriendly things to say about the sanity of level3_thread from a gemm perspective but did not complain about the code path taken by syrk. (Unfortunately it refuses to run in the java/gdb context, so this was from running the BLAS tests)

martin-frbg commented 4 years ago

Actually decreasing the threshold (e.g. to 60) seems to make the crash go away, so it could be that it was the job array itself that is/was trashing the stack. Curiouser and curiouser...

Enchufa2 commented 4 years ago

BTW, I incorrectly stated that MAX_CPU_NUMBER = NUM_THREADS * 2; it's actually MAX_CPU_NUMBER = NUM_THREADS, and USAGE.md says that NUM_BUFFERS = MAX_CPU_NUMBER * 2. But that's not true either.

martin-frbg commented 4 years ago

Too early to celebrate, this could be a Heisenbug. I can also make it work by (only) increasing NUM_THREADS even further (208)...

Enchufa2 commented 4 years ago

Actually, NUM_THREADS > 160 triggers heap allocation. Sorry for the confusion above.

Enchufa2 commented 4 years ago

So this makes sense, right? MAX_CPU_NUMBER = NUM_THREADS = 128 seems to overflow the stack. If you decrease the threshold, heap allocation is used instead; if you increase NUM_THREADS even further, heap allocation is used again. If you decrease NUM_THREADS, stack allocation is used, but there is no overflow.

Enchufa2 commented 4 years ago

Behold:

$ _JAVA_OPTIONS="-Xss2048k" LD_PRELOAD=/lib64/libopenblaso.so.0 $CMD
Picked up _JAVA_OPTIONS: -Xss2048k
Segmentation fault (core dumped)
$ _JAVA_OPTIONS="-Xss4096k" LD_PRELOAD=/lib64/libopenblaso.so.0 $CMD
Picked up _JAVA_OPTIONS: -Xss4096k
PASSES 7 out of 7 tests

martin-frbg commented 4 years ago

Ah, ok... this is where Java messes things up for us (again - there are examples in closed issues where the small default Java stack caused crashes). I had already wondered if/how/where to set Xss but did not think of an environment variable. Default stack on most(?) Linux distros is 8192k.

Enchufa2 commented 4 years ago

Default stack on most(?) Linux distros is 8192k

I think so. And default stack on most(?) Java installations is 1024k AFAIK. I suppose that BLAS3_MEM_ALLOC_THRESHOLD was calculated for the default stack on Linux, and Java is messing with that. However, I'd say that a proper fix would be to have a threshold based on the actual stack size at runtime.

martin-frbg commented 4 years ago

Might make sense to adjust behaviour according to a getrlimit call at runtime instead of relying on a fixed threshold, but I guess the whole point is to try to avoid the system call overhead. Alternatively, Octave could adjust their _JAVA_OPTIONS ?

Enchufa2 commented 4 years ago

That's possible, yes, because octave integrates Java. But I'd bet #2821 is the same issue, and scipy knows nothing about Java. So whenever Java is involved, this could happen. And the same can be said whenever a user changes the stack size.

martin-frbg commented 4 years ago

The trick is that #2821 only crashes when he loads libhdfs.so, which appears to be java-based... I have just copied our current understanding of the problem there.

Enchufa2 commented 4 years ago

Yes, that's my point: octave can ensure that the proper stack size is set, but scipy doesn't use Java at all. But then, if someone uses scipy and another Java-based library in the same script, boom. Hadoop knows nothing about OpenBLAS requirements for the stack either. So in the end, it's the user's responsibility.

martin-frbg commented 4 years ago

Could have sworn I had put it in the FAQ long ago (where few would read it anyway I guess). Can't Java just go away and die out ?

Enchufa2 commented 4 years ago

Can't Java just go away and die out?

:D Cannot agree more. But here we are, and many distributed technologies, big data tools and other scientific stuff are Java-based. ¯\_(ツ)_/¯

colinfang commented 4 years ago

Does Java modify the stack size of non-java spawned threads (the ones openblas creates)?

Enchufa2 commented 4 years ago

Yes, it does. It doesn't matter which code path spawns those threads: as soon as the JVM lives in the main process, new threads inherit those parameters.

brada4 commented 4 years ago

Try MAX_STACK_ALLOC=0; it is related to the stack "one size fits all" code and may not play nicely with a stack that has been reduced by somebody else.

Enchufa2 commented 4 years ago

I already tried, and it doesn't help.

martin-frbg commented 4 years ago

I do not quite like the idea of a runtime check (portability and all). Perhaps reduce the default THRESHOLD to a value that is safe with the Java default, and at the same time make it configurable, so that anybody who is certain not to use Java and wants every little gain in performance can restore the old behaviour (or even go beyond that with a suitable ulimit -s)? Back in the day everybody used to build the library for their specific needs, but now with binary distributions and indirect dependencies it may be better to err on the side of caution.

Enchufa2 commented 4 years ago

With an upstream-developer hat on, I fully understand. With the Fedora hat on (pun intended), OpenBLAS already detects the number of CPUs available to set a proper number of threads, so I don't see how this should be different. :)

martin-frbg commented 4 years ago

Point taken, and at least Windows gets special treatment already and getrlimit() should be sufficiently portable across anything unixoid. BTW relevant previous commit is https://github.com/xianyi/OpenBLAS/commit/32d2ca3035d76e18b2bc64c7bfbe3fad2dba234b from seven years ago, before that it used to be individual NUM_CORES-based limits in each affected source file (and incidentally the "Windows" value of the threshold is specifically for a 1MB stack).

martin-frbg commented 4 years ago

Hmm. Don't really see a clean way to go from a (compile-time) decision between allocating on either heap or stack to a run-time choice on startup (that does not lose the advantages of stack allocation and does not clutter up the stack with effectively unused allocations either). Of course what could be done easily is a runtime warning that the current stack is too small - there is commented-out (and most probably ineffective) code from xianyi's #221 in memory.c to try and raise a "soft" stack limit to the "hard" maximum so he appears to have been there before)