The LD_PRELOAD reminds me of #1936 (and #2030), stacksize requirements of OpenBLAS "somehow" interfering with later thread creation in the program.
In this case, it segfaults too without LD_PRELOAD. It was a matter of simplifying the example.
Even less of an idea then; would need to see with gdb where it blows up. Is this with 0.3.10 or develop?
Both. We detected this with the current release, but in the example above I checked out the current develop branch.
I almost forgot... We collected this:
Thread 1 "octave-cli-5.2." received signal SIGSEGV, Segmentation fault.
0x00007ffff0f1eb19 in dsyrk_thread_UT (args=0x7fffffff8db0, range_m=0x0, range_n=0x0, sa=0x7fffb4d31000, sb=0x7fffb4e31000, mypos=0) at level3_syrk_threaded.c:508
508 int CNAME(blas_arg_t *args, BLASLONG *range_m, BLASLONG *range_n, FLOAT *sa, FLOAT *sb, BLASLONG mypos){
(gdb) bt
#0 0x00007ffff0f1eb19 in dsyrk_thread_UT (args=0x7fffffff8db0, range_m=0x0, range_n=0x0, sa=0x7fffb4d31000, sb=0x7fffb4e31000, mypos=0) at level3_syrk_threaded.c:508
#1 0x00007ffff0e2152f in dsyrk_ (UPLO=<optimized out>, TRANS=<optimized out>, N=<optimized out>, K=<optimized out>, alpha=<optimized out>, a=<optimized out>, ldA=0x7fffffff8ef0, beta=0x7fffffff8f20, c=0x555555f1faa0,
ldC=0x7fffffff8ed8) at syrk.c:370
#2 0x00007ffff67ac436 in _Z5xgemmRK6MatrixS1_15blas_trans_typeS2_ (a=..., b=..., transa=blas_trans, transb=blas_no_trans) at liboctave/array/Array.h:582
#3 0x00007ffff75336ee in oct_binop_trans_mul (a1=..., a2=...) at libinterp/octave-value/ov-re-mat.cc:145
#4 0x00007ffff781ab44 in do_binary_op (ti=..., op=octave_value::op_herm_mul, v1=..., v2=...) at libinterp/octave-value/ov.cc:2407
(gdb) list
503 #endif
504
505 return 0;
506 }
507
508 int CNAME(blas_arg_t *args, BLASLONG *range_m, BLASLONG *range_n, FLOAT *sa, FLOAT *sb, BLASLONG mypos){
509
510 blas_arg_t newarg;
511
512 #ifndef USE_ALLOC_HEAP
(gdb) up
#1 0x00007ffff0e2152f in dsyrk_ (UPLO=<optimized out>, TRANS=<optimized out>, N=<optimized out>, K=<optimized out>, alpha=<optimized out>, a=<optimized out>, ldA=0x7fffffff8ef0, beta=0x7fffffff8f20, c=0x555555f1faa0,
ldC=0x7fffffff8ed8) at syrk.c:370
370 (syrk[4 | (uplo << 1) | trans ])(&args, NULL, NULL, sa, sb, 0);
(gdb) up
#2 0x00007ffff67ac436 in _Z5xgemmRK6MatrixS1_15blas_trans_typeS2_ (a=..., b=..., transa=blas_trans, transb=blas_no_trans) at liboctave/array/Array.h:582
582 const T * data (void) const { return slice_data; }
(gdb) print a
$1 = (const class Matrix &) @0x7fffffff9020: {<NDArray> = {<MArray<double>> = {<Array<double>> = {_vptr.Array = 0x7ffff7f95750 <vtable for Matrix+16>, dimensions = {rep = 0x555555f04180}, rep = 0x555555f45dc0,
slice_data = 0x555555f07200, slice_len = 20}, <No data fields>}, <No data fields>}, <No data fields>}
(gdb) print b
$2 = (const class Matrix &) @0x7fffffff8ff0: {<NDArray> = {<MArray<double>> = {<Array<double>> = {_vptr.Array = 0x7ffff7f95750 <vtable for Matrix+16>, dimensions = {rep = 0x555555f04180}, rep = 0x555555f45dc0,
slice_data = 0x555555f07200, slice_len = 20}, <No data fields>}, <No data fields>}, <No data fields>}
Hmm. dsyrk_thread_UT again (issue #2821). Not sure yet what that is, as nothing changed in syrk recently.
In the process of changing the system-wide default to openblas-openmp, I had to rebuild every single BLAS/LAPACK consumer in Fedora, and I noticed a couple of random segfaults in tests in scipy too, and another in clblast. I didn't dig further because they didn't trigger consistently and I had a ton of work to do, but in retrospect, they may be related to this and #2821. This segfault with octave is consistent, though.
Reproduced with the docker setup, built OpenBLAS with DEBUG=1 but cannot get gdb to print a meaningful backtrace (?? for anything except libjvm.so)
Mmmh... the backtrace above was collected without docker. I'm also unable to get the same one inside docker.
And unable outside docker too. I get "Backtrace stopped: previous frame inner to this frame (corrupt stack?)" and useless garbage. CC'ing @opoplawski, who provided the backtrace shown in my previous comment. We could use some help. Is there any octave magic we should invoke?
bisecting to see if/when/why this ever used to work
No luck going back as far as 0.2.18 - gcc10 is hitting several long-resolved bugs, but after working around them the symptoms stay the same and the backtrace stays unusable. Trying to use valgrind instead of gdb only results in an immediate segfault on startup. I am a bit suspicious of the role of the Java VM lib - it is the only thing that is visible in the traces, and there have been various collisions with Java stack and memory size limits in the past. The corrupted backtraces that I am getting all claim to start from "VM_Version::get_processor_features() () from /usr/lib/jvm/jre/lib/server/libjvm.so", so maybe there is already an incompatibility between Java and Docker, or Java and OpenMP.
Note that the segfault also happens with openblas-threads, not only with OpenMP (and there are no issues with blis-openmp or blis-threads). Also, the segfault triggers in Fedora 31 (Java 8, gcc 9), Fedora 32 (Java 8, gcc 10) and 33/rawhide (Java 11, gcc 10). The fact that we've seen random segfaults with clblast, which doesn't involve Java, makes me think that maybe there is an underlying issue that Java, for some reason, triggers consistently.
Mystery solved. It turns out we are hitting this: https://bugzilla.redhat.com/show_bug.cgi?id=1572811. TL;DR: JVM uses SIGSEGV for stack overflow detection (in other words, JVM is giving us a hint that this is a stack overflow), and it is masking whatever happens next under gdb. So, just hit "continue" and you'll get a nice clean backtrace. :)
No chance to break out of the Java-induced SIGSEGV, gdb just gets another SIGSEGV in the same location after "continue" - and so ad infinitum (or at least for significantly more than the hypothetical 128 threads that should not even get instantiated on a less well endowed machine).
Seems the docker reproducer segfaults on my system with or without libopenblas preloaded, and never gets far enough to even enter level3_syrk_threaded.c under gdb with "handle 11 nostop noprint nopass"
Try this:
$ docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined fedora:32
$ dnf update -y && dnf install -y octave-statistics gdb openblas-*
$ CMD='octave -H -q --no-window-system --no-site-file --eval pkg("load","statistics");test("/usr/share/octave/packages/statistics-1.4.1/canoncorr.m");'
$ gdb --args env LD_PRELOAD=/usr/lib64/libopenblaso.so.0 $CMD
(gdb) r
...
(gdb) c
Continuing.
[New Thread 0x7fffbe122700 (LWP 1339)]
...
Thread 1 "octave-cli-5.2." received signal SIGSEGV, Segmentation fault.
0x00007ffff5ddb38d in dsyrk_thread_UT () from /usr/lib64/libopenblaso.so.0
(gdb) bt
#0 0x00007ffff5ddb38d in dsyrk_thread_UT () from /usr/lib64/libopenblaso.so.0
#1 0x00007ffff5cdc79f in dsyrk_ () from /usr/lib64/libopenblaso.so.0
#2 0x00007ffff41a677e in xgemm(Matrix const&, Matrix const&, blas_trans_type, blas_trans_type) ()
from /usr/lib64/octave/5.2.0/liboctave.so.7
...
And if you add dnf debuginfo-install openblas-openmp at the beginning, the backtrace contains more detail, obviously.
Thanks, that's better. Unfortunately it still does not explain what happens - arguments go from totally sane in interface/syrk.c to unreadable garbage in level3_syrk_threaded.c as far as gdb is concerned, and valgrind is effectively neutralised by the java segfault. Retrying the bisect in the hope that it uncovers something this time.
Just managed to catch this from a NUM_THREADS=88 build now (which does not make much sense either, as the m,n,k are 2,2,10 in the caller, i.e. dsyrk_UT). Tried building with gcc's asan and ubsan but they did not find anything (except a known problem in ctest's way of collating the openblas_utest modules).
Thread 1 "octave-cli-5.2." received signal SIGSEGV, Segmentation fault.
0x00007fffebeba154 in dsyrk_kernel_U (m=0, n=0, k=0, alpha_r=0, a=0x0, b=0x0, c=0x0, ldc=2, offset=0) at syrk_kernel.c:60
60 FLOAT *a, FLOAT *b, FLOAT *c, BLASLONG ldc, BLASLONG offset){
(gdb) bt
#0 0x00007fffebeba154 in dsyrk_kernel_U (m=0, n=0, k=0, alpha_r=0, a=0x0, b=0x0, c=0x0, ldc=2, offset=0) at syrk_kernel.c:60
#1 0x00007fffebea8c75 in dsyrk_UT (args=0x7fffffff9910, range_m=0x0, range_n=0x0, sa=0x7ffe5d222000, sb=0x7ffe5d322000, dummy=0) at level3_syrk.c:231
#2 0x00007fffebf03b04 in dsyrk_thread_UT (args=0x7fffffff9910, range_m=0x0, range_n=0x0, sa=0x7ffe5d222000, sb=0x7ffe5d322000, mypos=0) at level3_syrk_threaded.c:532
#3 0x00007fffeb89f2ce in dsyrk_ (UPLO=0x7fffe4424fed "U", TRANS=0x7fffffff9a6c "T\177", N=0x7fffffff9a58, K=0x7fffffff9a54, alpha=0x7fffffff9ad0, a=0x60e000155600,
ldA=0x7fffffff9a70, beta=0x7fffffff9aa0, c=0x603000755dd0, ldC=0x7fffffff9a58) at syrk.c:370
Makes sense if another thread is overwriting them. The question is how this happens and why with this NUM_THREADS and not with lower values.
It would be easier to understand if there were actually that many threads running (thinking low-probability race condition), but as the number of threads is capped at the hardware capability, NUM_THREADS should only size the GEMM buffer here.
A race condition should trigger random failures. The consistency of the segfault suggests that threads are perhaps being assigned a wrong position in the buffer? I don't know the internals of this buffer, so I'm just speculating here.
The buffer "only" contains the results of individual submatrix computations, any conflict there should not lead to overwriting of function arguments on the stack (I think).
I found that there's a BLAS3_MEM_ALLOC_THRESHOLD of 160 here. Then USE_ALLOC_HEAP is used if MAX_CPU_NUMBER > BLAS3_MEM_ALLOC_THRESHOLD here. ~This happens (because MAX_CPU_NUMBER = NUM_THREADS * 2) for NUM_THREADS = 128 and also your case, NUM_THREADS = 88, but not for NUM_THREADS = 64 (which doesn't cause a segfault).~ Maybe this rings a bell?
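For context, here is a minimal compilable sketch of that compile-time decision (my paraphrase of the guard, with the threshold and NUM_THREADS values taken from the discussion above; the real sources may differ in detail):

/* Sketch only: not OpenBLAS code, just the heap-vs-stack decision as I read it. */
#include <stdio.h>

#define NUM_THREADS 128
#define MAX_CPU_NUMBER NUM_THREADS
#define BLAS3_MEM_ALLOC_THRESHOLD 160

#if MAX_CPU_NUMBER > BLAS3_MEM_ALLOC_THRESHOLD
#define USE_ALLOC_HEAP
#endif

int main(void) {
#ifdef USE_ALLOC_HEAP
  puts("job array would be malloc'ed (heap)");
#else
  puts("job array would be a local array on the stack");  /* the NUM_THREADS=128 case */
#endif
  return 0;
}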
NUM_THREADS=85 gave no segfault but I'll try playing with that threshold just in case. Maybe this is more related to gcc10 optimizing some already fragile code and the bigger buffer from a high NUM_THREADS just changes memory layout to make the consequences more interesting.
Note that my example above is replicable too if you switch from fedora:32 to fedora:31, which ships gcc9.
Same crash after a tenfold increase of BLAS3_MEM_ALLOC_THRESHOLD. The thread sanitizer found some unfriendly things to say about the sanity of level3_thread from a gemm perspective, but did not complain about the code path taken by syrk. (Unfortunately it refuses to run in the java/gdb context, so this was from running the BLAS tests.)
Actually decreasing the threshold (e.g. to 60) seems to make the crash go away, so it could be that it was the job array itself that is/was trashing the stack. Curiouser and curiouser...
BTW, I incorrectly stated that MAX_CPU_NUMBER = NUM_THREADS * 2, but no, it's MAX_CPU_NUMBER = NUM_THREADS, and USAGE.md says that NUM_BUFFERS = MAX_CPU_NUMBER * 2. But that's not true either.
Too early to celebrate, this could be a Heisenbug. I can also make it work by (only) increasing NUM_THREADS even further (208)...
Actually, NUM_THREADS > 160 triggers heap allocation. Sorry for the confusion above.
So this makes sense, right? MAX_CPU_NUMBER = NUM_THREADS = 128 seems to overflow the stack. If you decrease the threshold, heap allocation is used instead; if you increase NUM_THREADS even further, heap allocation is used again. If you decrease NUM_THREADS, stack allocation is used, but there is no overflow.
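A rough back-of-the-envelope fits that picture. Assuming the job array is MAX_CPU_NUMBER entries of MAX_CPU_NUMBER * CACHE_LINE_SIZE * DIVIDE_RATE BLASLONGs, with CACHE_LINE_SIZE = 8, DIVIDE_RATE = 2 and an 8-byte BLASLONG on x86_64 (all of these are my assumptions about the layout, not verified figures):

/* Hypothetical size estimate for the stack-allocated job array. */
#include <stdio.h>

#define MAX_CPU_NUMBER  128   /* NUM_THREADS chosen at build time */
#define CACHE_LINE_SIZE 8     /* assumed x86_64 value */
#define DIVIDE_RATE     2
#define SIZEOF_BLASLONG 8     /* 64-bit long */

int main(void) {
  size_t per_job = (size_t)MAX_CPU_NUMBER * CACHE_LINE_SIZE * DIVIDE_RATE * SIZEOF_BLASLONG;
  size_t total   = per_job * MAX_CPU_NUMBER;   /* one entry per potential thread */
  printf("job array: %zu bytes (~%zu MiB)\n", total, total >> 20);
  /* ~2 MiB for NUM_THREADS=128: bigger than a 1-2 MiB JVM thread stack,
   * but comfortably below the usual 8 MiB Linux default. */
  return 0;
}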
Behold:
$ _JAVA_OPTIONS="-Xss2048k" LD_PRELOAD=/lib64/libopenblaso.so.0 $CMD
Picked up _JAVA_OPTIONS: -Xss2048k
Segmentation fault (core dumped)
$ _JAVA_OPTIONS="-Xss4096k" LD_PRELOAD=/lib64/libopenblaso.so.0 $CMD
Picked up _JAVA_OPTIONS: -Xss4096k
PASSES 7 out of 7 tests
Ah, ok... this is where Java messes things up for us (again - there are examples in closed issues where the small default java stack caused crashes). I had already wondered if/how/where to set Xss but did not think of an environment variable. Default stack on most(?) Linux distros is 8192k
Default stack on most(?) Linux distros is 8192k
I think so. And default stack on most(?) Java installations is 1024k AFAIK. I suppose that BLAS3_MEM_ALLOC_THRESHOLD was calculated for the default stack on Linux, and Java is messing with that. However, I'd say that a proper fix would be to have a threshold based on the actual stack size at runtime.
Might make sense to adjust behaviour according to a getrlimit call at runtime instead of relying on a fixed threshold, but I guess the whole point is to try to avoid the system call overhead. Alternatively, Octave could adjust their _JAVA_OPTIONS?
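To illustrate that idea only (a hypothetical sketch, not anything OpenBLAS currently does; the headroom factor and the 2 MiB figure are assumptions):

/* Hypothetical runtime check: only use the stack for the job array if the
 * current stack limit leaves enough headroom. */
#include <stdio.h>
#include <sys/resource.h>

static int stack_alloc_is_safe(size_t needed_bytes) {
  struct rlimit rl;
  if (getrlimit(RLIMIT_STACK, &rl) != 0) return 0;      /* be conservative */
  if (rl.rlim_cur == RLIM_INFINITY) return 1;
  return needed_bytes < rl.rlim_cur / 2;                /* leave headroom */
}

int main(void) {
  size_t job_array = 2u << 20;   /* ~2 MiB, the NUM_THREADS=128 estimate above */
  printf("stack allocation looks %s\n",
         stack_alloc_is_safe(job_array) ? "safe" : "risky");
  /* Caveat: getrlimit() reports the process limit; threads whose stacks were
   * shrunk explicitly (e.g. by a JVM via -Xss) are not reflected here. */
  return 0;
}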
That's possible, yes, because octave integrates Java. But I'd bet #2821 is the same issue, and scipy knows nothing about Java. So whenever Java is involved, this could happen. And the same can be said whenever a user changes the stack size.
The trick is that #2821 only crashes when he loads libhdfs.so, which appears to be java-based... I have just copied our current understanding of the problem there.
Yes, that's my point: octave can ensure that the proper stack size is set, but scipy doesn't use Java at all. But then, if someone uses scipy and another Java-based library in the same script, boom. Hadoop knows nothing about OpenBLAS requirements for the stack either. So in the end, it's the user's responsibility.
Could have sworn I had put it in the FAQ long ago (where few would read it anyway I guess). Can't Java just go away and die out?
Can't Java just go away and die out?
:D Cannot agree more. But here we are, and many distributed technologies, big data tools and other scientific stuff are Java-based. ¯\_(ツ)_/¯
Does Java modify the stack size of non-java spawned threads (the ones openblas creates)?
Yes, it does. It doesn't matter which code path spawns those threads: as soon as the JVM lives in the main process, new threads inherit those parameters.
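For anyone who wants to verify that, a small glibc-specific diagnostic along these lines (a hypothetical helper, not part of OpenBLAS) can be dropped into a worker thread to print the stack size it actually got; compile with -pthread:

/* Print the calling thread's actual stack size (glibc-specific). */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

static void print_my_stack_size(void) {
  pthread_attr_t attr;
  void *addr;
  size_t size;
  if (pthread_getattr_np(pthread_self(), &attr) == 0) {
    if (pthread_attr_getstack(&attr, &addr, &size) == 0)
      printf("this thread's stack: %zu KiB\n", size / 1024);
    pthread_attr_destroy(&attr);
  }
}

int main(void) {
  print_my_stack_size();   /* main thread here; call from any thread of interest */
  return 0;
}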
Try MAX_STACK_ALLOC=0; it is something around the stack "one size fits all" code, and may not play nice with a stack reduced by somebody else.
I already tried, and it doesn't help.
I do not quite like the idea of a runtime check (portability and all). Perhaps reduce the default THRESHOLD to a value that is safe with the Java default, and at the same time make it configurable so that anybody who is certain not to use Java and wants every little gain in performance can restore the old behaviour (or even go beyond that with a suitable ulimit -s)? Back in the day everybody used to build the library for their specific needs, but now with binary distributions and indirect dependencies it may be better to err on the side of caution.
With an upstream-developer hat on, I fully understand. With the Fedora hat on (pun intended), OpenBLAS already detects the number of CPUs available to set a proper number of threads, so I don't see how this should be different. :)
Point taken, and at least Windows gets special treatment already and getrlimit() should be sufficiently portable across anything unixoid. BTW relevant previous commit is https://github.com/xianyi/OpenBLAS/commit/32d2ca3035d76e18b2bc64c7bfbe3fad2dba234b from seven years ago, before that it used to be individual NUM_CORES-based limits in each affected source file (and incidentally the "Windows" value of the threshold is specifically for a 1MB stack).
Hmm. Don't really see a clean way to go from a (compile-time) decision between allocating on either heap or stack to a run-time choice on startup (one that does not lose the advantages of stack allocation and does not clutter up the stack with effectively unused allocations either). Of course, what could be done easily is a runtime warning that the current stack is too small - there is commented-out (and most probably ineffective) code from xianyi's #221 in memory.c to try and raise a "soft" stack limit to the "hard" maximum, so he appears to have been there before.
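For reference, the raise-the-soft-limit idea would look roughly like this (a sketch of the concept only, not the actual memory.c code):

/* Try to raise the soft stack limit to the hard maximum, warn if impossible. */
#include <stdio.h>
#include <sys/resource.h>

static void try_raise_stack_limit(void) {
  struct rlimit rl;
  if (getrlimit(RLIMIT_STACK, &rl) != 0) return;
  if (rl.rlim_cur != RLIM_INFINITY && rl.rlim_cur < rl.rlim_max) {
    rl.rlim_cur = rl.rlim_max;                       /* soft -> hard */
    if (setrlimit(RLIMIT_STACK, &rl) != 0)
      fprintf(stderr, "warning: stack limit is small and could not be raised\n");
  }
  /* Note: this cannot enlarge stacks of threads that were already created with
   * a small fixed size (e.g. under a JVM's -Xss). */
}

int main(void) {
  try_raise_stack_limit();
  return 0;
}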
In Fedora, we set NUM_THREADS=128 for the openmp and threaded versions (see spec file for reference; cc @susilehtola). Recently, we switched to openblas-openmp as the system-wide default BLAS/LAPACK implementation. Then, we found out that a test in the octave-statistics package (canoncorr.m) is segfaulting (octave was previously using openblas-serial), and we have managed to narrow down the issue to this point so far. Here's a reproducible example with the current master branch:
Any idea what could be happening here?