OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

make error: vfork: resource temporarily unavailable #1348

Closed timjim333 closed 6 years ago

timjim333 commented 6 years ago

Hi,

I'm trying to install OpenBLAS 0.2.20 on a node in a local directory over which I have permissions (I don't have root access). I seem to be encountering an error when attempting the build process. Trying make with or without any flags is resulting in a long string of errors that look like: make[1]: vfork: Resource temporarily unavailable

I've posted the whole output in a text file. Could anyone give any suggestions on how to troubleshoot the problem? make_error.txt

Many thanks. Tim

EDIT: In case this is useful, here is also the output of a few server parameters.

uname -or
2.6.32.54-0.3-default GNU/Linux

lsb_release -a
LSB Version: core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-x86_64:core-3.2-x86_64:core-4.0-x86_64:desktop-4.0-amd64:desktop-4.0-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0-amd64:graphics-4.0-noarch
Distributor ID: SUSE LINUX
Description: SUSE Linux Enterprise Server 11 (x86_64)
Release: 11
Codename: n/a

cat /etc/*-release
LSB_VERSION="core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-x86_64:core-3.2-x86_64:core-4.0-x86_64"
SGI Accelerate 1.3, Build 705r10.sles11-1110192111
SGI Foundation Software 2.5, Build 705r10.sles11-1110192111
SGI MPI 1.3, Build 705r10.sles11-1110192111
SGI Performance Suite 1.3, Build 705r10.sles11-1110192111
SGI UPC 1.3, Build 705r10.sles11-1110192111
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 1

lscpu
Architecture: x86_64
CPU(s): 64
Thread(s) per core: 1
Core(s) per socket: 8
CPU socket(s): 8
NUMA node(s): 8
Vendor ID: GenuineIntel
CPU family: 6
Model: 46
Stepping: 6
CPU MHz: 2266.424
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 24576K

timjim333 commented 6 years ago

Is this what you mean? I called run quite a few times but it did not crash. Should I keep calling run? gdb_test3.txt

martin-frbg commented 6 years ago

Can you try with spaces where you put the ampersands, to make sure the THREADS setting ends up in the same environment that sblat1 sees? That is: OPENBLAS_NUM_THREADS=2 gdb ./sblat1

brada4 commented 6 years ago

Something differs between the test running on its own and running under GDB, but I don't understand what.

martin-frbg commented 6 years ago

Maybe there is a memory problem somewhere in the BIGNUMA code, but at the very least I would expect it to get as far as previous runs, i.e. to the point where the code printed the "cannot open shared memory" message and should be printing a more detailed message including a reason now. So my bet is still on the OPENBLAS_NUM_THREADS setting not getting seen by the code.

timjim333 commented 6 years ago

That did the trick and triggered the seg fault. gdb_test4.txt

timjim333 commented 6 years ago

I seem to have had a successful compile. At the suggestion of the admin, I added the NO_AFFINITY flag to the make call: make PREFIX=/home/FIa/FIa164/programs/openblas/OpenBLAS-0.2.20 FC=gfortran BIGNUMA=1 NO_AFFINITY=1. The output can be seen here: make_output.txt. This is still with the replaced init.c that you sent earlier. Has this produced the expected result, and will installing this build give me a working copy of OpenBLAS?

brada4 commented 6 years ago

Yes, you have a usable OpenBLAS, and you have pinpointed the issue: BIGNUMA=1 combined with NO_AFFINITY=0 fails while initialising thread affinity.

martin-frbg commented 6 years ago

Yes, looks like this produced a usable build (though performance may be decreased by threads occasionally getting rescheduled to a different processor). I will need to look at the code path leading to the shared memory allocation again; it seems the informational message I added to the latest init.c may be wrong, and the code will just lose cpu affinity but not multithreading when the shmget fails. Unfortunately the gdb backtrace does not show the failing call; OpenBLAS would need to be built with DEBUG=1 to add the necessary symbols for the debugger. (Apologies for not mentioning this earlier.) So far all that can be learned is that it segfaults in the routine that initiates the previously failing calls to shmget/shmat. Maybe now that these failures are handled, it staggers on a bit beyond them.

timjim333 commented 6 years ago

I see, so in the meantime I can link to this build then. Meanwhile, I'm happy to rebuild with a debug flag if that helps get to the root of the problem.


brada4 commented 6 years ago

... and once a permanent code fix is in place, just replace the library file with the new one. I.e. don't close the issue; your help with testing on a big system will be handy.

timjim333 commented 6 years ago

Right, so I will rebuild with make PREFIX=/home/FIa/FIa164/programs/openblas/OpenBLAS-0.2.20 FC=gfortran BIGNUMA=1 DEBUG=1 and attempt to run the gdb test again.

timjim333 commented 6 years ago

I built a debug build using the following and ran the gdb test. I've attached the results below. I hope it helps.

unset FC F90 F90FLAGS FFLAGS
make PREFIX=/home/FIa/FIa164/programs/openblas/TEST_OpenBLAS-0.2.20 FC=gfortran BIGNUMA=1 DEBUG=1

make_debug.txt debug_out.txt

martin-frbg commented 6 years ago

Not sure yet what to make of this - the node_info array is declared to hold MAX_NODES (128) and MAX_BITMASK_LEN members so I would not expect it to overflow in this initialization loop (unless the calculation of MAX_BITMASK_LEN went wrong much earlier - this is the number of cpus as returned by the CPU_SETSIZE macro divided by 64). Can you do

print j
print node
print MAX_BITMASK_LEN

where you currently have the t a a bt (thread apply all bt), please?

timjim333 commented 6 years ago

Here is the output. There was no MAX_BITMASK_LEN: debug_out2.txt

martin-frbg commented 6 years ago

Hmm. Seems one would need to build with a higher debug level flag to get access to #defined constants. But checking on my system, CPU_SETSIZE appears to be 4096 (although the manpage claims 1024) on all Linux platforms at the moment, and 8*sizeof(unsigned long) should be either 32 or 64, so the loop should still do fine on element 59 of 64. Guess you could try replacing the single occurrence of CPU_SETSIZE near the top of driver/others/init.c by some number that is only somewhat bigger than your actual 640 cpus, say 896, just to see if this changes anything. On the other hand I think it should be possible to trick a BIGNUMA build into running into the problematic code on my small system, so maybe I can track this down myself.

martin-frbg commented 6 years ago

Well, replacing the CPU_SETSIZE by 896 works for me here, while with CPU_SETSIZE it happens to fail at j=59 (though with a different node number) as well. My current thinking is that the struct just gets too big to fit on the stack.

martin-frbg commented 6 years ago

Turns out there appears to be (or have been) some disagreement between glibc maintainers, in particular from SuSE, about raising the value (as set in /usr/include/bits/sched.h) for CPU_SETSIZE from 1024 to 4096. (As far as I understood the discussion threads, this was to reflect increased capability of the Linux kernel when built with the "maximum NUMA nodes" option, but the criticism was that it broke the API.) The last discussion appears to have taken place here, and it contains some hints for correct usage of sched_getaffinity() on big systems that may be relevant for OpenBLAS: https://sourceware.org/ml/libc-alpha/2016-03/msg00043.html At the very least it is conceivable that wernsaar's (experimental) code for BIGNUMA support was never tested, nor expected to work, beyond the "traditional" value of 1024. In my absolutely unscientific tests, 2048 appeared to still work (although I currently do not own a laptop capable of traversing all affected code paths :-) ), so my suggestion is to replace the current use of CPU_SETSIZE with a constant 2048 or 1024 as a quick fix.

martin-frbg commented 6 years ago

3232 appears to be about the limit that still survives the compile tests for me, 3264 is already crashing.

timjim333 commented 6 years ago

@martin-frbg in init.c, can I confirm, you mean that I should try replacing the below block:

#if defined(BIGNUMA)
// max number of nodes as defined in numa.h
// max cpus as defined in sched.h
#define MAX_NODES   128
#define MAX_CPUS    CPU_SETSIZE
#else
#define MAX_NODES   16
#define MAX_CPUS    256
#endif

with:

#if defined(BIGNUMA)
// max number of nodes as defined in numa.h
// max cpus as defined in sched.h
#define MAX_NODES   128
#define MAX_CPUS    1024
#else
#define MAX_NODES   16
#define MAX_CPUS    256
#endif

then make clean, recompile and run the gdb test, is that correct?

martin-frbg commented 6 years ago

Yes, exactly. If it works, you should already see the build pass all tests.

timjim333 commented 6 years ago

@martin-frbg Is there an alternative debug level I should set, or should I go with make PREFIX=/home/FIa/FIa164/programs/openblas/TEST_OpenBLAS-0.2.20 FC=gfortran BIGNUMA=1 DEBUG=1 again?

brada4 commented 6 years ago

You can add NO_LAPACK=1 NO_CBLAS=1. That will not give you a complete, usable library, just the bare minimum to get to the failing test as fast as possible. Be sure to run 'make clean' between tries.

martin-frbg commented 6 years ago

I'd just do a normal build, assuming that with such a big system build time should not be an issue.

timjim333 commented 6 years ago

Alright, so sticking with DEBUG=1 then.

brada4 commented 6 years ago

In principle 'ar' is invoked after BLAS, then CBLAS, then LAPACK, then LAPACKE. Rebuilding the last three is not necessary for quickly reproducing the failure.

martin-frbg commented 6 years ago

@timjim333, did you get around to testing this?

timjim333 commented 6 years ago

Yes, sorry for the delay. It appears to build without errors: debug_out3.txt

martin-frbg commented 6 years ago

Great, thanks for testing. I'll prepare the corresponding PR to fix this in the develop branch later.