OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

make error: vfork: resource temporarily unavailable #1348

Closed timjim333 closed 6 years ago

timjim333 commented 6 years ago

Hi,

I'm trying to install OpenBLAS 0.2.20 on a node in a local directory over which I have permissions (I don't have root access). I seem to be encountering an error when attempting the build process. Trying make with or without any flags is resulting in a long string of errors that look like: make[1]: vfork: Resource temporarily unavailable

I've posted the whole output in a text file. Could anyone give any suggestions on how to troubleshoot the problem? make_error.txt

Many thanks. Tim

EDIT: In case this is useful, here is also the output of a few server parameters.

uname -or
2.6.32.54-0.3-default GNU/Linux

lsb_release -a
LSB Version: core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-x86_64:core-3.2-x86_64:core-4.0-x86_64:desktop-4.0-amd64:desktop-4.0-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0-amd64:graphics-4.0-noarch
Distributor ID: SUSE LINUX
Description: SUSE Linux Enterprise Server 11 (x86_64)
Release: 11
Codename: n/a

cat /etc/*-release
LSB_VERSION="core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-x86_64:core-3.2-x86_64:core-4.0-x86_64"
SGI Accelerate 1.3, Build 705r10.sles11-1110192111
SGI Foundation Software 2.5, Build 705r10.sles11-1110192111
SGI MPI 1.3, Build 705r10.sles11-1110192111
SGI Performance Suite 1.3, Build 705r10.sles11-1110192111
SGI UPC 1.3, Build 705r10.sles11-1110192111
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 1

lscpu
Architecture: x86_64
CPU(s): 64
Thread(s) per core: 1
Core(s) per socket: 8
CPU socket(s): 8
NUMA node(s): 8
Vendor ID: GenuineIntel
CPU family: 6
Model: 46
Stepping: 6
CPU MHz: 2266.424
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 24576K

martin-frbg commented 6 years ago

Looks like you are running out of resources - be it available memory or number of concurrent processes. Try reducing the number of subprocesses started by make - either set MAKE_NB_JOBS to a low number (it defaults to the number of cores detected, which may be inappropriate for your virtual system), or run with NO_PARALLEL_MAKE=1 (see Makefile.rule for available make options)
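The suggestion above can be sketched roughly as follows. MAKE_NB_JOBS is the option named in the comment; the cap of 4 jobs and the variable names `cores` and `njobs` are assumptions made purely for illustration. The `make` line is only echoed so the snippet can be tried outside an OpenBLAS source tree.

```shell
# Show the per-user process limit that vfork may be running into.
echo "max user processes: $(ulimit -u 2>/dev/null || echo unknown)"

# Count online cores; fall back to 1 if getconf cannot tell.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Cap the job count at 4 (an arbitrary, illustrative limit).
njobs=$(( cores < 4 ? cores : 4 ))

# Print the command instead of running it; drop the echo inside an OpenBLAS tree.
echo make MAKE_NB_JOBS="$njobs"
```

On a shared login node, a small fixed job count is often a safer starting point than the detected core count, since per-user process limits may be far below what the hardware suggests.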

timjim333 commented 6 years ago

Thanks for the comment. I tried setting the NO_PARALLEL_MAKE option in Makefile.rule but I seem to have stumbled into another problem. I've attached the last few lines below. I'm fairly new to compiling these programs so I may have missed something obvious!

I had also set the prefix options and the gfortran thread safe options.

make[2]: Leaving directory `/home/FIa/FIa164/programs/openblas/OpenBLAS-0.2.20_src/lapack/trtri'
make[1]: Leaving directory `/home/FIa/FIa164/programs/openblas/OpenBLAS-0.2.20_src/lapack'
make[1]: Entering directory `/home/FIa/FIa164/programs/openblas/OpenBLAS-0.2.20_src/lapack-netlib'
( cd INSTALL; make )
make[2]: Entering directory `/home/FIa/FIa164/programs/openblas/OpenBLAS-0.2.20_src/lapack-netlib/INSTALL'
gfortran -openmp -fp-model precise -I/ap/vni/imsl/fnl700/lnxsg111e64/include -O2 -frecursive -Wall -m64 -fPIC -c -o lsame.o lsame.f
gfortran: precise: No such file or directory
f951: error: unrecognized command line option "-fp-model"
make[2]: *** [lsame.o] Error 1
make[2]: Leaving directory `/home/FIa/FIa164/programs/openblas/OpenBLAS-0.2.20_src/lapack-netlib/INSTALL'
make[1]: *** [lapack_install] Error 2
make[1]: Leaving directory `/home/FIa/FIa164/programs/openblas/OpenBLAS-0.2.20_src/lapack-netlib'
make: *** [netlib] Error 2

brada4 commented 6 years ago

Can you share the custom "gfortran thread safe" parameters you are using? -fp-model is an Intel compiler flag.

Does compilation finish with a simple 'make clean ; make MAKE_NB_JOBS=1'?

martin-frbg commented 6 years ago

Your fortran compiler does not understand the option "-fp-model precise" (which it probably picked up from some FLAGS variable in your environment - the option appears to be specific to the Intel compiler while you are using gfortran)

timjim333 commented 6 years ago

Thanks for the thoughts. This was the Makefile.rule I was using. Makefile.rule.txt

I will try again with 'make clean ; make MAKE_NB_JOBS=1'

timjim333 commented 6 years ago

It got a little closer! Using the above (only adding a prefix and MAKE_NB_JOBS=1), it seems that it has defaulted to using ifort, which seems to understand the '-fp-model precise' flag but has the following issue. The last few lines are posted below. Did it fail attempting to compile a fortran code?

...
ifort -openmp -fp-model precise -I/ap/vni/imsl/fnl700/lnxsg111e64/include -O2 -fPIC -c ssytrd_sb2st.F -o ssytrd_sb2st.o
ssytrd_sb2st.F(486): error #5082: Syntax error, found IDENTIFIER 'DEPEND' when expecting one of: UNTIED PRIVATE FIRSTPRIVATE SHARED IF DEFAULT ;
!$OMP TASK DEPEND(in:WORK(MYID+SHIFT-1))
-----------^
ssytrd_sb2st.F(487): error #5082: Syntax error, found IDENTIFIER 'DEPEND' when expecting one of: ( % . = =>
!$OMP$ DEPEND(in:WORK(MYID-1))
-----------^
ssytrd_sb2st.F(488): error #5082: Syntax error, found IDENTIFIER 'DEPEND' when expecting one of: ( ) :: , ; + . - % (/ [ : ] /) . ' / ...
!$OMP$ DEPEND(out:WORK(MYID))
-----------^
ssytrd_sb2st.F(488): error #5082: Syntax error, found END-OF-STATEMENT when expecting one of: :
!$OMP$ DEPEND(out:WORK(MYID))
---------------------------------^
ssytrd_sb2st.F(497): error #5082: Syntax error, found IDENTIFIER 'DEPEND' when expecting one of: UNTIED PRIVATE FIRSTPRIVATE SHARED IF DEFAULT ;
!$OMP TASK DEPEND(in:WORK(MYID+SHIFT-1))
-----------^
ssytrd_sb2st.F(498): error #5082: Syntax error, found IDENTIFIER 'DEPEND' when expecting one of: ( % . = =>
!$OMP$ DEPEND(out:WORK(MYID))
-----------^
ssytrd_sb2st.F(497): error #6404: This name does not have a type, and must have an explicit type. [DEPEND]
!$OMP TASK DEPEND(in:WORK(MYID+SHIFT-1))
-----------^
ssytrd_sb2st.F(497): error #6514: A substring must be of type CHARACTER. [DEPEND]
!$OMP TASK DEPEND(in:WORK(MYID+SHIFT-1))
-----------^
ssytrd_sb2st.F(497): error #6404: This name does not have a type, and must have an explicit type. [IN]
!$OMP TASK DEPEND(in:WORK(MYID+SHIFT-1))
------------------^
ssytrd_sb2st.F(498): error #6514: A substring must be of type CHARACTER. [DEPEND]
!$OMP$ DEPEND(out:WORK(MYID))
-----------^
ssytrd_sb2st.F(498): error #6404: This name does not have a type, and must have an explicit type. [OUT]
!$OMP$ DEPEND(out:WORK(MYID))
------------------^
compilation aborted for ssytrd_sb2st.F (code 1)
make[2]: *** [ssytrd_sb2st.o] Error 1
make[2]: Leaving directory `/home/FIa/FIa164/programs/openblas/OpenBLAS-0.2.20_src/lapack-netlib/SRC'
make[1]: *** [lapacklib] Error 2
make[1]: Leaving directory `/home/FIa/FIa164/programs/openblas/OpenBLAS-0.2.20_src/lapack-netlib'
make: *** [netlib] Error 2

brada4 commented 6 years ago

You can try recording output from compilation with "script"
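"script" records a whole terminal session to a file. A rough alternative, which is a swapped-in technique and not what was suggested above, is to pipe both stdout and stderr through tee. In this sketch `fake_build` is a hypothetical stand-in for `make`, so the snippet runs anywhere.

```shell
# Hypothetical stand-in for "make": writes one line to stdout, one to stderr.
fake_build() {
  echo "compiling lsame.f"
  echo "error: unrecognized option" >&2
}

log=$(mktemp)

# 2>&1 merges stderr into stdout so tee captures both in the log file.
fake_build 2>&1 | tee "$log"

echo "log saved to $log"
```

For a real build the pipeline would look like `make 2>&1 | tee make_output.txt`, which gives a single file with compiler errors in the order they appeared.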

brada4 commented 6 years ago

Can you run plain "make" in a freshly cloned/unzipped OpenBLAS 0.2.20 or the git develop branch? You are inheriting some *_FLAGS from unrelated products, probably built next door. I dare not guess what vni/imsl/fnl700/lnxsg111e64 could be.

martin-frbg commented 6 years ago

Which version of ifort is that ? Apparently it does not cope with some of the OPENMP directives, which would suggest a rather old version of the Intel compiler. (Having what appears to be the include file path of the IMSL library in the compiler options probably does not hurt, but is not useful either - you may want to check with whoever administers this system which compiler and options to use there)

timjim333 commented 6 years ago

Hi, I had a chat with one of the admins and they suggested switching to a more modern node - you were correct, the previous was an old one! I tried a fresh build using make PREFIX=/home/FIa/FIa164/programs/openblas/OpenBLAS-0.2.20. (I assume adding the prefix is ok? Since I don't have permissions for the other directories).

Please see attached for the output script. make_output.txt

There still seems to be some kind of issue with the compile? Thanks for the help.

If you need the info, here are the new node parameters:

uname -or
3.0.101-108.13-default GNU/Linux

lsb_release -a
LSB Version: core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-x86_64:core-3.2-x86_64:core-4.0-x86_64:desktop-4.0-amd64:desktop-4.0-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0-amd64:graphics-4.0-noarch
Distributor ID: SUSE LINUX
Description: SUSE Linux Enterprise Server 11 (x86_64)
Release: 11
Codename: n/a

lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 640
On-line CPU(s) list: 0-639
Thread(s) per core: 1
Core(s) per socket: 10
Socket(s): 64
NUMA node(s): 64
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Stepping: 4
CPU MHz: 2399.982
BogoMIPS: 4802.71
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9
... and a lot of NUMA nodes here ...

brada4 commented 6 years ago

Compiler problems cannot be solved without knowing the versions, and you will probably need to consult the Intel support website if you want to link their modules with gcc ones.

I wonder if this is relevant? https://github.com/xianyi/OpenBLAS/wiki/Precompiled-installation-packages#opensusesle

Can you post the entry for the last CPU core from /proc/cpuinfo? It has all the sibling numbers and, most notably, lets us identify the CPU generation you have.

martin-frbg commented 6 years ago

The Intel fortran compiler on that node still seems unable to handle OpenMP4.0 directives. Do you want to use it anyway, or is it just picked as first choice because of something in your environment ? (You could try with make FC=gfortran after a make clean, assuming that the GNU fortran compiler is installed on that node as well)
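One way to check whether a given Fortran compiler accepts the OpenMP 4.0 DEPEND clause is a tiny probe file, sketched below. The file name `probe.f90`, the variable names, and the choice of `-fopenmp` (the gfortran spelling; Intel compilers use a different flag) are assumptions for illustration, not part of the OpenBLAS build.

```shell
fc=${FC:-gfortran}            # compiler to probe; gfortran is an assumed default
tmp=$(mktemp -d)

# Tiny program using the same kind of TASK DEPEND directive that ifort rejected.
cat > "$tmp/probe.f90" <<'EOF'
program probe
  integer :: x
  x = 0
  !$omp parallel
  !$omp single
  !$omp task depend(out: x)
  x = 1
  !$omp end task
  !$omp end single
  !$omp end parallel
  print *, x
end program probe
EOF

if ! command -v "$fc" >/dev/null 2>&1; then
  probe_result="compiler not found"
elif "$fc" -fopenmp -c "$tmp/probe.f90" -o "$tmp/probe.o" 2>"$tmp/probe.log"; then
  probe_result="DEPEND accepted"
else
  probe_result="DEPEND rejected"
fi

echo "$fc: $probe_result"
rm -rf "$tmp"
```

Running this once per candidate compiler (for example with `FC=gfortran` set in the environment) gives a quick answer before starting a long build.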

brada4 commented 6 years ago

Actually "make clean" is absent in all build logs....

brada4 commented 6 years ago

What are the compiler versions?

gcc --version ; gfortran --version

btw LSB does not define any compiler standard interface, no need to post it at all.

timjim333 commented 6 years ago

Hi, thanks for the posts above.

@martin-frbg I didn't have a specific requirement to use ifort as such - I'm just not aware of the difference. Does the choice of compiler actually affect the compiled program in any way? (I'm assuming performance or compatibility?) Essentially, I want to build BLAS and LAPACK for use with a CFD solver (SU2). I'll try with make FC=gfortran and see if that makes a difference!

@brada4 I had rm -r'd the directory and unzipped a fresh copy before starting script! I'm not sure which info is relevant, so I've attached a copy of the cpuinfo file here: cpuinfo.txt

Regarding versions:
gcc (SUSE Linux) 4.3.4 [gcc-4_3-branch revision 152973]
GNU Fortran (SUSE Linux) 4.3.4 [gcc-4_3-branch revision 152973]

timjim333 commented 6 years ago

Here is the output using gfortran (make PREFIX=/home/FIa/FIa164/programs/openblas/OpenBLAS-0.2.20 FC=gfortran): make_output.txt

It seems it is still not recognising some of the options. Do I have to set up the server in any other way? Thanks.

martin-frbg commented 6 years ago

Now you are calling GNU gfortran with the options for the Intel compiler, which it does not recognize. Please try to find out where these are coming from (probably a FFLAGS setting in your shell). Regarding the choice of compiler - both would normally be expected to work, and some may say that the Intel compiler still generates faster code. Your installation of the Intel compiler however appears to be either ancient or incomplete, making it unable to handle some statements in the code.
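A quick way to see which compiler-related settings a shell has inherited is sketched below. The list of variable names and the startup file paths are assumptions (common conventions, not something OpenBLAS defines), and `list_build_vars` is a hypothetical helper name.

```shell
# List exported variables that commonly leak compiler settings into a build.
list_build_vars() {
  env | grep -E '^(CC|CXX|FC|F77|F90|CFLAGS|CXXFLAGS|FFLAGS|F90FLAGS|LDFLAGS)=' || true
}

list_build_vars

# Search common shell startup files for where FFLAGS is set (paths are assumptions).
grep -n 'FFLAGS' ~/.bashrc ~/.bash_profile ~/.profile /etc/profile 2>/dev/null || true
```

On managed clusters such variables are often set by site-wide profile scripts or environment modules, so an empty result from the second command does not rule out an inherited setting.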

timjim333 commented 6 years ago

After a quick google, FFLAGS are Fortran flags, is that correct? I did a quick echo $FFLAGS and it returned the following:
-openmp -axAVX -fp-model source -I/ap/vni/imsl/fnl701/lnxsg121e64/include

I don't know if they are necessary for anything on the node, but I suppose that these are the ones causing issues? How should I remove them temporarily? Is it safe to do so?

martin-frbg commented 6 years ago

unset FFLAGS should do the trick. (You will need them only when you want to build something with the Intel fortran compiler, and depending on where and when these defaults were added to your user environment they may not even be appropriate for the current compute node)
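To remove the variable only temporarily, the unset can be confined to a subshell, as in the sketch below. The FFLAGS value is shortened from the one reported above, and `inside` is just an illustrative variable name.

```shell
# Simulate the inherited setting reported above.
export FFLAGS="-openmp -axAVX -fp-model source"

# Parentheses (here via command substitution) start a subshell:
# the unset applies only inside it.
inside=$( unset FFLAGS; echo "${FFLAGS-<unset>}" )
echo "inside subshell: $inside"

# The parent shell still has the original value.
echo "after subshell:  $FFLAGS"
```

For the build itself the same pattern would be `( unset FFLAGS; make FC=gfortran )`, which leaves the login environment untouched for anything else that relies on those defaults.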

timjim333 commented 6 years ago

I see, so they could be remnants of an admin's setup? After running unset FFLAGS, should I try the default FC or go with gfortran again? Thanks @martin-frbg

martin-frbg commented 6 years ago

Go with gfortran (or try to get an admin to look into why your intel compiler installation does not handle openmp4 directives - as far as I can find out, these should be supported by ifort since the 2015 version. Maybe yours is just too old - your gfortran is quite old as well)

brada4 commented 6 years ago

Can you attach what "set" outputs so we can help you tidy it up? The problem is caused by the OpenBLAS build system using all sorts of CFLAGS etc. temporarily internally, and these do not get along well with values set ahead of time. btw OpenBLAS builds (or at least used to recently) on RHEL5 with even older gcc.

timjim333 commented 6 years ago

Hello, after a chat with the admins, I'll try to build the latest gfortran for this (and other builds I need) before trying again.

I don't know if this will still be an issue, but I tried building with a slightly newer version of gcc packaged with openfoam but received an error regarding max_nodes (NUMA) is too small? I've attached the output here: make.txt

Also here is the result of set: set.txt

Thanks for the install support.

brada4 commented 6 years ago

You must use the BIGNUMA=1 build flag for >256 CPU cores (see Makefile.rule). For variables:

$ unset FC F90 F90FLAGS
$ make clean ; make BIGNUMA=1
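The rule of thumb above can be turned into a small check. The threshold of 256 is the one quoted in the comment; the variable names `cpus` and `extra` are illustrative, and the `make` line is only echoed so the snippet runs outside a source tree.

```shell
# Count online CPUs; fall back to 1 if getconf cannot tell.
cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# 256 is the threshold quoted in the comment above.
if [ "$cpus" -gt 256 ]; then
  extra="BIGNUMA=1"
else
  extra=""
fi

echo "online CPUs: $cpus"
# Print the command instead of running it.
echo make $extra
```

On the 640-CPU node described earlier this would print `make BIGNUMA=1`.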

timjim333 commented 6 years ago

I just built gfortran 7.2.0 in my local home directory and attempted to run the make process again, specifically running:

make clean
unset FC F90 F90FLAGS FFLAGS
make PREFIX=/home/FIa/FIa164/programs/openblas/OpenBLAS-0.2.20 FC=gfortran BIGNUMA=1

It now seems to reach the test stage before failing with GotoBLAS : Can't open shared memory. Terminated. The full output is attached here: make.txt And to confirm versions: versions.txt

martin-frbg commented 6 years ago

Strange coincidence, as I have just committed a change that will show a more detailed error message when shared memory allocation fails. The background for that was #1351 where use of SELinux apparently blocked access to shared memory segments, though in your case it happens even earlier when OpenBLAS tries to allocate a shared memory segment of size 32768. Perhaps this information gives your local sysadmin a clue already? There is nothing in the manpage for shmget() that suggests we are trying something naughty, though for smaller systems without BIGNUMA=1 set the requested size is only 4096 here. Unfortunately I have no personal experience with your class of hardware.

martin-frbg commented 6 years ago

From #1351, you should be able to get around this by running setenforce 0 if it happens to be SELinux-related, provided you have sufficient rights to use the setenforce command. Interventions by this kind of security-hardening software will probably also be logged somewhere on the system.

timjim333 commented 6 years ago

@martin-frbg it seems like the setenforce command does not exist on my distro - at least I was presented with: bash: setenforce: command not found. I think the supercomputer is running SUSE. Should I try to run a single core compile with NUM_THREADS=1? Or will this not help?

martin-frbg commented 6 years ago

You would probably need to be root to use setenforce, but it can also be that SELinux is not installed. A single core compile would certainly work but it depends on your use case if a single-threaded OpenBLAS library is suitable for you - if you want to use it from a program that already creates multiple threads for its calculations it would probably be fine even in terms of performance. If everything is singlethreaded, you would obviously lose most of the advantages of the multicore system, but at least you would be able to run programs that make use of BLAS or LAPACK functions (and still get some speedup compared to the netlib reference implementation).

timjim333 commented 6 years ago

Do I have many troubleshooting options left for a multi threaded install?


martin-frbg commented 6 years ago

Should be fine as long as the trouble does not start to shoot back (though with that big supercomputer, you may be venturing where few have gone before). You'd need to get the current development snapshot of OpenBLAS from github to get the better error messages (use the green "Clone or Download" button to the right of the table heading on the "Code" tab here to either follow the instructions for a "git clone" call or to download a zip archive). And if possible talk to some local admin about what could cause accesses to shared memory to fail (maybe you just need to get appropriate user permissions on that supercomputer)

timjim333 commented 6 years ago

Hi, I have spoken to the admin and they said they will investigate. In the meantime, I downloaded a new copy of develop and attempted a fresh install. I have attached a copy of the output in the following file. It appears to finish compiling but fails on the tests with a 'Segmentation fault' - does the new install present any new information? Many thanks. make_output.txt

martin-frbg commented 6 years ago

Strange that it segfaults now without printing the cause of the shmget error, maybe I did something wrong in that recent change after all. (Could be the last error number (errno) is already undefined when I try to print it, as the shmget is retried in a loop.)

martin-frbg commented 6 years ago

If you have the ipcs command available, you could run ipcs -l to see what if any shared memory resources are available to you.
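Besides `ipcs -l`, the kernel-wide System V shared memory limits can be read directly on Linux, as sketched below. The /proc paths are Linux-specific, and `show_shm_limits` is a hypothetical helper name.

```shell
# Print kernel-wide System V shared memory limits (Linux-specific paths).
show_shm_limits() {
  for f in shmmax shmall shmmni; do
    p=/proc/sys/kernel/$f
    if [ -r "$p" ]; then
      echo "$f = $(cat "$p")"
    else
      echo "$f = (not available)"
    fi
  done
}

show_shm_limits

# Segments that currently exist, if ipcs is installed.
if command -v ipcs >/dev/null 2>&1; then
  ipcs -m || true
fi
```

These are system-wide values; per-user restrictions imposed by security software or a batch system would not show up here.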

brada4 commented 6 years ago

You can run your test program within gdb (or the admin can do it). As a minimum, show the output of:

gdb ./testprog
gdb> run
Signal crash whatever
gdb> t a a bt
(this is the interesting thing, might be long for 1000 threads)
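The same session can be run non-interactively, which makes the output easy to attach. In the sketch below `t a a bt` is written out as `thread apply all bt`; the binary name `./testprog` is the placeholder from the comment above and the output file name `gdb_bt.txt` is an assumption.

```shell
prog=./testprog   # placeholder binary name; substitute the test that crashes

# Non-interactive gdb: run the program, then dump backtraces of all threads.
gdb_cmd="gdb -batch -ex run -ex 'thread apply all bt' $prog"
echo "$gdb_cmd"

# Only execute when both gdb and the binary are actually present.
if command -v gdb >/dev/null 2>&1 && [ -x "$prog" ]; then
  gdb -batch -ex run -ex 'thread apply all bt' "$prog" > gdb_bt.txt 2>&1
  echo "backtrace written to gdb_bt.txt"
fi
```

If the program exits normally instead of crashing, the backtrace step has nothing to report, so the run may need repeating.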

martin-frbg commented 6 years ago

@brada4 I think he (still) does not get far enough to have more than one thread - the fundamental problem is still the same (unable to obtain shared memory, hence unable to set up two threads in the tests), only now there is a segfault without any error message instead of the more verbose message I had committed for this case. The only reason for this (I think) is that errno has become undefined by the time the code reaches the perror() call and it is the perror printout itself now that is segfaulting due to this.

brada4 commented 6 years ago

There should be some grain in bt-s

martin-frbg commented 6 years ago

Methinks driver/others/init.c should be updated to this here (rename back from .txt to .c) init.txt but I am having trouble updating my fork for a regular PR at the moment. @timjim333 perhaps you can just replace the file and recompile to see if this brings some more useful info ?

timjim333 commented 6 years ago

I'll give the recompile a go. Meanwhile, it seems that the server has ipcs, here is the output:

FIa164@afispb07:~/programs/openblas/OpenBLAS-0.2.20_src$ ipcs -l

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509481983
max total shared memory (kbytes) = 4611686018427386880
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 1024
max semaphores per array = 250
max semaphores system wide = 256000
max ops per semop call = 32
semaphore max value = 32767

------ Messages Limits --------
max queues system wide = 32768
max size of message (bytes) = 65536
default max size of queue (bytes) = 65536

martin-frbg commented 6 years ago

Nothing obvious from that, looks similar to my laptop except perhaps for the memory size ...

timjim333 commented 6 years ago

@brada4 apologies, I'm not sure I follow regarding your post on gdb. I could start gdb but I'm not sure which testprog file to test in particular - what would be sufficient? Is signal crash a ctrl-c? Thanks.

timjim333 commented 6 years ago

And here is the output from a recompile with the new init.c (I ran make clean before this). make_output.txt

brada4 commented 6 years ago

Should be gdb ./sblat1 in tests/

timjim333 commented 6 years ago

I attempted to run the test but not a lot came out of it: gdb_test.txt

Was this the expected output?

martin-frbg commented 6 years ago

Can you try OPENBLAS_NUM_THREADS=2 gdb ./sblat1 please ?

timjim333 commented 6 years ago

Please see attached. Not much seems to have changed! gdb_test2.txt

martin-frbg commented 6 years ago

Nothing could have changed, please put that on one line

brada4 commented 6 years ago

You need to repeat "r" maybe multiple times until it crashes with same message as during build.

timjim333 commented 6 years ago

@brada4 Sorry, I'm not sure what you mean? Should I enter r after calling run?

martin-frbg commented 6 years ago

Seems he wanted you to repeat the run (from within gdb) as it did not crash the first time. My suspicion is that you need to have the OPENBLAS_NUM_THREADS=2 and gdb ./sblat1 on the same command line to ensure the test actually runs with two threads (but that depends on the shell you use; in bash, which is usually the default on SuSE, your approach is fine).
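How an environment variable reaches a child process such as gdb depends on how it is set. The sketch below shows three cases in a POSIX-style shell, using `sh -c` as a stand-in for the child; what was actually typed in the attached session is not shown in this thread, so this only illustrates the general behaviour.

```shell
unset OPENBLAS_NUM_THREADS

# 1. Assignment on the same command line: passed to that one command only.
same_line=$(OPENBLAS_NUM_THREADS=2 sh -c 'echo "${OPENBLAS_NUM_THREADS-unset}"')

# 2. Plain assignment on its own line: a shell variable, not passed to children.
OPENBLAS_NUM_THREADS=2
separate=$(sh -c 'echo "${OPENBLAS_NUM_THREADS-unset}"')

# 3. After export, every later child process sees it.
export OPENBLAS_NUM_THREADS
exported=$(sh -c 'echo "${OPENBLAS_NUM_THREADS-unset}"')

echo "same line: $same_line, separate: $separate, exported: $exported"
# → same line: 2, separate: unset, exported: 2
```

Either the one-line form `OPENBLAS_NUM_THREADS=2 gdb ./sblat1` or an `export` beforehand should therefore make the setting visible to the test program.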