bh107 / bohrium

Automatic parallelization of Python/NumPy, C, and C++ codes on Linux and MacOSX
http://www.bh107.org
Apache License 2.0

Bohrium not usable via OpenMPI with missing fork support #600

Open dionhaefner opened 5 years ago

dionhaefner commented 5 years ago

I tried running Bohrium on multiple nodes on the cluster, but it crashes with

pclose(): No such file or directory
pclose() failed.
terminate called after throwing an instance of 'std::runtime_error'
  what():  Compiler: pclose() failed
[node170:28446] *** Process received signal ***
[node170:28446] Signal: Aborted (6)
[node170:28446] Signal code:  (-6)
[node170:28446] [ 0] /lib64/libpthread.so.0[0x323a00f7e0]
[node170:28446] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x32398324f5]
[node170:28446] [ 2] /lib64/libc.so.6(abort+0x175)[0x3239833cd5]
[node170:28446] [ 3] /groups/ocean/software/clBLAS/gcc/2.12/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x2b10eb9fc5ad]
[node170:28446] [ 4] /groups/ocean/software/clBLAS/gcc/2.12/lib64/libstdc++.so.6(+0x8c636)[0x2b10eb9fa636]
[node170:28446] [ 5] /groups/ocean/software/clBLAS/gcc/2.12/lib64/libstdc++.so.6(+0x8c681)[0x2b10eb9fa681]
[node170:28446] [ 6] /groups/ocean/software/clBLAS/gcc/2.12/lib64/libstdc++.so.6(+0x8c898)[0x2b10eb9fa898]
[node170:28446] [ 7] /groups/ocean/software/bohrium/gcc/05102018/lib64/libbh.so(_ZNK7bohrium4jitk8Compiler7compileENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPKcm+0x20e)[0x2aaab2bdb0be]
[node170:28446] [ 8] /groups/ocean/software/bohrium/gcc/05102018/lib64/libbh_ve_openmp.so(_ZN7bohrium12EngineOpenMP11getFunctionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_+0x5f6)[0x2aaabef0e0f6]
[node170:28446] [ 9] /groups/ocean/software/bohrium/gcc/05102018/lib64/libbh_ve_openmp.so(_ZN7bohrium12EngineOpenMP7executeERKNS_4jitk11SymbolTableERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmRKSt6vectorIPK14bh_instructionSaISG_EE+0x187)[0x2aaabef0e597]
[node170:28446] [10] /groups/ocean/software/bohrium/gcc/05102018/lib64/libbh.so(_ZN7bohrium4jitk9EngineCPU15handleExecutionEP4BhIR+0x1a3c)[0x2aaab2c1ddbc]
[node170:28446] [11] /groups/ocean/software/bohrium/gcc/05102018/lib64/libbh_ve_openmp.so(+0x1b672)[0x2aaabef19672]
[node170:28446] [12] /groups/ocean/software/bohrium/gcc/05102018/lib64/libbh_vem_node.so(+0x43c6)[0x2aaabecfa3c6]
[node170:28446] [13] /groups/ocean/software/bohrium/gcc/05102018/lib64/libbhxx.so(+0x46f69)[0x2aaab2e92f69]
[node170:28446] [14] /groups/ocean/software/bohrium/gcc/05102018/lib64/libbhxx.so(_ZN4bhxx7Runtime5flushEv+0x37)[0x2aaab2e931c7]
[node170:28446] [15] /groups/ocean/software/bohrium/gcc/05102018/lib64/python2.7/site-packages/bohrium/_bh.so(PyFlush+0x9)[0x2aaab2699e69]
[node170:28446] [16] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x74a6)[0x2b10aa352b56]
[node170:28446] [17] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0xacbe)[0x2b10aa35636e]
[node170:28446] [18] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xe67)[0x2b10aa357c17]
[node170:28446] [19] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0xaa2c)[0x2b10aa3560dc]
[node170:28446] [20] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0xacbe)[0x2b10aa35636e]
[node170:28446] [21] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xe67)[0x2b10aa357c17]
[node170:28446] [22] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(+0x98cff)[0x2b10aa298cff]
[node170:28446] [23] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyObject_Call+0x47)[0x2b10aa2597c7]
[node170:28446] [24] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x1d65)[0x2b10aa34d415]
[node170:28446] [25] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xe67)[0x2b10aa357c17]
[node170:28446] [26] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(+0x98cff)[0x2b10aa298cff]
[node170:28446] [27] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyObject_Call+0x47)[0x2b10aa2597c7]
[node170:28446] [28] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x1d65)[0x2b10aa34d415]
[node170:28446] [29] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xe67)[0x2b10aa357c17]
[node170:28446] *** End of error message ***

This only happens when using more than 1 node (multiple processes on the same node work fine), so it might be another filesystem issue (#598)?

I tried disabling the persistent cache, to no avail.

dionhaefner commented 5 years ago

I did some digging. The process runs fine for several hundred kernels, then the JIT call to GCC fails for no apparent reason. The kernel whose compilation fails is unremarkable and varies between runs.
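For reference, the backtrace shows the crash originating in `jitk::Compiler::compile`, which hands the generated kernel source to the system compiler through a `popen()`'d child process. Here is a rough Python sketch of that pattern; the command line, flags, and file names are illustrative, not Bohrium's actual code:

```python
import subprocess


def compile_kernel(source: str, out_path: str, cc: str = "cc") -> None:
    """Sketch of Bohrium's JIT compile step: pipe generated C source to the
    system compiler via a child process, i.e. a fork() happens for every
    kernel that misses the cache."""
    proc = subprocess.Popen(  # fork() + exec() under the hood, like popen()
        [cc, "-x", "c", "-shared", "-fPIC", "-o", out_path, "-"],
        stdin=subprocess.PIPE,
    )
    proc.communicate(source.encode())
    if proc.returncode != 0:
        # Bohrium throws std::runtime_error("Compiler: pclose() failed") here
        raise RuntimeError("Compiler: pclose() failed")
```

So every uncached kernel costs a fork, which would explain why the crash only appears after hundreds of kernels and at a different kernel each run.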

I noticed the following warning when starting the run:

--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          [[14689,0],0] (PID 32590)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------

It seems that fork() (and thus popen()/subprocess.Popen) is not supported in applications launched through OpenMPI. Could this be causing the problem?
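If the root cause is fork() support in OpenMPI's InfiniBand (openib) transport, the MCA parameter named in the warning can be set at launch. Note this only silences the warning; whether fork() actually becomes safe depends on the transport and libibverbs build (`my_script.py` is a placeholder):

```shell
# Silences the fork() warning only; does NOT by itself make fork() safe
mpirun --mca mpi_warn_on_fork 0 -np 4 python my_script.py
```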

dionhaefner commented 5 years ago

Everything works through MVAPICH2 instead of OpenMPI, so we can work with that. In the long run, OpenMPI support would be nice, though.