Open dionhaefner opened 5 years ago
I did some digging. The process runs fine for several hundred kernels, then the JIT call to GCC fails for no apparent reason. The kernel where compilation fails is unremarkable and varies between runs.
I noticed the following warning when starting the run:
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: [[14689,0],0] (PID 32590)
If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
It seems like fork (and thus subprocess::Popen) is not supported from applications running through MPI. Could this be causing the problem?
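To narrow this down outside of Bohrium, a small stand-alone reproducer might help. The sketch below is not Bohrium code: it simply spawns a child process many times and records every failure, which is roughly what the JIT does when it calls the compiler. `spawn_repeatedly` is a hypothetical helper name, `gcc --version` is only a stand-in for the real compiler call, and the optional `mpi4py` import is an assumption (importing `mpi4py.MPI` initializes MPI, so that the fork happens inside an MPI process when launched through `mpirun`).

```python
import subprocess


def spawn_repeatedly(cmd, n):
    """Run `cmd` n times as a child process.

    Returns a list of (iteration, returncode, message) for every failed run.
    returncode is None when the child could not be started at all.
    """
    failures = []
    for i in range(n):
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True)
        except OSError as exc:
            # Spawning itself failed (e.g. executable missing, fork error).
            failures.append((i, None, str(exc)))
            continue
        if proc.returncode != 0:
            failures.append((i, proc.returncode, proc.stderr))
    return failures


if __name__ == "__main__":
    try:
        # Assumption: mpi4py is installed; importing MPI calls MPI_Init.
        from mpi4py import MPI
    except ImportError:
        MPI = None

    n = 500
    cmd = ["gcc", "--version"]  # stand-in for the JIT compiler invocation
    failures = spawn_repeatedly(cmd, n)

    rank = MPI.COMM_WORLD.Get_rank() if MPI is not None else 0
    print(f"rank {rank}: {len(failures)} of {n} spawns failed")
    for i, code, msg in failures[:5]:
        print(f"  iteration {i}: returncode={code} {msg.strip()}")
```

If this fails intermittently under OpenMPI but not under MVAPICH2, that would point at fork rather than at the kernels themselves. Note that setting mpi_warn_on_fork to 0, as the warning suggests, only silences the message and should not be expected to change the behaviour.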
It works with MVAPICH2 instead of OpenMPI, so we can use that for now. In the long run, OpenMPI support would be nice, though.
I tried running Bohrium on multiple nodes on the cluster, but it crashes with
This only happens when using more than 1 node (multiple processes on the same node work fine), so it might be another filesystem issue (#598)?
I tried disabling the persistent cache, to no avail.