Open blogsdon opened 7 years ago
failed on lassoCV1se
Interestingly, this happened again with this additional error:
A system call failed during shared memory initialization that should
not have. It is likely that your MPI job will now either abort or
experience performance degradation.
Local host: ip-10-0-0-22.ec2.internal
System call: open(2)
Error: No such file or directory (errno 2)
Additionally, this was the error that led to it:
[ip-10-0-0-22.ec2.internal:70054] create_and_attach: unable to create shared memory BTL coordinating strucure :: size 67108864
[ip-10-0-0-22.ec2.internal:70018] 14 more processes have sent help message help-opal-shmem-mmap.txt / sys call fail
[ip-10-0-0-22.ec2.internal:70018] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
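For anyone reproducing this: the last log line asks for `orte_base_help_aggregate` to be set to 0. A minimal sketch of one way to do that, assuming Open MPI's usual `OMPI_MCA_` environment-variable prefix for MCA parameters. The commented `mpirun` line is only a placeholder, not the command used in this project.

```shell
# Turn off help-message aggregation so each process prints its own
# full help / error text instead of the "14 more processes ..." summary.
export OMPI_MCA_orte_base_help_aggregate=0

# Equivalent per-invocation form (placeholder command, not run here):
#   mpirun --mca orte_base_help_aggregate 0 <your command>

# Confirm what child processes will inherit.
echo "orte_base_help_aggregate=${OMPI_MCA_orte_base_help_aggregate}"
```

The environment form is convenient here because the launcher is invoked indirectly, so there may be no easy place to add a command-line flag.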
Didn't see this problem after fixing the mpiWrapper function in the metanetwork package. Hopefully that addressed the issue.
looks like it still happens
Going to try replacing mpirun with mpiexec to see if that fixes the issue, per the suggestion here that mpiexec is more robust than mpirun: http://stackoverflow.com/questions/25287981/mpiexec-vs-mpirun
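Before relying on that swap, it may be worth checking whether the two names are actually different programs on the cluster image. In Open MPI installs they are normally the same launcher under two names, in which case swapping them would not change behaviour. A rough sketch, assuming `readlink -f` is available (GNU coreutils). The helper names `resolve_cmd` and `same_launcher` are made up for this example.

```shell
# Resolve a command name to the real file behind it (following symlinks).
resolve_cmd() {
    path=$(command -v "$1") || return 1
    readlink -f "$path"
}

# Report whether two launcher names point at the same executable.
# Returns 0 if identical, 1 if different, 2 if either is missing.
same_launcher() {
    a=$(resolve_cmd "$1") || { echo "$1 not found"; return 2; }
    b=$(resolve_cmd "$2") || { echo "$2 not found"; return 2; }
    if [ "$a" = "$b" ]; then
        echo "same: $a"
    else
        echo "different: $a vs $b"
        return 1
    fi
}

# Check the two launchers discussed above; do not abort if they are absent.
same_launcher mpirun mpiexec || true
```

If this prints `same: ...`, any change in failure rate after the swap is more likely down to cluster state than to the launcher.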
Fixed in #21. Ran three full tests and it worked fine regardless of the cluster state.
still seems to be a problem
High-intensity compute jobs randomly fail. It appears to be an issue allocating the slaves. This is the error:
7228 Segmentation fault (core dumped) $R_HOME/bin/R --no-init-file --slave --no-save -f $1 > $hn.$2.$$.log 2>&1
and this is the output
Child job 2 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.