regression job randomly fails

Sage-Bionetworks / metanetworkSynapse


regression job randomly fails #18

Open blogsdon opened 7 years ago

blogsdon commented 7 years ago

High-intensity compute jobs fail at random. It appears to be an issue allocating the slaves. This is the error:

7228 Segmentation fault (core dumped) $R_HOME/bin/R --no-init-file --slave --no-save -f $1 > $hn.$2.$$.log 2>&1

and this is the output:

Child job 2 terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.

blogsdon commented 7 years ago

failed on lassoCV1se

blogsdon commented 7 years ago

Interestingly, this happened again with this additional error:

A system call failed during shared memory initialization that should not have. It is likely that your MPI job will now either abort or experience performance degradation.

Local host: ip-10-0-0-22.ec2.internal
System call: open(2)
Error: No such file or directory (errno 2)

blogsdon commented 7 years ago

Additionally, this was the underlying error that triggered it:

[ip-10-0-0-22.ec2.internal:70054] create_and_attach: unable to create shared memory BTL coordinating strucure :: size 67108864
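One thing that could be ruled out quickly is the temp directory on the failing node. The sketch below assumes (not confirmed anywhere in this thread) that Open MPI's mmap shared-memory component puts its backing file under the temp directory, so a missing, read-only, or full `${TMPDIR:-/tmp}` could produce the `open(2)` failure. `check_shm_dir` is a hypothetical helper, not part of metanetwork or Open MPI.

```shell
#!/bin/sh
# Hypothetical helper (not part of metanetworkSynapse or Open MPI).
# Usage: check_shm_dir DIR BYTES
# Succeeds only if DIR exists, is writable, and has at least BYTES free.
check_shm_dir() {
    dir=$1
    need_bytes=$2

    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "directory not writable: $dir" >&2
        return 2
    fi

    # df -Pk prints POSIX-format output; column 4 of the second line is
    # the available space in 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    need_kb=$(( (need_bytes + 1023) / 1024 ))
    if [ "$avail_kb" -lt "$need_kb" ]; then
        echo "only ${avail_kb} KiB free in $dir, need ${need_kb} KiB" >&2
        return 3
    fi
    return 0
}

# 67108864 bytes (64 MiB) is the segment size reported in the error above.
# Assumption: the temp directory is $TMPDIR, falling back to /tmp.
check_shm_dir "${TMPDIR:-/tmp}" 67108864 && echo "temp dir looks usable"
```

Running this on each node just before the job starts would at least show whether the directory state differs between runs that pass and runs that fail.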

[ip-10-0-0-22.ec2.internal:70018] 14 more processes have sent help message help-opal-shmem-mmap.txt / sys call fail

[ip-10-0-0-22.ec2.internal:70018] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
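To follow the hint in that last message, the parameter can be set either through the environment or on the launcher command line. This is a rough sketch only: the process count `4` and the script name `./job.sh` are placeholders, not the values this pipeline actually uses, and `mpi_debug_cmd` is a hypothetical helper that just prints the command so it can be inspected before running it.

```shell
#!/bin/sh
# Option 1: environment variable. Open MPI reads MCA parameters from
# variables named OMPI_MCA_<parameter>.
export OMPI_MCA_orte_base_help_aggregate=0

# Option 2: pass the parameter on the command line with --mca.
# Hypothetical helper: prints the mpirun command instead of running it.
# Usage: mpi_debug_cmd NPROCS COMMAND [ARGS...]
mpi_debug_cmd() {
    nprocs=$1
    shift
    echo "mpirun --mca orte_base_help_aggregate 0 -np $nprocs $*"
}

# Placeholder values; substitute the real process count and wrapper script.
mpi_debug_cmd 4 ./job.sh
```

With aggregation off, each of the 14 other processes should print its own copy of the help message, which may show whether they all fail on the same path.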

blogsdon commented 7 years ago

Didn't see this problem after fixing the mpiWrapper function in the metanetwork package; hopefully that addressed the issue.

blogsdon commented 7 years ago

looks like it still happens

blogsdon commented 7 years ago

Going to try replacing mpirun with mpiexec to see if that fixes the issue, per the suggestion here that mpiexec is more robust than mpirun: http://stackoverflow.com/questions/25287981/mpiexec-vs-mpirun

blogsdon commented 7 years ago

Fixed in #21. Did three full tests and it worked fine regardless of the cluster state.

blogsdon commented 7 years ago

still seems to be a problem