Yandell-Lab / maker

Genome Annotation Pipeline
http://yandell-lab.org/software/maker.html

MPI Errors when running maker 3.01.04 #22

Open hans-vg opened 1 week ago

hans-vg commented 1 week ago

Recently, I updated from v2 to v3 maker for a new annotation project. I compiled maker v3 using the same MPICH module I used previously for maker v2.

module load mpich/ge/gcc/64/3.3.2

However, now when I run maker in MPI mode, it crashes after 3-20 hours of processing.

Any suggestions on how to troubleshoot or get MPI to run would be greatly appreciated.

Thank you, -Hans

Below are some example errors:

FATAL: Thread terminated, causing all processes to fail
--> rank=69, hostname=cpu-54
[proxy:0:2@cpu-55] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:2@cpu-55] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:2@cpu-55] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
[proxy:0:0@cpu-53] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:0@cpu-53] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@cpu-53] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
srun: error: cpu-53: task 0: Exited with exit code 7
srun: error: cpu-55: task 2: Exited with exit code 7
[mpiexec@cpu-53] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:75): one of the processes terminated badly; aborting
[mpiexec@cpu-53] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:22): launcher returned error waiting for completion
[mpiexec@cpu-53] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:215): launcher returned error waiting for completion
[mpiexec@cpu-53] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
FATAL: Thread terminated, causing all processes to fail
--> rank=94, hostname=cpu-54
[proxy:0:2@cpu-55] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:2@cpu-55] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:2@cpu-55] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
[proxy:0:0@cpu-53] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:0@cpu-53] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@cpu-53] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
srun: error: cpu-55: task 2: Exited with exit code 7
srun: error: cpu-53: task 0: Exited with exit code 7
[mpiexec@cpu-53] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:75): one of the processes terminated badly; aborting
[mpiexec@cpu-53] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:22): launcher returned error waiting for completion
[mpiexec@cpu-53] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:215): launcher returned error waiting for completion
[mpiexec@cpu-53] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
deleted:1 hits
Calling FastaDB::new at /data/gpfs/assoc/inbre/projects/software_installs/maker-Version_3.01.04/bin/../lib/FastaSeq.pm line 139.
Calling out to BioPerl get_PrimarySeq_stream at /data/gpfs/assoc/inbre/projects/software_installs/maker-Version_3.01.04/bin/../lib/GI.pm line 2287.
collecting tblastx reports
flattening altEST clusters
Fatal error in PMPI_Send: Unknown error class, error stack:
PMPI_Send(159).............: MPI_Send(buf=0x555559942d30, count=4, MPI_CHAR, dest=71, tag=1111, MPI_COMM_WORLD) failed
MPID_nem_tcp_connpoll(1845): Communication error with rank 71: Connection refused

carsonhh commented 1 week ago

What you provided is the STDERR that results from the MPI manager dying. The causal error will be further back in the output.

—Carson
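
One way to act on this advice is to scan the captured STDERR for the earliest suspicious line that is not process-manager teardown noise. The following is only a rough sketch: the `TEARDOWN` and `SUSPECT` patterns are assumptions derived from the log lines quoted in this thread, and `first_suspect_lines` is a hypothetical helper, not part of MAKER or MPICH.

```python
import re

# Launcher/process-manager teardown noise, based on the lines quoted in this
# thread ([proxy:...], [mpiexec@...], srun: error:). Assumed, not exhaustive.
TEARDOWN = re.compile(r"^\[(proxy|mpiexec)[:@]|^srun: error:")
# Words that often mark an application-level failure (also an assumption).
SUSPECT = re.compile(r"\b(FATAL|ERROR|Fatal error|died|failed)\b")


def first_suspect_lines(lines, context=2, limit=3):
    """Return up to `limit` (line_number, snippet) pairs for the earliest
    suspicious lines that are not process-manager teardown messages.

    `snippet` is the matching line plus `context` lines on either side.
    """
    lines = [ln.rstrip("\n") for ln in lines]
    hits = []
    for i, line in enumerate(lines):
        # Skip teardown noise and lines without any failure keyword.
        if TEARDOWN.search(line) or not SUSPECT.search(line):
            continue
        start = max(0, i - context)
        hits.append((i + 1, lines[start:i + context + 1]))
        if len(hits) == limit:
            break
    return hits
```

Calling it on the lines of the job's full STDERR file (for example `first_suspect_lines(open(path).readlines())`, where `path` is wherever your scheduler wrote the error log) gives 1-based line numbers to start reading from. The first hit may itself be a secondary message, so it is worth reading upward from there.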
