I backtraced and see that the root process is blocked on ga_fill_patch while the rest of the processes are in ga_terminate.
root process:
(gdb) bt
#0 0x0000778b189ec4a3 in ?? () from /lib/x86_64-linux-gnu/libmpich.so.12
#1 0x0000778b189edbb1 in ?? () from /lib/x86_64-linux-gnu/libmpich.so.12
#2 0x0000778b189eedeb in ?? () from /lib/x86_64-linux-gnu/libmpich.so.12
#3 0x0000778b1884d27a in PMPI_Barrier () from /lib/x86_64-linux-gnu/libmpich.so.12
#4 0x0000000000503952 in parmci_msg_barrier () at ../src/message.c:110
#5 0x000000000041a8a8 in pnga_msg_sync () at ../global/src/collect.c:149
#6 0x000000000049507a in pnga_sync () at ../global/src/onesided.c:166
#7 0x000000000046ec44 in pnga_fill_patch (g_a=-1000, lo=0x7ffc1e975fb0, hi=0x7ffc1e975fa0, val=0x7ffc1e9760e8)
at ../global/src/global.npatch.c:1920
#8 0x000000000042a5ce in sga_fill_patch (g_a=-1000, ilo=0x510368, ihi=0x510370, jlo=0x510368, jhi=0x510368,
val=0x7ffc1e9760e8) at ../global/src/fapi.c:2822
#9 0x000000000042a61a in ga_fill_patch_ (g_a=0x7ffc1e976380, ilo=0x510368, ihi=0x510370, jlo=0x510368, jhi=0x510368,
val=0x7ffc1e9760e8) at ../global/src/fapi.c:2827
#10 0x0000000000403d94 in testputgetacc1 (g_a=-1000, n=<optimized out>, chunk=..., num_chunks=<optimized out>,
buf=..., ilo=1, ihi=262144, jlo=1, jhi=1, local=.FALSE., buf=..., num_chunks=<optimized out>, chunk=...,
n=<optimized out>) at ../global/testing/mir_perf2.F:144
#11 0x0000000000404705 in test1d () at ../global/testing/mir_perf2.F:92
#12 perf () at ../global/testing/mir_perf2.F:32
#13 0x00000000004038dd in main (argc=<optimized out>, argv=<optimized out>) at ../global/testing/mir_perf2.F:39
non-root processes:
(gdb) bt
#0 0x000071212c7ebd47 in ?? () from /lib/x86_64-linux-gnu/libmpich.so.12
#1 0x000071212c7ed09d in ?? () from /lib/x86_64-linux-gnu/libmpich.so.12
#2 0x000071212c8010af in ?? () from /lib/x86_64-linux-gnu/libmpich.so.12
#3 0x000071212c63ee82 in PMPI_Allreduce () from /lib/x86_64-linux-gnu/libmpich.so.12
#4 0x0000000000502cbf in gmr_destroy (mreg=0x14bc090, group=0x73b6c0 <ARMCI_GROUP_WORLD>) at ../src/gmr.c:313
#5 0x000000000050221f in ARMCI_Free_group (ptr=0x7120fa7a0008, group=0x73b6c0 <ARMCI_GROUP_WORLD>)
at ../src/malloc.c:107
#6 0x0000000000502017 in PARMCI_Free (ptr=0x7120fa7a0008) at ../src/malloc.c:53
#7 0x00000000004f171a in pnga_destroy (g_a=-1000) at ../global/src/base.c:4015
#8 0x00000000004f1dae in pnga_terminate () at ../global/src/base.c:4151
#9 0x0000000000426229 in ga_terminate_ () at ../global/src/fapi.c:1116
#10 0x000000000040471c in perf () at ../global/testing/mir_perf2.F:36
#11 0x00000000004038dd in main (argc=<optimized out>, argv=<optimized out>) at ../global/testing/mir_perf2.F:39
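As a rough sketch, the two backtraces above can be reduced to "which MPI call and which GA entry point is each process stuck in" with a few lines of Python. The helper names (`frame_functions`, `summarize`) and the trimmed sample strings are hypothetical and used only for illustration; the frames are copied from the gdb output above, and the parsing assumes gdb's default `bt` line format.

```python
import re
from typing import List, Optional, Tuple

# Matches gdb frame lines in either of these shapes:
#   "#3 0x0000778b1884d27a in PMPI_Barrier () from /lib/..."
#   "#12 perf () at ../global/testing/mir_perf2.F:32"
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+\s+in\s+)?(\S+)\s*\(")


def frame_functions(backtrace: str) -> List[str]:
    """Function names from innermost to outermost frame, skipping unresolved '??' frames."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and match.group(2) != "??":
            names.append(match.group(2))
    return names


def summarize(backtrace: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (innermost MPI call, innermost Fortran-facing ga_* call), or None where absent."""
    names = frame_functions(backtrace)
    mpi_call = next((n for n in names if n.startswith(("PMPI_", "MPI_"))), None)
    ga_call = next((n for n in names if n.startswith("ga_")), None)
    return mpi_call, ga_call


# Trimmed copies of the frames shown above (illustrative sample input).
ROOT_BT = """\
#0 0x0000778b189ec4a3 in ?? () from /lib/x86_64-linux-gnu/libmpich.so.12
#3 0x0000778b1884d27a in PMPI_Barrier () from /lib/x86_64-linux-gnu/libmpich.so.12
#6 0x000000000049507a in pnga_sync () at ../global/src/onesided.c:166
#9 0x000000000042a61a in ga_fill_patch_ (g_a=0x7ffc1e976380, ilo=0x510368, ihi=0x510370, jlo=0x510368, jhi=0x510368,
val=0x7ffc1e9760e8) at ../global/src/fapi.c:2827
#12 perf () at ../global/testing/mir_perf2.F:32
"""

NONROOT_BT = """\
#0 0x000071212c7ebd47 in ?? () from /lib/x86_64-linux-gnu/libmpich.so.12
#3 0x000071212c63ee82 in PMPI_Allreduce () from /lib/x86_64-linux-gnu/libmpich.so.12
#8 0x00000000004f1dae in pnga_terminate () at ../global/src/base.c:4151
#9 0x0000000000426229 in ga_terminate_ () at ../global/src/fapi.c:1116
#10 0x000000000040471c in perf () at ../global/testing/mir_perf2.F:36
"""

if __name__ == "__main__":
    print("root:    ", summarize(ROOT_BT))
    print("non-root:", summarize(NONROOT_BT))
```

On these inputs the summary shows the root process waiting in `PMPI_Barrier` under `ga_fill_patch_` while the other processes wait in `PMPI_Allreduce` under `ga_terminate_`, so the ranks are sitting in different collectives and neither can complete.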
I bisected and traced the problem to this commit:
Removing sleep() should not break correctness. I am concerned that something is broken in the logic of these tests, or perhaps the underlying implementation, but since NWChem doesn't use this stuff, I don't think about it much.
I see global/testing/mir_perf2.x hanging on my workstation with any build of ARMCI/ComEx at the 2K limit.