GlobalArrays / ga

Partitioned Global Address Space (PGAS) library for distributed arrays
http://hpc.pnl.gov/globalarrays/

global/testing/mir_perf2.x broken by lack of sleep #341

Closed: jeffhammond closed this issue 3 months ago

jeffhammond commented 3 months ago

I see global/testing/mir_perf2.x hang on my workstation with any build of ARMCI/ComEx. The last line it prints is the 2048-byte (2K) row:

                          Remote 1-D Array Section
     section           get               put           accumulate
  bytes    dim     sec      MB/s     sec      MB/s     sec      MB/s
      8      1  .924D-06 .866D+01 .119D-05 .674D+01 .878D-06 .911D+01
     72      9  .847D-06 .850D+02 .118D-05 .612D+02 .978D-06 .736D+02
    128     16  .851D-06 .150D+03 .119D-05 .107D+03 .981D-06 .130D+03
    648     81  .861D-06 .752D+03 .120D-05 .542D+03 .102D-05 .636D+03
   2048    256  .876D-06 .234D+04 .127D-05 .162D+04 .112D-05 .183D+04
<hangs>

I backtraced and see that the root process is blocked in the barrier that ga_fill_patch reaches through pnga_sync, while the rest of the processes are already in ga_terminate, waiting in an allreduce under pnga_destroy.

root process:

(gdb) bt
#0  0x0000778b189ec4a3 in ?? () from /lib/x86_64-linux-gnu/libmpich.so.12
#1  0x0000778b189edbb1 in ?? () from /lib/x86_64-linux-gnu/libmpich.so.12
#2  0x0000778b189eedeb in ?? () from /lib/x86_64-linux-gnu/libmpich.so.12
#3  0x0000778b1884d27a in PMPI_Barrier () from /lib/x86_64-linux-gnu/libmpich.so.12
#4  0x0000000000503952 in parmci_msg_barrier () at ../src/message.c:110
#5  0x000000000041a8a8 in pnga_msg_sync () at ../global/src/collect.c:149
#6  0x000000000049507a in pnga_sync () at ../global/src/onesided.c:166
#7  0x000000000046ec44 in pnga_fill_patch (g_a=-1000, lo=0x7ffc1e975fb0, hi=0x7ffc1e975fa0, val=0x7ffc1e9760e8)
    at ../global/src/global.npatch.c:1920
#8  0x000000000042a5ce in sga_fill_patch (g_a=-1000, ilo=0x510368, ihi=0x510370, jlo=0x510368, jhi=0x510368,
    val=0x7ffc1e9760e8) at ../global/src/fapi.c:2822
#9  0x000000000042a61a in ga_fill_patch_ (g_a=0x7ffc1e976380, ilo=0x510368, ihi=0x510370, jlo=0x510368, jhi=0x510368,
    val=0x7ffc1e9760e8) at ../global/src/fapi.c:2827
#10 0x0000000000403d94 in testputgetacc1 (g_a=-1000, n=<optimized out>, chunk=..., num_chunks=<optimized out>,
    buf=..., ilo=1, ihi=262144, jlo=1, jhi=1, local=.FALSE., buf=..., num_chunks=<optimized out>, chunk=...,
    n=<optimized out>) at ../global/testing/mir_perf2.F:144
#11 0x0000000000404705 in test1d () at ../global/testing/mir_perf2.F:92
#12 perf () at ../global/testing/mir_perf2.F:32
#13 0x00000000004038dd in main (argc=<optimized out>, argv=<optimized out>) at ../global/testing/mir_perf2.F:39

non-root processes:

(gdb) bt
#0  0x000071212c7ebd47 in ?? () from /lib/x86_64-linux-gnu/libmpich.so.12
#1  0x000071212c7ed09d in ?? () from /lib/x86_64-linux-gnu/libmpich.so.12
#2  0x000071212c8010af in ?? () from /lib/x86_64-linux-gnu/libmpich.so.12
#3  0x000071212c63ee82 in PMPI_Allreduce () from /lib/x86_64-linux-gnu/libmpich.so.12
#4  0x0000000000502cbf in gmr_destroy (mreg=0x14bc090, group=0x73b6c0 <ARMCI_GROUP_WORLD>) at ../src/gmr.c:313
#5  0x000000000050221f in ARMCI_Free_group (ptr=0x7120fa7a0008, group=0x73b6c0 <ARMCI_GROUP_WORLD>)
    at ../src/malloc.c:107
#6  0x0000000000502017 in PARMCI_Free (ptr=0x7120fa7a0008) at ../src/malloc.c:53
#7  0x00000000004f171a in pnga_destroy (g_a=-1000) at ../global/src/base.c:4015
#8  0x00000000004f1dae in pnga_terminate () at ../global/src/base.c:4151
#9  0x0000000000426229 in ga_terminate_ () at ../global/src/fapi.c:1116
#10 0x000000000040471c in perf () at ../global/testing/mir_perf2.F:36
#11 0x00000000004038dd in main (argc=<optimized out>, argv=<optimized out>) at ../global/testing/mir_perf2.F:39
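The two backtraces above show ranks waiting in different collectives: the root in `PMPI_Barrier`, the others in `PMPI_Allreduce`. Collectives that do not match across ranks cannot complete, which is consistent with the hang. As a minimal sketch of that reasoning (not part of GA or the test suite), the following compares per-rank sequences of collective calls and reports the first step where they diverge. The function name `first_mismatch`, the three-rank layout, and the one-entry traces are illustrative assumptions; only the innermost collective of each backtrace is taken from the report.

```python
from itertools import zip_longest


def first_mismatch(traces):
    """Return (step, {rank: call}) for the first step where ranks disagree.

    `traces` maps a rank to the ordered list of collective calls it made.
    A rank that has run out of calls is reported as None for that step.
    Returns None when every rank made the same sequence of calls.
    """
    ranks = sorted(traces)
    for step, calls in enumerate(zip_longest(*(traces[r] for r in ranks))):
        if len(set(calls)) > 1:
            return step, dict(zip(ranks, calls))
    return None


# Hypothetical traces: innermost collective from each backtrace above,
# with an assumed run of three ranks (rank 0 is the root).
observed = {
    0: ["MPI_Barrier"],    # pnga_fill_patch -> pnga_sync
    1: ["MPI_Allreduce"],  # pnga_terminate -> pnga_destroy -> gmr_destroy
    2: ["MPI_Allreduce"],
}

print(first_mismatch(observed))
```

A result other than `None` means some rank took a different control path than the others before reaching a collective, which is where to start looking in the test's logic.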

I bisected to attribute the problem to this commit:

commit d9c78c147a19ac3fe1df3f2cc6037737e01d5369
Author: Ajay Panyala <ajay.panyala@gmail.com>
Date:   Fri Apr 19 12:33:49 2024 -0700

    remove sleep calls in perf tests

 cmake/f2c_dummy.h.in       |  5 +++--
 global/testing/mir_perf1.F | 23 ++++++-----------------
 global/testing/mir_perf2.F | 23 ++++++-----------------
 global/testing/perf.F      | 23 ++++++-----------------
 global/testing/perfmod.F   | 14 +++-----------
 global/testing/perform.F   | 14 +++-----------
 6 files changed, 27 insertions(+), 75 deletions(-)
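A bisection for a hang like this one can be automated with `git bisect run`, provided the test is wrapped in a timeout so that a hang counts as a failure. The sketch below is one way to do that, not necessarily how the bisection above was done. The helper name `bisect_status` and the 120-second default are assumptions; the exit codes 0 (good), 1 (bad), and 125 (skip this commit) are the ones `git bisect run` interprets.

```python
import subprocess

GOOD, BAD, SKIP = 0, 1, 125  # exit codes interpreted by `git bisect run`


def bisect_status(cmd, timeout_s=120.0):
    """Run `cmd` and map the outcome to a `git bisect run` exit code."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return BAD   # still running after timeout_s: treat as the hang
    except FileNotFoundError:
        return SKIP  # launcher or binary missing, e.g. the build failed
    return GOOD if result.returncode == 0 else BAD
```

A wrapper script would rebuild the tree, then end with something like `sys.exit(bisect_status(["mpirun", "-np", "4", "global/testing/mir_perf2.x"]))`, where the launcher, process count, and path are assumptions about the local setup. Note that the timeout kills only the process started directly, so ranks spawned by the launcher may need separate cleanup between steps.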

Removing sleep() should not break correctness. I am concerned that something is broken in the logic of these tests, or perhaps the underlying implementation, but since NWChem doesn't use this stuff, I don't think about it much.

ajaypanyala commented 3 months ago

That commit is indeed the culprit. It was a silly mistake. Fix pushed to develop.