common-tools-interface / cti

Common Tools Interface

Tool attaches after PALS barrier #19

Open kent-cheung-arm opened 2 years ago

kent-cheung-arm commented 2 years ago

In moderately sized PALS jobs, the tool sometimes attaches after the barrier even though cti_releaseAppBarrier has not yet been called.

Attaching at barrier as expected:

[57.938218490]info stack
&"info stack\n"
~"#0  0x00002b99d2b92707 in kill () from /lib64/libc.so.6\n"
~"#1  0x00002b99d4971123 in pals_start_barrier (state=state@entry=0x2b99d3a0db00) at /workspace/rpmbuild/BUILD/cray-pals-1.1.3/src/libpals/libpals.c:843\n"
~"#2  0x00002b99d37ff64e in _pmi_pals_sync () at /workspace/src/pals/pals_utils.c:408\n"
~"#3  0x00002b99d37f6ab4 in _pmi_init (spawned=spawned@entry=0x7ffd06de0c1c) at /workspace/src/pmi_core/_pmi_init.c:1431\n"
~"#4  0x00002b99d37f74f4 in _pmi_constructor () at /workspace/src/pmi_core/_pmi_init.c:366\n"
~"#5  0x00002b99cf708aba in call_init.part () from /lib64/ld-linux-x86-64.so.2\n"
~"#6  0x00002b99cf708bc6 in _dl_init () from /lib64/ld-linux-x86-64.so.2\n"
~"#7  0x00002b99cf6f9eda in _dl_start_user () from /lib64/ld-linux-x86-64.so.2\n"
~"#8  0x0000000000000002 in ?? ()\n"
~"#9  0x00007ffd06de261e in ?? ()\n"
~"#10 0x00007ffd06de263e in ?? ()\n"
~"#11 0x0000000000000000 in ?? ()\n"

Attaching at MPI_Init after barrier:

[58.538222668]info stack
&"info stack\n"
~"#0  0x00002b10b6ff64eb in _pmi_smp_barrier_join (smp_bar=0x2b10b7218310, restrict_to_app=restrict_to_app@entry=0) at /workspace/src/pmi_core/smp_barrier.c:81\n"
~"#1  0x00002b10b6fee137 in _pmi_barrier (bar_tag=bar_tag@entry=BARRIER_PACKET, restrict_to_app=restrict_to_app@entry=0) at /workspace/src/pmi_core/_pmi_barrier.c:50\n"
~"#2  0x00002b10b6ff90d1 in PMI_Barrier () at /workspace/src/api/coll/pmi_barrier.c:27\n"
~"#3  0x00002b10b6ff9977 in PMI2_Init (spawned=0x7ffc4ecea6a0, size=0x7ffc4ecea6a8, rank=0x7ffc4ecea6a4, appnum=0x7ffc4ecea6ac) at /workspace/src/api/misc/pmi_init.c:182\n"
~"#4  0x00002b10b577bd41 in MPIR_pmi_init () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12\n"
~"#5  0x00002b10b5780f76 in MPID_Init () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12\n"
~"#6  0x00002b10b3cec96d in MPIR_Init_thread () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12\n"
~"#7  0x00002b10b3cec744 in PMPI_Init () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12\n"
~"#8  0x000000000040138d in main (argc=2, argv=0x7ffc4ecea908)\n"

The call to cti_releaseAppBarrier occurred at 59.749728688 in this run.
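To make the ordering concrete, here is a minimal sketch that computes how long before the cti_releaseAppBarrier call the second trace was captured. It assumes that the bracketed log timestamp and the release timestamp come from the same clock and are in seconds; the function name `seconds_before_release` is illustrative only and is not part of CTI.

```python
from decimal import Decimal

# Timestamps copied from the logs above (assumed to be seconds on a shared clock).
ATTACH_AFTER_BARRIER = Decimal("58.538222668")  # "info stack" showing PMI2_Init
RELEASE_CALL = Decimal("59.749728688")          # cti_releaseAppBarrier call


def seconds_before_release(attach_ts, release_ts):
    """Return how many seconds before the release call the attach happened.

    A positive result means the tool observed the rank before
    cti_releaseAppBarrier was called.
    """
    return release_ts - attach_ts


gap = seconds_before_release(ATTACH_AFTER_BARRIER, RELEASE_CALL)
print(f"rank was already past the barrier {gap} s before the release call")
```

Under those assumptions, the rank was already inside PMI2_Init roughly 1.2 seconds before the release was requested.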

ardangelo commented 1 year ago

We have started looking in more depth at a similar issue filed under PE-43365. There, ranks were stopped before they had reached the barrier release. As a result, the SIGCONT was consumed when some of those ranks were continued, and they got stuck in their barrier. The PALS team is looking into the root cause. CTI in PE 22.11 will have an environment variable, CTI_PALS_BARRIER_RELEASE_DELAY, that delays releasing applications from the startup barrier. In our testing, setting it to 1 second is usually sufficient to avoid this race condition.
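As a rough illustration of the workaround, the sketch below sets CTI_PALS_BARRIER_RELEASE_DELAY in the environment of a tool launched from Python. The helper names and the tool command line are hypothetical. The value is assumed to be a number of seconds, based on the "1 second" setting mentioned above; the exact accepted format is not confirmed here.

```python
import os
import subprocess


def env_with_barrier_delay(delay_seconds=1, base_env=None):
    """Return a copy of the environment with the CTI barrier delay set.

    Assumption: the variable takes a delay in seconds.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CTI_PALS_BARRIER_RELEASE_DELAY"] = str(delay_seconds)
    return env


def launch_tool(tool_argv, delay_seconds=1):
    """Run a CTI-based tool (hypothetical command line) with the delay applied."""
    return subprocess.run(
        tool_argv,
        env=env_with_barrier_delay(delay_seconds),
        check=False,
    )
```

A call such as `launch_tool(["my_tool", "./app"])` (both names are placeholders) would then start the tool with a 1 second delay. Exporting the variable in the job script before starting the tool should have the same effect.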