flux-framework / flux-pmix

flux shell plugin to bootstrap openmpi v5+
GNU Lesser General Public License v3.0
2 stars 4 forks source link

not ok 3 - 1n2p barrier tolerates pmix.timeout=2 #70

Closed garlick closed 1 year ago

garlick commented 1 year ago

This test sometimes fails in CI in the fedora builder:

    expecting success:
            run_timeout 30 flux mini run -N1 -n2 \
                    ${BARRIER} --timeout=2

      fkdhs3d.0: completed PMIx_Init.
      fkdhs3d.1: PMIx_Fence: TIMEOUT
      fkdhs3d.0: there are 2 tasks
      fkdhs3d.0: completed barrier in 0.003s.
      fkdhs3d.0: completed PMIx_Finalize.
      0.106s: flux-shell[0]:  WARN: pmix: unknown optional fence attr: pmix.timeout
      0.106s: flux-shell[0]:  WARN: pmix: unknown optional fence attr: pmix.sorted.parr
      flux-job: task(s) exited with exit code 1
      not ok 3 - 1n2p barrier tolerates pmix.timeout=2
garlick commented 1 year ago

In my issue#68 branch I added some debug which at least affirms that some things are working correctly:

 expecting success: 
    run_timeout 30 flux mini run -N1 -n2 -overbose=2 \
        ${BARRIER} --timeout=2

  0.047s: flux-shell[0]: DEBUG: Loading /etc/flux/shell/initrc.lua
  0.047s: flux-shell[0]: TRACE: Sucessfully loaded flux.shell module
  0.047s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/initrc.lua
  0.047s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/intel_mpi.lua
  0.047s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/mvapich.lua
  0.049s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/openmpi.lua
  0.049s: flux-shell[0]: TRACE: trying to load /usr/src/t/etc/rc.lua
  0.052s: flux-shell[0]: DEBUG: output: batch timeout = 0.500s
  0.053s: flux-shell[0]: DEBUG: pmix: jobid = 22733127680
  0.053s: flux-shell[0]: DEBUG: pmix: shell_rank = 0
  0.053s: flux-shell[0]: DEBUG: pmix: local_nprocs = 2
  0.053s: flux-shell[0]: DEBUG: pmix: total_nprocs = 2
  0.053s: flux-shell[0]: DEBUG: pmix: server outsourced to OpenPMIx 4.2.2rc2
  0.067s: flux-shell[0]: DEBUG: pmix: local_peers = 0,1
  0.067s: flux-shell[0]: DEBUG: pmix: node_map = fv-az566-435
  0.067s: flux-shell[0]: DEBUG: pmix: proc_map = 0,1
  0.067s: flux-shell[0]: DEBUG: 0: task_count=2 slot_count=2 cores_per_slot=1 slots_per_node=2
  0.067s: flux-shell[0]: DEBUG: 0: tasks [0-1] on cores 0-1
  0.091s: flux-shell[0]: TRACE: pmi: 0: C: pmi EOF
  0.091s: flux-shell[0]: DEBUG: task 0 complete status=1
  fbdrDrF.0: completed PMIx_Init.
  fbdrDrF.0: there are 2 tasks
  fbdrDrF.0: Setting pmix.timeout=2 flags=0
  fbdrDrF.0: PMIx_Fence: TIMEOUT
  fbdrDrF.1: Setting pmix.timeout=2 flags=0
  0.106s: flux-shell[0]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"fbdrDrF","rank":-2}],"info":[{"key":"pmix.timeout","flags":0,"value":{"type":6,"data":-766578431}},{"key":"pmix.sorted.parr","flags":0,"value":{"type":1,"data":true}},{"key":"pmix.loc.col.st","flags":0,"value":{"type":20,"data":0}}],"data":"CQIbBAkCAgA=","cbfunc":140250909953355,"cbdata":140250747890192}
  0.106s: flux-shell[0]: DEBUG: pmix: ignoring unknown optional fence attr: pmix.timeout
  0.106s: flux-shell[0]: DEBUG: pmix: ignoring unknown optional fence attr: pmix.sorted.parr
  0.106s: flux-shell[0]: DEBUG: pmix: ignoring unknown optional fence attr: pmix.loc.col.st
  0.106s: flux-shell[0]: TRACE: pmix: completed pmix exchange 0: size 0 SUCCESS
  0.106s: flux-shell[0]: TRACE: pmix: starting pmix exchange 0: size 0
  0.107s: flux-shell[0]: TRACE: pmi: 1: C: pmi EOF
  0.107s: flux-shell[0]: DEBUG: task 1 complete status=0
  0.117s: flux-shell[0]: DEBUG: exit 1
  flux-job: task(s) exited with exit code 1
  not ok 3 - 1n2p barrier tolerates pmix.timeout=2
garlick commented 1 year ago

Running the barrier client under valgrind (with one task) shows some kinda disturbing stuff:

==1227712== Memcheck, a memory error detector
==1227712== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1227712== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==1227712== Command: /home/garlick/proj/flux-pmix/t/src/barrier --timeout=2
==1227712== 
==1227712== Syscall param socketcall.getsockopt(optlen) points to uninitialised byte(s)
==1227712==    at 0x4BF279E: getsockopt_syscall (getsockopt.c:29)
==1227712==    by 0x4BF279E: getsockopt (getsockopt.c:94)
==1227712==    by 0x49F03B7: pmix_ptl_base_set_timeout (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x49F0766: pmix_ptl_base_make_connection (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x49F6EE5: connect_to_peer (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x48D2AB2: PMIx_Init (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x10A50E: main (barrier.c:189)
==1227712==  Address 0x1ffeffcf1c is on thread 1's stack
==1227712==  in frame #2, created by pmix_ptl_base_make_connection (???:)
==1227712== 
==1227712== Syscall param socketcall.getsockopt(optlen_out) points to uninitialised byte(s)
==1227712==    at 0x4BF279E: getsockopt_syscall (getsockopt.c:29)
==1227712==    by 0x4BF279E: getsockopt (getsockopt.c:94)
==1227712==    by 0x49F03B7: pmix_ptl_base_set_timeout (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x49F0766: pmix_ptl_base_make_connection (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x49F6EE5: connect_to_peer (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x48D2AB2: PMIx_Init (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x10A50E: main (barrier.c:189)
==1227712==  Address 0x1ffeffcf1c is on thread 1's stack
==1227712==  in frame #2, created by pmix_ptl_base_make_connection (???:)
==1227712== 
==1227712== Syscall param setsockopt(optlen) contains uninitialised byte(s)
==1227712==    at 0x4BF2C7E: setsockopt_syscall (setsockopt.c:29)
==1227712==    by 0x4BF2C7E: setsockopt (setsockopt.c:95)
==1227712==    by 0x49F0931: pmix_ptl_base_make_connection (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x49F6EE5: connect_to_peer (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x48D2AB2: PMIx_Init (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x10A50E: main (barrier.c:189)
==1227712== 
fN4coD9.0: completed PMIx_Init.
fN4coD9.0: there are 1 tasks
fN4coD9.0: Setting pmix.timeout=2 flags=0
==1227712== Conditional jump or move depends on uninitialised value(s)
==1227712==    at 0x49DB78F: flex128_encode_int (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x497A028: pmix41_bfrops_base_pack_general_int (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x496B49E: pmix_bfrops_base_pack_val (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x496B6A6: pmix_bfrops_base_pack_info (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x496980E: pmix_bfrops_base_pack (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x48D7DA1: PMIx_Fence_nb (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x48D8044: PMIx_Fence (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x10A6AE: main (barrier.c:218)
==1227712== 
==1227712== Conditional jump or move depends on uninitialised value(s)
==1227712==    at 0x49DB55F: flex128_encode_int (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x497A028: pmix41_bfrops_base_pack_general_int (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x496B49E: pmix_bfrops_base_pack_val (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x496B6A6: pmix_bfrops_base_pack_info (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x496980E: pmix_bfrops_base_pack (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x48D7DA1: PMIx_Fence_nb (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x48D8044: PMIx_Fence (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)
==1227712==    by 0x10A6AE: main (barrier.c:218)
==1227712== 
grondo commented 1 year ago

flex128_encode_int (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)

I thought this test was supposed to be using openpmix-4.2.2? Ignore me if this is on a real system and 4.1.0 should work fine.

garlick commented 1 year ago

Oops that was from a manual run in an environment I need to update - will try again

On Wed, Dec 21, 2022, 7:52 AM Mark Grondona @.***> wrote:

flex128_encode_int (in /opt/openpmix-4.1.0/lib/libpmix.so.2.5.0)

I thought this test was supposed to be using openpmix-4.2.2? Ignore me if this is on a real system and 4.1.0 should work fine.

— Reply to this email directly, view it on GitHub https://github.com/flux-framework/flux-pmix/issues/70#issuecomment-1361545133, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABJPW3PEZXAPEPH7PKCKGLWOMRUPANCNFSM6AAAAAATFX7K6E . You are receiving this because you authored the thread.Message ID: @.***>

garlick commented 1 year ago

This time with pmix-4.2.2

==1754783== Conditional jump or move depends on uninitialised value(s)
==1754783==    at 0x4A0D11F: flex128_encode_int (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x49A00CD: pmix41_bfrops_base_pack_general_int (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x498CEC3: pmix_bfrops_base_pack_val (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x498D0D1: pmix_bfrops_base_pack_info (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x498AEB8: pmix_bfrops_base_pack (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x48D7261: PMIx_Fence_nb (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x48D756D: PMIx_Fence (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x10A63F: main (barrier.c:218)
==1754783== 
==1754783== Conditional jump or move depends on uninitialised value(s)
==1754783==    at 0x4A0CEEF: flex128_encode_int (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x49A00CD: pmix41_bfrops_base_pack_general_int (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x498CEC3: pmix_bfrops_base_pack_val (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x498D0D1: pmix_bfrops_base_pack_info (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x498AEB8: pmix_bfrops_base_pack (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x48D7261: PMIx_Fence_nb (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x48D756D: PMIx_Fence (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x10A63F: main (barrier.c:218)
==1754783== 
==1754783== Conditional jump or move depends on uninitialised value(s)
==1754783==    at 0x4A0CF53: flex128_encode_int (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x49A00CD: pmix41_bfrops_base_pack_general_int (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x498CEC3: pmix_bfrops_base_pack_val (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x498D0D1: pmix_bfrops_base_pack_info (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x498AEB8: pmix_bfrops_base_pack (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x48D7261: PMIx_Fence_nb (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x48D756D: PMIx_Fence (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x10A63F: main (barrier.c:218)
==1754783== 
==1754783== Thread 2:
==1754783== Syscall param writev(vector[...]) points to uninitialised byte(s)
==1754783==    at 0x4BC6C3D: __writev (writev.c:26)
==1754783==    by 0x4BC6C3D: writev (writev.c:24)
==1754783==    by 0x4A14746: pmix_ptl_base_send_handler (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x4CFCFD7: ??? (in /usr/lib/x86_64-linux-gnu/libevent_core-2.1.so.7.0.1)
==1754783==    by 0x4CFEBF6: event_base_loop (in /usr/lib/x86_64-linux-gnu/libevent_core-2.1.so.7.0.1)
==1754783==    by 0x492ECA8: progress_engine (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x4B40B42: start_thread (pthread_create.c:442)
==1754783==    by 0x4BD1BB3: clone (clone.S:100)
==1754783==  Address 0x4ff0691 is 49 bytes inside a block of size 128 alloc'd
==1754783==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1754783==    by 0x4986EFC: pmix_bfrop_buffer_extend (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x49A0062: pmix41_bfrops_base_pack_general_int (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x4986F88: pmix_bfrop_store_data_type (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x498AEEF: pmix_bfrops_base_pack (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x48D6BCB: PMIx_Fence_nb (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x48D756D: PMIx_Fence (in /opt/openmpi-5.0.0rc9/lib/libpmix.so.2.6.2)
==1754783==    by 0x10A63F: main (barrier.c:218)
==1754783== 
garlick commented 1 year ago

I'm not sure why it's the case, but these warnings go away with the following change:

diff --git a/t/src/barrier.c b/t/src/barrier.c
index dd988e7..225acd1 100644
--- a/t/src/barrier.c
+++ b/t/src/barrier.c
@@ -212,6 +212,9 @@ int main (int argc, char **argv)
     monotime (&t);
     pmix_info_t info[8];
     pmix_proc_t *procs = NULL;
+
+    memset (info, 0, sizeof (info));
+
     size_t ninfo = parse_info_opts (p, info, sizeof (info) / sizeof (info[0]));
     size_t nprocs = parse_procs_opt (p, &self, &procs);

The ninfo array elements that are actually used (a subset of the stack-allocated array) are fully initialized so I'm not sure why valgrind is upset.

grondo commented 1 year ago

I've seen valgrind report false positive for uninitialized bytes when some optimization or other platform dependent thing was actually causing initialization that valigrind couldn't see. Not sure that is the case here, but by using memset you could have hinted to valgrind that the bytes are indeed initialized.

garlick commented 1 year ago

Should have been closed by #71