flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

t2004-hydra failing on basic build #6072

Closed: sam-maloney closed this issue 1 day ago

sam-maloney commented 3 days ago

Test t2004-hydra is consistently failing when I run make check after building. I am doing a standard build with no special configuration options: just ./autogen.sh, then ./configure, then make.

It occurs both when starting from a clone of the current GitHub repository and when using the latest (pre-)release tarball, currently v0.63.0 (without running ./autogen.sh in that case).

I am building on Ubuntu 24.04 using GCC 13.2.0 and mpiexec.hydra 4.2.0, as well as on a VM running Rocky 9 using GCC 11.4.1 and mpiexec.hydra 4.1.1. I get the same failures on both; the contents of the related test files are below:

contents of t2004-hydra.log

/usr/bin/mpiexec.hydra
ok 1 - Flux libpmi-client wire protocol works with Hydra
PASS: t2004-hydra.t 1 - Flux libpmi-client wire protocol works with Hydra
not ok 2 - Hydra can launch Flux
FAIL: t2004-hydra.t 2 - Hydra can launch Flux
#   
#       mpiexec.hydra -n 4 flux start ${ARGS} flux getattr size
#   
# failed 1 among 2 test(s)
1..2
ERROR: t2004-hydra.t - exited with status 1

contents of t2004-hydra.trs

:global-test-result: ERROR
:recheck: yes
:copy-in-global-log: yes
:test-result: PASS
:test-result: FAIL
:test-result: ERROR

contents of t2004-hydra.#######.counts

total 2
success 1
fixed 0
broken 0
failed 1

Happy to provide other information if it might be helpful.

garlick commented 3 days ago

Thanks! Anything interesting in the test output when run with -v?

My successful test shows:

$ ./t2004-hydra.t -v
/usr/bin/mpiexec.hydra
expecting success: 
    mpiexec.hydra -n 4 ${PMI_INFO}

0: size=4 appnum=0 maxes=256:64:1024 kvsname=kvs_922436_0_254925627_picl0
1: size=4 appnum=0 maxes=256:64:1024 kvsname=kvs_922436_0_254925627_picl0
2: size=4 appnum=0 maxes=256:64:1024 kvsname=kvs_922436_0_254925627_picl0
3: size=4 appnum=0 maxes=256:64:1024 kvsname=kvs_922436_0_254925627_picl0
ok 1 - Flux libpmi-client wire protocol works with Hydra

expecting success: 
    mpiexec.hydra -n 4 flux start ${ARGS} flux getattr size

4
ok 2 - Hydra can launch Flux

# passed all 2 test(s)
1..2
sam-maloney commented 2 days ago

I get:

$ ./t2004-hydra.t -v
/usr/bin/mpiexec.hydra
expecting success: 
    mpiexec.hydra -n 4 ${PMI_INFO}

0: size=4 appnum=0 maxes=256:64:1024 kvsname=kvs_487144_0_1843522888_zam723
2: size=4 appnum=0 maxes=256:64:1024 kvsname=kvs_487144_0_1843522888_zam723
1: size=4 appnum=0 maxes=256:64:1024 kvsname=kvs_487144_0_1843522888_zam723
3: size=4 appnum=0 maxes=256:64:1024 kvsname=kvs_487144_0_1843522888_zam723
ok 1 - Flux libpmi-client wire protocol works with Hydra

expecting success: 
    mpiexec.hydra -n 4 flux start ${ARGS} flux getattr size

flux-broker: error decoding rank 1 business card
flux-broker: bootstrap failed
flux-broker: error decoding rank 0 business card
flux-broker: bootstrap failed
flux-broker: error decoding rank 0 business card
flux-broker: bootstrap failed
flux-broker: error decoding rank 0 business card
flux-broker: bootstrap failed

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 487153 RUNNING AT zam723
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
not ok 2 - Hydra can launch Flux
#   
#       mpiexec.hydra -n 4 flux start ${ARGS} flux getattr size
#   

# failed 1 among 2 test(s)
1..2
garlick commented 2 days ago

Oh, that is surprising. I built mpich-4.2.0 on my Ubuntu 22.04 development system (no options passed to ./configure) and reproduced it. Here is a run with a client PMI trace:

$ FLUX_PMI_DEBUG=1 /home/garlick/other/mpich/src/pm/hydra/mpiexec.hydra  -n 4 src/cmd/flux start  flux getattr size
boot_pmi: flux-pmi-client: trying 'simple'
boot_pmi: simple: selected
boot_pmi: flux-pmi-client: trying 'simple'
boot_pmi: simple: selected
boot_pmi: simple: initialize: rank=1 size=4 name=kvs_2287808_0_956797288_system76-pc: success
boot_pmi: simple: initialize: rank=2 size=4 name=kvs_2287808_0_956797288_system76-pc: success
boot_pmi: simple: get key=flux.instance-level: PMI not initialized
boot_pmi: simple: get key=flux.instance-level: PMI not initialized
boot_pmi: simple: get key=flux.taskmap: PMI not initialized
boot_pmi: simple: get key=flux.taskmap: PMI not initialized
boot_pmi: simple: get key=PMI_process_mapping value=(vector,(0,1,4)) found=TRUE: success
boot_pmi: simple: get key=PMI_process_mapping value=(vector,(0,1,4)) found=TRUE: success
boot_pmi: simple: put key=1 value={"hostname":"system76-pc","pubkey":"4c>8I1/IV+O2[4uQO0APVB-P[2%!PiV-d#]h06W1","uri":""}: success
boot_pmi: simple: put key=2 value={"hostname":"system76-pc","pubkey":"./:@UF8OPH^)1y.yekqlz>x#?M{w?p&OHRzn6i.t","uri":""}: success
boot_pmi: flux-pmi-client: trying 'simple'
boot_pmi: simple: selected
boot_pmi: flux-pmi-client: trying 'simple'
boot_pmi: simple: selected
boot_pmi: simple: initialize: rank=0 size=4 name=kvs_2287808_0_956797288_system76-pc: success
boot_pmi: simple: get key=flux.instance-level: PMI not initialized
boot_pmi: simple: initialize: rank=3 size=4 name=kvs_2287808_0_956797288_system76-pc: success
boot_pmi: simple: get key=flux.taskmap: PMI not initialized
boot_pmi: simple: get key=flux.instance-level: PMI not initialized
boot_pmi: simple: get key=PMI_process_mapping value=(vector,(0,1,4)) found=TRUE: success
boot_pmi: simple: get key=flux.taskmap: PMI not initialized
boot_pmi: simple: get key=PMI_process_mapping value=(vector,(0,1,4)) found=TRUE: success
boot_pmi: simple: put key=3 value={"hostname":"system76-pc","pubkey":"2VsrBZw6pOz1e=j7AC**pbjZ3KYS4RY%(Av*/(^0","uri":""}: success
boot_pmi: simple: put key=0 value={"hostname":"system76-pc","pubkey":"Y!i!65:M:E0^:wAY4^A)I$HCLw%48m=t?.dvycpC","uri":"ipc:///tmp/flux-gHXYn3/tbon-0"}: success
boot_pmi: simple: barrier: success
boot_pmi: simple: barrier: success
boot_pmi: simple: get key=1 value={"hostname":"system76-pc","pubkey":"4c>8I1/IV+O2[4uQO0APVB-P[2%!PiV-d#]h06W1","uri":""} found=TRUE: success
flux-broker: error decoding rank 1 business card
boot_pmi: simple: barrier: success
boot_pmi: simple: barrier: success
flux-broker: bootstrap failed
boot_pmi: simple: get key=0 value={"hostname":"system76-pc","pubkey":"Y!i!65:M:E0^:wAY4^A)I$HCLw%48m=t?.dvycpC","uri":"ipc:///tmp/flux-gHXYn3/tbon-0"} found=TRUE: success
flux-broker: error decoding rank 0 business card
flux-broker: bootstrap failed

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 2287812 RUNNING AT system76-pc
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

I suspected the business card JSON object was getting corrupted when passed through hydra's PMI, but it appears intact in the logs. E.g., for rank 0:

in

{"hostname":"system76-pc","pubkey":"Y!i!65:M:E0^:wAY4^A)I$HCLw%48m=t?.dvycpC","uri":"ipc:///tmp/flux-gHXYn3/tbon-0"}

out

{"hostname":"system76-pc","pubkey":"Y!i!65:M:E0^:wAY4^A)I$HCLw%48m=t?.dvycpC","uri":"ipc:///tmp/flux-gHXYn3/tbon-0"}

More debugging needed.

garlick commented 2 days ago

The following change gives us more information:

diff --git a/src/broker/boot_pmi.c b/src/broker/boot_pmi.c
index 3f55231ec..5c2df1e81 100644
--- a/src/broker/boot_pmi.c
+++ b/src/broker/boot_pmi.c
@@ -222,6 +222,7 @@ int boot_pmi (struct overlay *overlay, attr_t *attrs)
     int i;
     int upmi_flags = UPMI_LIBPMI_NOFLUX;
     const char *upmi_method;
+    json_error_t jerror;

     // N.B. overlay_create() sets the tbon.topo attribute
     if (attr_get (attrs, "tbon.topo", &topo_uri, NULL) < 0) {
@@ -351,12 +352,21 @@ int boot_pmi (struct overlay *overlay, attr_t *attrs)
             log_msg ("%s: get %s: %s", upmi_describe (upmi), key, error.text);
             goto error;
         }
-        if (!(o = json_loads (val, 0, NULL))
-            || json_unpack (o,
-                            "{s:s s:s}",
-                            "pubkey", &peer_pubkey,
-                            "uri", &peer_uri) < 0
-            || strlen (peer_uri) == 0) {
+        if (!(o = json_loads (val, 0, &jerror))
+            || json_unpack_ex (o,
+                               &jerror,
+                               0,
+                               "{s:s s:s}",
+                               "pubkey", &peer_pubkey,
+                               "uri", &peer_uri) < 0) {
+            log_msg ("error decoding rank %d business card: %s",
+                     parent_rank,
+                     jerror.text);
+            json_decref (o);
+            free (val);
+            goto error;
+        }
+        if (strlen (peer_uri) == 0) {
             log_msg ("error decoding rank %d business card", parent_rank);
             json_decref (o);
             free (val);
With that patch applied, the failing run shows:

boot_pmi: simple: get key=0 value={"hostname":"system76-pc","pubkey":"zb.H!d6x]Hy98k8i6(LwCK9fiMUHFFawgscu7E{g","uri":"ipc:///tmp/flux-U8kvW7/tbon-0"} found=TRUE: success
flux-broker: error decoding rank 0 business card: end of file expected near 'found'

Ah, found=TRUE is being appended to the value? This seems like a bug in mpich.
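For anyone following along, the jansson behavior is easy to reproduce in isolation. A standalone sketch (the string below is a stand-in for the real business card, not flux code):

#include <jansson.h>
#include <stdio.h>

int main (void)
{
    /* stand-in for a business card with hydra's trailing token */
    const char *val = "{\"hostname\":\"h\",\"pubkey\":\"k\",\"uri\":\"\"}"
                      " found=TRUE";
    json_error_t error;
    json_t *o = json_loads (val, 0, &error);

    if (!o)
        /* prints: end of file expected near 'found' */
        printf ("%s\n", error.text);
    else
        json_decref (o);
    return 0;
}

By default jansson requires the whole input to be one valid JSON text, so the trailing token is enough to make the parse fail with exactly the message seen above.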

A quick workaround in flux that fixes this for me is:

diff --git a/src/broker/boot_pmi.c b/src/broker/boot_pmi.c
index 3f55231ec..51e7c7234 100644
--- a/src/broker/boot_pmi.c
+++ b/src/broker/boot_pmi.c
@@ -351,7 +351,7 @@ int boot_pmi (struct overlay *overlay, attr_t *attrs)
             log_msg ("%s: get %s: %s", upmi_describe (upmi), key, error.text);
             goto error;
         }
-        if (!(o = json_loads (val, 0, NULL))
+        if (!(o = json_loads (val, JSON_DISABLE_EOF_CHECK, NULL))
             || json_unpack (o,
                             "{s:s s:s}",
                             "pubkey", &peer_pubkey,
@@ -392,7 +392,7 @@ int boot_pmi (struct overlay *overlay, attr_t *attrs)
             log_msg ("%s: get %s: %s", upmi_describe (upmi), key, error.text);
             goto error;
         }
-        if (!(o = json_loads (val, 0, NULL))
+        if (!(o = json_loads (val, JSON_DISABLE_EOF_CHECK, NULL))
             || json_unpack (o, "{s:s}", "pubkey", &peer_pubkey) < 0) {
             log_msg ("error decoding rank %d business card", child_rank);
             json_decref (o);
@@ -423,7 +423,7 @@ int boot_pmi (struct overlay *overlay, attr_t *attrs)
             log_msg ("%s: get %s: %s", upmi_describe (upmi), key, error.text);
             goto error;
         }
-        if (!(o = json_loads (val, 0, NULL))
+        if (!(o = json_loads (val, JSON_DISABLE_EOF_CHECK, NULL))
             || json_unpack (o, "{s:s}", "hostname", &peer_hostname) < 0) {
             log_msg ("error decoding rank %d pmi business card", i);
             json_decref (o);

but we'll need to work out whether this was an intentional protocol change, a weakness in our client, or something else.
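One tradeoff worth noting: JSON_DISABLE_EOF_CHECK accepts any trailing bytes, not just the stray found=TRUE token, so genuine corruption after the object would also go unnoticed. A stricter alternative would bound the parse at the object's closing brace and surface whatever trails it. A rough sketch only, not proposed flux code (parse_value is a hypothetical helper and assumes the value contains a single JSON object):

#include <jansson.h>
#include <string.h>

/* Hypothetical helper: parse only up to the last '}' and hand any
 * trailing bytes (e.g. " found=TRUE") back to the caller for logging
 * or validation.  The last '}' is the object's closing brace because
 * the trailing token contains no braces, even though Z85 pubkeys may. */
static json_t *parse_value (const char *val,
                            const char **trailing,
                            json_error_t *error)
{
    const char *end = strrchr (val, '}');
    if (!end)
        return NULL;
    *trailing = end + 1;
    return json_loadb (val, end - val + 1, 0, error);
}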