Closed sam-maloney closed 1 day ago
Thanks! Anything interesting in the test output when run with -v
?
My successful test shows:
$ ./t2004-hydra.t -v
/usr/bin/mpiexec.hydra
expecting success:
mpiexec.hydra -n 4 ${PMI_INFO}
0: size=4 appnum=0 maxes=256:64:1024 kvsname=kvs_922436_0_254925627_picl0
1: size=4 appnum=0 maxes=256:64:1024 kvsname=kvs_922436_0_254925627_picl0
2: size=4 appnum=0 maxes=256:64:1024 kvsname=kvs_922436_0_254925627_picl0
3: size=4 appnum=0 maxes=256:64:1024 kvsname=kvs_922436_0_254925627_picl0
ok 1 - Flux libpmi-client wire protocol works with Hydra
expecting success:
mpiexec.hydra -n 4 flux start ${ARGS} flux getattr size
4
ok 2 - Hydra can launch Flux
# passed all 2 test(s)
1..2
I get
$ ./t2004-hydra.t -v
/usr/bin/mpiexec.hydra
expecting success:
mpiexec.hydra -n 4 ${PMI_INFO}
0: size=4 appnum=0 maxes=256:64:1024 kvsname=kvs_487144_0_1843522888_zam723
2: size=4 appnum=0 maxes=256:64:1024 kvsname=kvs_487144_0_1843522888_zam723
1: size=4 appnum=0 maxes=256:64:1024 kvsname=kvs_487144_0_1843522888_zam723
3: size=4 appnum=0 maxes=256:64:1024 kvsname=kvs_487144_0_1843522888_zam723
ok 1 - Flux libpmi-client wire protocol works with Hydra
expecting success:
mpiexec.hydra -n 4 flux start ${ARGS} flux getattr size
flux-broker: error decoding rank 1 business card
flux-broker: bootstrap failed
flux-broker: error decoding rank 0 business card
flux-broker: bootstrap failed
flux-broker: error decoding rank 0 business card
flux-broker: bootstrap failed
flux-broker: error decoding rank 0 business card
flux-broker: bootstrap failed
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 487153 RUNNING AT zam723
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
not ok 2 - Hydra can launch Flux
#
# mpiexec.hydra -n 4 flux start ${ARGS} flux getattr size
#
# failed 1 among 2 test(s)
1..2
Oh that is surprising. I built mpich-4.2.0 on my ubuntu 22.04 development system (no options passed to ./configure
) and reproduced it. Here is a run with a client PMI trace:
$ FLUX_PMI_DEBUG=1 /home/garlick/other/mpich/src/pm/hydra/mpiexec.hydra -n 4 src/cmd/flux start flux getattr size
boot_pmi: flux-pmi-client: trying 'simple'
boot_pmi: simple: selected
boot_pmi: flux-pmi-client: trying 'simple'
boot_pmi: simple: selected
boot_pmi: simple: initialize: rank=1 size=4 name=kvs_2287808_0_956797288_system76-pc: success
boot_pmi: simple: initialize: rank=2 size=4 name=kvs_2287808_0_956797288_system76-pc: success
boot_pmi: simple: get key=flux.instance-level: PMI not initialized
boot_pmi: simple: get key=flux.instance-level: PMI not initialized
boot_pmi: simple: get key=flux.taskmap: PMI not initialized
boot_pmi: simple: get key=flux.taskmap: PMI not initialized
boot_pmi: simple: get key=PMI_process_mapping value=(vector,(0,1,4)) found=TRUE: success
boot_pmi: simple: get key=PMI_process_mapping value=(vector,(0,1,4)) found=TRUE: success
boot_pmi: simple: put key=1 value={"hostname":"system76-pc","pubkey":"4c>8I1/IV+O2[4uQO0APVB-P[2%!PiV-d#]h06W1","uri":""}: success
boot_pmi: simple: put key=2 value={"hostname":"system76-pc","pubkey":"./:@UF8OPH^)1y.yekqlz>x#?M{w?p&OHRzn6i.t","uri":""}: success
boot_pmi: flux-pmi-client: trying 'simple'
boot_pmi: simple: selected
boot_pmi: flux-pmi-client: trying 'simple'
boot_pmi: simple: selected
boot_pmi: simple: initialize: rank=0 size=4 name=kvs_2287808_0_956797288_system76-pc: success
boot_pmi: simple: get key=flux.instance-level: PMI not initialized
boot_pmi: simple: initialize: rank=3 size=4 name=kvs_2287808_0_956797288_system76-pc: success
boot_pmi: simple: get key=flux.taskmap: PMI not initialized
boot_pmi: simple: get key=flux.instance-level: PMI not initialized
boot_pmi: simple: get key=PMI_process_mapping value=(vector,(0,1,4)) found=TRUE: success
boot_pmi: simple: get key=flux.taskmap: PMI not initialized
boot_pmi: simple: get key=PMI_process_mapping value=(vector,(0,1,4)) found=TRUE: success
boot_pmi: simple: put key=3 value={"hostname":"system76-pc","pubkey":"2VsrBZw6pOz1e=j7AC**pbjZ3KYS4RY%(Av*/(^0","uri":""}: success
boot_pmi: simple: put key=0 value={"hostname":"system76-pc","pubkey":"Y!i!65:M:E0^:wAY4^A)I$HCLw%48m=t?.dvycpC","uri":"ipc:///tmp/flux-gHXYn3/tbon-0"}: success
boot_pmi: simple: barrier: success
boot_pmi: simple: barrier: success
boot_pmi: simple: get key=1 value={"hostname":"system76-pc","pubkey":"4c>8I1/IV+O2[4uQO0APVB-P[2%!PiV-d#]h06W1","uri":""} found=TRUE: success
flux-broker: error decoding rank 1 business card
boot_pmi: simple: barrier: success
boot_pmi: simple: barrier: success
flux-broker: bootstrap failed
boot_pmi: simple: get key=0 value={"hostname":"system76-pc","pubkey":"Y!i!65:M:E0^:wAY4^A)I$HCLw%48m=t?.dvycpC","uri":"ipc:///tmp/flux-gHXYn3/tbon-0"} found=TRUE: success
flux-broker: error decoding rank 0 business card
flux-broker: bootstrap failed
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 2287812 RUNNING AT system76-pc
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
I suspected the business card JSON object was getting corrupted when passed through hydra's PMI but it appears intact from the logs. E.g. for rank 0:
in
{"hostname":"system76-pc","pubkey":"Y!i!65:M:E0^:wAY4^A)I$HCLw%48m=t?.dvycpC","uri":"ipc:///tmp/flux-gHXYn3/tbon-0"}
out
{"hostname":"system76-pc","pubkey":"Y!i!65:M:E0^:wAY4^A)I$HCLw%48m=t?.dvycpC","uri":"ipc:///tmp/flux-gHXYn3/tbon-0"}
more debugging needed.
The following change gives us more information
diff --git a/src/broker/boot_pmi.c b/src/broker/boot_pmi.c
index 3f55231ec..5c2df1e81 100644
--- a/src/broker/boot_pmi.c
+++ b/src/broker/boot_pmi.c
@@ -222,6 +222,7 @@ int boot_pmi (struct overlay *overlay, attr_t *attrs)
int i;
int upmi_flags = UPMI_LIBPMI_NOFLUX;
const char *upmi_method;
+ json_error_t jerror;
// N.B. overlay_create() sets the tbon.topo attribute
if (attr_get (attrs, "tbon.topo", &topo_uri, NULL) < 0) {
@@ -351,12 +352,21 @@ int boot_pmi (struct overlay *overlay, attr_t *attrs)
log_msg ("%s: get %s: %s", upmi_describe (upmi), key, error.text);
goto error;
}
- if (!(o = json_loads (val, 0, NULL))
- || json_unpack (o,
- "{s:s s:s}",
- "pubkey", &peer_pubkey,
- "uri", &peer_uri) < 0
- || strlen (peer_uri) == 0) {
+ if (!(o = json_loads (val, 0, &jerror))
+ || json_unpack_ex (o,
+ &jerror,
+ 0,
+ "{s:s s:s}",
+ "pubkey", &peer_pubkey,
+ "uri", &peer_uri) < 0) {
+ log_msg ("error decoding rank %d business card: %s",
+ parent_rank,
+ jerror.text);
+ json_decref (o);
+ free (val);
+ goto error;
+ }
+ if (strlen (peer_uri) == 0) {
log_msg ("error decoding rank %d business card", parent_rank);
json_decref (o);
free (val);
boot_pmi: simple: get key=0 value={"hostname":"system76-pc","pubkey":"zb.H!d6x]Hy98k8i6(LwCK9fiMUHFFawgscu7E{g","uri":"ipc:///tmp/flux-U8kvW7/tbon-0"} found=TRUE: success
flux-broker: error decoding rank 0 business card: end of file expected near 'found'
Ah, found=TRUE
is being appended to the value? This seems like a bug in mpich.
A quick workaround in flux that fixes this for me is:
diff --git a/src/broker/boot_pmi.c b/src/broker/boot_pmi.c
index 3f55231ec..51e7c7234 100644
--- a/src/broker/boot_pmi.c
+++ b/src/broker/boot_pmi.c
@@ -351,7 +351,7 @@ int boot_pmi (struct overlay *overlay, attr_t *attrs)
log_msg ("%s: get %s: %s", upmi_describe (upmi), key, error.text);
goto error;
}
- if (!(o = json_loads (val, 0, NULL))
+ if (!(o = json_loads (val, JSON_DISABLE_EOF_CHECK, NULL))
|| json_unpack (o,
"{s:s s:s}",
"pubkey", &peer_pubkey,
@@ -392,7 +392,7 @@ int boot_pmi (struct overlay *overlay, attr_t *attrs)
log_msg ("%s: get %s: %s", upmi_describe (upmi), key, error.text);
goto error;
}
- if (!(o = json_loads (val, 0, NULL))
+ if (!(o = json_loads (val, JSON_DISABLE_EOF_CHECK, NULL))
|| json_unpack (o, "{s:s}", "pubkey", &peer_pubkey) < 0) {
log_msg ("error decoding rank %d business card", child_rank);
json_decref (o);
@@ -423,7 +423,7 @@ int boot_pmi (struct overlay *overlay, attr_t *attrs)
log_msg ("%s: get %s: %s", upmi_describe (upmi), key, error.text);
goto error;
}
- if (!(o = json_loads (val, 0, NULL))
+ if (!(o = json_loads (val, JSON_DISABLE_EOF_CHECK, NULL))
|| json_unpack (o, "{s:s}", "hostname", &peer_hostname) < 0) {
log_msg ("error decoding rank %d pmi business card", i);
json_decref (o);
but we'll need to work out if this was an intentional protocol change or a weakness in our client or what.
Test t2004-hydra is consistently failing when I run
make check
after building. I am doing a standard build with no special configuration options, just./autogen.sh
then./configure
thenmake
It is occurring both when starting from a clone of the current GitHub as well as when downloading the latest (pre-)release tarball, currently v0.63.0 (without running
./autogen.sh
in this case).I am building on Ubuntu 24.04 using GCC 13.2.0 and mpiexec.hydra 4.2.0 as well as on a VM running Rocky9 using GCC 11.4.1 and mpiexec.hydra 4.1.1. I get the same failures on both, with the contents of the related test files are below:
contents of
t2004-hydra.log
contents of
t2004-hydra.trs
contents of
t2004-hydra.#######.counts
Happy to provide other information if it might be helpful.