flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

future fulfilled with unspecified error when broker exits #5812

Open grondo opened 6 months ago

grondo commented 6 months ago

When a process connected to a broker is waiting for a future to be fulfilled and the broker exits, flux_future_get() fails with -1, but errno is either unset or invalid, which produces confusing error messages.

Here's an example with flux-dmesg(1):

$ src/cmd/flux start -s 4 --test-exit-mode=leader --test-pmi-clique=per-broker -o -Stbon.topo=kary:0 bash -c '(FLUX_URI=$(flux exec -r3 flux getattr local-uri) flux dmesg -HLnf &) && sleep 1 && flux overlay disconnect 3 && wait'
flux-overlay: asking corona212 (rank 0) to disconnect child corona212 (rank 3)
Mar 21 08:21:33.904967 broker.err[0]: corona212 (rank 3) transitioning to LOST due to administrative action
Mar 21 08:21:33.905168 broker.crit[3]: corona212 (rank 0) sent disconnect control message
[Mar21 08:21] broker[3]: corona212 (rank 0) sent disconnect control message
[  +0.000170] broker[3]: shutdown: run->cleanup 1.03355s
[  +0.000205] broker[3]: cleanup-none: cleanup->shutdown 0.02316ms
[  +0.000238] broker[3]: children-none: shutdown->finalize 0.023781ms
[  +0.001014] broker[3]: state-machine.monitor: No route to host
[  +0.084695] broker[3]: rmmod resource
[  +0.085065] broker[3]: module resource exited
[  +0.136105] broker[3]: rmmod job-info
[  +0.136219] broker[3]: module job-info exited
[  +0.185526] broker[3]: rmmod job-ingest
[  +0.185691] broker[3]: module job-ingest exited
[  +0.331674] broker[3]: rmmod barrier
[  +0.331807] broker[3]: module barrier exited
[  +0.380347] broker[3]: rmmod kvs-watch
[  +0.380474] broker[3]: module kvs-watch exited
[  +0.431345] broker[3]: rmmod kvs
[  +0.431458] broker[3]: module kvs exited
[  +0.543153] broker[3]: rmmod content
[  +0.543301] broker[3]: module content exited
[  +0.544237] broker[3]: rc3.0: /g/g0/grondo/git/f.git/etc/rc3 Exited (rc=0) 0.5s
[  +0.544330] broker[3]: rc3-success: finalize->goodbye 0.544083s
[  +0.544404] broker[3]: goodbye: goodbye->exit 0.066921ms
flux-dmesg: log.dmesg: Success
flux-start: 3 (pid 796783) exited with rc=1

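Note the final "flux-dmesg: log.dmesg: Success" line: the call failed, but errno was 0 by the time the message was formatted. Possibly an ECONNRESET or other more useful errno is being overwritten somewhere in the error handling here.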
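To illustrate why the message ends in "Success", here is a minimal standalone sketch that does not use libflux. All names in it (fail_without_errno, effective_errno, format_error) are hypothetical and are not part of flux-core. It assumes the failing call returns -1 while errno is still 0, so strerror() yields the platform's "no error" string ("Success" on glibc). It also shows one possible caller-side guard that substitutes a fallback errno. This is only an illustration, not a proposed fix.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a call that fails without setting errno,
 * which is the behavior described in this issue. */
int fail_without_errno (void)
{
    return -1;
}

/* Hypothetical helper: pick the errno to report for a call's result.
 * If the call failed but errno was left at 0, use the fallback so the
 * message does not read "Success". */
int effective_errno (int rc, int saved_errno, int fallback)
{
    if (rc < 0 && saved_errno == 0)
        return fallback;
    return saved_errno;
}

/* Format "prefix: <strerror text>" into buf, the way a typical
 * error reporter would. */
void format_error (char *buf, size_t len, const char *prefix, int errnum)
{
    assert (buf != NULL && len > 0);
    snprintf (buf, len, "%s: %s", prefix, strerror (errnum));
}
```

Calling format_error (buf, sizeof (buf), "log.dmesg", 0) reproduces the shape of the confusing message, while passing effective_errno (rc, errno, ECONNRESET) reports a connection reset instead. The choice of ECONNRESET as the fallback is an assumption taken from the guess above.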

grondo commented 6 months ago

Using strace shows that poll() is returning POLLHUP in revents when the connected broker exits.

Some printf debugging shows that the local connector does promote a POLLHUP event to POLLERR, which in turn causes flux_pollevents() to call comms_error (h, ECONNRESET). However, I wasn't sure how this should be promoted to an error that filters up into flux_recv() etc.
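The poll() behavior can be reproduced outside of Flux with a small sketch. It assumes a Unix-domain stream socket pair stands in for the client/broker connection. The helper names (revents_to_errno, poll_after_peer_close) are hypothetical. The POLLHUP to ECONNRESET mapping mirrors what is described above, but it is not the local connector's actual code. On Linux, closing one end should cause poll() on the other end to report POLLHUP; other platforms may differ.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical mapping from poll(2) revents to an errno value.
 * A hangup is treated like an error, similar to the POLLHUP ->
 * POLLERR promotion described in this issue. */
int revents_to_errno (short revents)
{
    if (revents & (POLLHUP | POLLERR))
        return ECONNRESET;
    if (revents & POLLNVAL)
        return EBADF;
    return 0;
}

/* Simulate the broker exiting: close one end of a socket pair, then
 * poll the other end. Returns the mapped errno value, 0 if nothing
 * was reported before the timeout, or -1 if setup or poll() failed. */
int poll_after_peer_close (void)
{
    int fds[2];
    struct pollfd pfd;
    int rc;

    if (socketpair (AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;
    close (fds[1]); /* the "broker" end goes away */

    pfd.fd = fds[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    rc = poll (&pfd, 1, 1000); /* wait at most one second */
    close (fds[0]);
    if (rc < 0)
        return -1;
    /* A ready descriptor always carries at least one revents bit. */
    assert (rc == 0 || pfd.revents != 0);
    return revents_to_errno (pfd.revents);
}
```

The remaining question from the comment above is unchanged by this sketch: once the hangup has been detected and mapped to an errno, that value still has to be preserved until the receive path returns to the caller.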