Open grondo opened 5 years ago
Actually, just after I typed that, I realized that a better solution for the shell might be to wait for completion of the `flux_barrier()` asynchronously in `shell_barrier` instead of with a synchronous `flux_future_get()` (i.e. call `flux_reactor_run()`).
The shells could then watch the job eventlog during this time and abort the barrier wait on any job exception.
Down the road, more of the shell initialization could be made asynchronous and the start of the initialization barrier could be pushed off into this early call to `flux_reactor_run()`.
This is a general problem with the current barrier code, and should be addressed IMHO.
We could add an interface something like:
flux_future_t *flux_barrier_abort (flux_t *h, const char *name, int nprocs, const char *errstr);
Pending and future matching `barrier.enter` calls would receive an immediate error response containing the `errstr`. Upon receiving `nprocs` matching calls, the state can be discarded.
It's probably not a huge deal if unfinished barriers pile up in the barrier module, since they use so little state and hopefully it's not a common case, but we could eventually add some sort of garbage collection.
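To make the proposed semantics concrete, here is a rough standalone model of the per-barrier bookkeeping. It is not flux-core code: `struct barrier`, `barrier_init()`, `barrier_enter()` and this `barrier_abort()` are hypothetical names for illustration only, and it assumes the abort call itself counts as one of the `nprocs` matching calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of one named barrier; not the real module's types. */
enum barrier_result {
    BARRIER_PENDING,   /* caller waits for more matching calls */
    BARRIER_COMPLETE,  /* nprocs calls arrived, all waiters released */
    BARRIER_ABORTED    /* barrier was aborted, caller gets errstr */
};

struct barrier {
    char name[64];
    int nprocs;        /* matching calls expected */
    int count;         /* matching calls seen so far (enter or abort) */
    bool aborted;
    char errstr[128];  /* error text from the first abort */
    bool active;       /* false once the state could be discarded */
};

void barrier_init (struct barrier *b, const char *name, int nprocs)
{
    assert (nprocs > 0);
    memset (b, 0, sizeof (*b));
    snprintf (b->name, sizeof (b->name), "%s", name);
    b->nprocs = nprocs;
    b->active = true;
}

/* Count one matching call; state is discardable after nprocs of them. */
void barrier_account (struct barrier *b)
{
    if (++b->count == b->nprocs)
        b->active = false;
}

/* Model of barrier.enter: fails immediately if the barrier was aborted. */
enum barrier_result barrier_enter (struct barrier *b, const char **errstr)
{
    assert (b->active);
    bool aborted = b->aborted;
    barrier_account (b);
    if (aborted) {
        if (errstr)
            *errstr = b->errstr;
        return BARRIER_ABORTED;
    }
    return b->active ? BARRIER_PENDING : BARRIER_COMPLETE;
}

/* Model of the abort call. Returns the number of pending waiters that
 * would now receive an error response (0 if already aborted).
 * Assumption: the abort itself counts toward nprocs. */
int barrier_abort (struct barrier *b, const char *errstr)
{
    assert (b->active);
    int released = b->aborted ? 0 : b->count;
    if (!b->aborted) {
        b->aborted = true;
        snprintf (b->errstr, sizeof (b->errstr), "%s", errstr);
    }
    barrier_account (b);
    return released;
}
```

With `nprocs` set to 3, one `barrier_enter()` followed by `barrier_abort()` fails the single pending waiter, and the third call (a late `barrier_enter()`) gets the error immediately and marks the state as discardable. A barrier that never sees all `nprocs` calls stays active in this model, which is the leftover state that garbage collection would eventually need to reclaim.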
During startup, `flux-shell` executes a `flux_barrier` to ensure all shells have successfully initialized. Once the barrier is complete, the shells go on to start local processes.
In the case of a fatal error before the initial barrier, one or more shells may exit, leaving the other shells blocked in the barrier, unaware that the job has failed until the `job-exec` module kills them.
It would be better if, on fatal error during init, the shell could call the equivalent of `flux_barrier()` with an error, causing the barrier to exit immediately with an error for all current processes waiting and returning an error immediately for any new `barrier.enter` RPCs (up until `nprocs` processes have all registered).
This would allow the job to exit more or less immediately on early fatal errors, instead of waiting for the job-exec kill timeout. (Of course, if the fatal error makes it so that the dying shell can't call `flux_barrier`, then we fall back to the slow, safe method of job cleanup.)