barrier: need a way for a process to cancel a barrier

grondo commented 5 years ago

During startup flux-shell executes a flux_barrier to ensure all shells have successfully initialized. Once the barrier is complete, the shells go on to start local processes.

If a fatal error occurs before the initial barrier, one or more shells may exit, leaving the other shells blocked in the barrier, unaware that the job has failed until the job-exec module kills them.

It would be better if, on a fatal error during init, the shell could call the equivalent of flux_barrier() with an error. That call would make the barrier fail immediately for all processes currently waiting, and any new barrier.enter RPCs would also get an immediate error (up until nprocs processes have all registered).

This would allow the job to exit more or less immediately on early fatal errors instead of waiting for the job-exec kill timeout. (Of course, if the fatal error prevents the dying shell from calling flux_barrier, we fall back to the slow, safe method of job cleanup.)

grondo commented 5 years ago

Actually, just after I typed that, I realized that a better solution for the shell might be to wait for completion of the flux_barrier() asynchronously in shell_barrier (i.e. call flux_reactor_run()) instead of blocking in a synchronous flux_future_get().

The shells could then watch the job eventlog during this time and abort the barrier wait on any job exception.
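To make the control flow concrete, here is a rough, dependency-free sketch of that wait loop. It does not use the flux API at all: the event enum, the result enum, and shell_barrier_wait_model() are hypothetical names invented for illustration, and the event array simply stands in for whatever the reactor would deliver (barrier future fulfilled, eventlog entry seen).

```c
#include <stddef.h>

/* Hypothetical event kinds a shell might see while in its reactor loop. */
enum shell_event {
    EV_OTHER,             /* unrelated eventlog entry, keep waiting */
    EV_BARRIER_COMPLETE,  /* the barrier future was fulfilled */
    EV_JOB_EXCEPTION      /* an exception event appeared in the job eventlog */
};

enum wait_result {
    WAIT_OK,       /* barrier completed, go on to start local processes */
    WAIT_ABORTED,  /* job exception seen first, give up on the barrier */
    WAIT_PENDING   /* ran out of events, still waiting */
};

/* Model of the asynchronous wait: consume events in arrival order and
 * stop at the first one that decides the outcome.  *consumed is set to
 * the number of events processed. */
enum wait_result shell_barrier_wait_model(const enum shell_event *events,
                                          size_t n, size_t *consumed)
{
    for (size_t i = 0; i < n; i++) {
        switch (events[i]) {
        case EV_BARRIER_COMPLETE:
            *consumed = i + 1;
            return WAIT_OK;
        case EV_JOB_EXCEPTION:
            *consumed = i + 1;
            return WAIT_ABORTED;
        default:
            break; /* ignore and keep waiting */
        }
    }
    *consumed = n;
    return WAIT_PENDING;
}
```

In a real shell the two deciding cases would be continuation callbacks that stop the reactor; the point of the model is only that whichever arrives first wins, so a shell never sits in the barrier after the job has already failed.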

Down the road, more of the shell initialization could be made asynchronous and the start of the initialization barrier could be pushed off into this early call to flux_reactor_run.

garlick commented 5 years ago

This is a general problem with the current barrier code, and should be addressed IMHO.

We could add an interface something like:

flux_future_t *flux_barrier_abort (flux_t *h, const char *name, int nprocs, const char *errstr);

Pending and future matching barrier.enter calls would receive an immediate error response containing the errstr. Upon receiving nprocs matching calls, the state can be discarded.
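The bookkeeping this implies can be sketched with a small standalone model. This is not the barrier module's code: struct barrier_model and its functions are hypothetical, and the sketch assumes the abort call itself counts as one of the nprocs matching calls, which the proposal does not spell out.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-barrier state, for illustration only. */
struct barrier_model {
    int nprocs;       /* participants expected */
    int seen;         /* enter + abort calls received so far */
    int waiting;      /* enter calls currently blocked */
    bool aborted;     /* set once any participant aborts */
    char errstr[128]; /* error text returned to failed callers */
};

enum enter_result { ENTER_BLOCKED, ENTER_RELEASED, ENTER_ERROR };

void barrier_model_init(struct barrier_model *b, int nprocs)
{
    memset(b, 0, sizeof(*b));
    b->nprocs = nprocs;
}

/* True once nprocs calls have been received and the state can be discarded. */
bool barrier_model_done(const struct barrier_model *b)
{
    return b->seen >= b->nprocs;
}

/* Handle one barrier.enter call. */
enum enter_result barrier_model_enter(struct barrier_model *b)
{
    b->seen++;
    if (b->aborted)
        return ENTER_ERROR;      /* immediate error after an abort */
    b->waiting++;
    if (b->seen == b->nprocs) {
        b->waiting = 0;          /* last arrival releases everyone */
        return ENTER_RELEASED;
    }
    return ENTER_BLOCKED;
}

/* Handle an abort: fail all pending waiters and remember errstr.
 * Returns the number of waiters that received the error.
 * Assumption: the aborting process counts toward nprocs. */
int barrier_model_abort(struct barrier_model *b, const char *errstr)
{
    int failed = b->waiting;
    b->seen++;
    b->aborted = true;
    snprintf(b->errstr, sizeof(b->errstr), "%s", errstr);
    b->waiting = 0;
    return failed;
}
```

With nprocs = 3, one enter followed by an abort fails the single waiter, and the third call gets ENTER_ERROR and brings the count to nprocs, at which point barrier_model_done() reports that the entry can be dropped.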

It's probably not a huge deal if unfinished barriers pile up in the barrier module, since they use so little state and hopefully it's not a common case, but we could eventually add some sort of garbage collection.


flux-framework / flux-core

barrier: need a way for a process to cancel a barrier #2487