jameshcorbett opened 4 months ago
Good question. Just thinking out loud here, but maybe instead of configuration, which could get confusing since it may or may not apply to subinstances on the same system and need not apply to all jobs either (e.g. jobs that are not requesting dws resources), we need a way to either tell `flux job attach` that a `finish` event is not sufficient to indicate that all job "tasks" have finished (e.g. supply an alternate event to block until), or have some way to tell the job manager to delay posting the `finish` event (e.g. a reference count like we have for the prolog and epilog).
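The reference-count idea could work much like the existing prolog/epilog reference counts. Here is a minimal Python sketch of that behavior; all names (`FinishEvent`, `incref`, etc.) are invented for illustration and are not the flux-core API:

```python
# Hypothetical sketch: the job manager holds the `finish` event until every
# subsystem that took a reference (e.g. dws data movement) releases it.

class FinishEvent:
    """Post `finish` only after tasks exit AND all references are released."""

    def __init__(self):
        self.refcount = 0
        self.tasks_exited = False
        self.posted = False

    def incref(self):
        # A subsystem declares it still has work tied to this job.
        self.refcount += 1

    def decref(self):
        self.refcount -= 1
        self._maybe_post()

    def on_tasks_exited(self):
        # The execution system reports that all job tasks have exited.
        self.tasks_exited = True
        self._maybe_post()

    def _maybe_post(self):
        if self.tasks_exited and self.refcount == 0 and not self.posted:
            self.posted = True  # the real job manager would post `finish` here

ev = FinishEvent()
ev.incref()           # data movement takes a reference before tasks exit
ev.on_tasks_exited()  # tasks are done, but `finish` is held back
assert not ev.posted
ev.decref()           # data movement completes; `finish` is posted
assert ev.posted
```

This also illustrates why the approach would let dws adjust the `finish` status field: nothing is written to the eventlog until the last reference is dropped.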
I guess one benefit of the second approach is that it could allow dws to modify the `finish` `status` field for a job before the `finish` event is posted. However, this would need some thought, because `finish` is a first-class event in the state diagram and it triggers the epilogs, so it is not going to be simple.

Probably defining a new event that gives `flux job attach` a hint that it should wait for a different event than `finish` would be the simplest approach for now, but I'm not sure if that's a bit too kludgy (@garlick?). Also, something does bother me about having jobs that aren't really finished at `finish`.
Well, the job is complete in a way after `finish` (i.e. tasks have exited), it's just that data hasn't been moved from the rabbit out to the backing parallel file system.
Maybe a slight modification to @grondo's first idea would be to have an `epilog-start` event signal (via a flag?) that `finish` should be delayed until the corresponding `epilog-finish`?
Edit: oh on reread, I guess I was just restating what you already said @grondo - sorry.
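The "hint event" approach mentioned above could be sketched as follows. This is a minimal Python illustration; the event name `delay-finish` and its context key are invented for the example (the real event name and plugin mechanics would need to be designed):

```python
# Hypothetical sketch: a plugin posts a hint event telling watchers which
# event to treat as terminal instead of `finish`.

def terminal_event(eventlog):
    """Return the event name a tool like `flux job attach` should wait for."""
    wait_for = "finish"  # default: the execution system's finish event
    for entry in eventlog:
        if entry["name"] == "delay-finish":       # invented hint event
            wait_for = entry["context"]["event"]  # e.g. "epilog-finish"
    return wait_for

log = [
    {"name": "start", "context": {}},
    {"name": "delay-finish", "context": {"event": "epilog-finish"}},
    {"name": "finish", "context": {"status": 0}},
]
assert terminal_event(log) == "epilog-finish"
assert terminal_event([{"name": "finish", "context": {"status": 0}}]) == "finish"
```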
> Well, the job is complete in a way after `finish` (i.e. tasks have exited), it's just that data hasn't been moved from the rabbit out to the backing parallel file system.
Is that an argument for telling users to add `--wait-event=clean` manually when they want to wait for the data movement?
Yeah, unfortunately the `finish` event triggers the epilog, so it's a chicken-and-egg problem.
> Well, the job is complete in a way after `finish` (i.e. tasks have exited), it's just that data hasn't been moved from the rabbit out to the backing parallel file system.
Yeah, it would be nice though if we could account for all "tasks" of a job (compute tasks, data movement, post-processing, whatever else) better than we can now. The `finish` event notes when the job execution system's part of the job is complete, but we should support other "parts", e.g. data movement in this case. Optional subsystems should be able to tell the job manager and other users of the eventlog: "hey, I'm doing something too", and then all watchers would know to wait for that thing in addition to the execution system's `finish` notification.
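The "parts" idea could be sketched as subsystems registering themselves in the eventlog, with watchers considering the job finished only when every registered part has completed. The event names here (`part-register`, `part-complete`) are invented for illustration:

```python
# Hypothetical sketch: watchers wait for every registered "part" of the job,
# not just the execution system's `finish` event.

def job_complete(eventlog):
    parts = {"exec"}  # the execution system is always one part of the job
    done = set()
    for e in eventlog:
        if e["name"] == "part-register":    # invented: a subsystem opts in
            parts.add(e["context"]["part"])
        elif e["name"] == "part-complete":  # invented: that subsystem is done
            done.add(e["context"]["part"])
        elif e["name"] == "finish":         # execution system is done
            done.add("exec")
    return parts <= done

log = [
    {"name": "part-register", "context": {"part": "data-movement"}},
    {"name": "finish", "context": {"status": 0}},
]
assert not job_complete(log)  # tasks exited, but data movement still pending
log.append({"name": "part-complete", "context": {"part": "data-movement"}})
assert job_complete(log)
```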
> Well, the job is complete in a way after `finish` (i.e. tasks have exited), it's just that data hasn't been moved from the rabbit out to the backing parallel file system.
>
> Is that an argument for telling users to add `--wait-event=clean` manually when they want to wait for the data movement?
I think that might lead to a lot of angry users. Brian and Marty and some of the people at HPE (who also use the rabbits) expected `flux run` by default to wait until data movement was complete.
A side note: the epilog is really supposed to be for system administration tasks, i.e. the user's part of the job is done and the resources are being cleaned up to be returned to the system. It would be nice if data movement and other subsystems operating on behalf of the user were a different operation than an epilog, or an epilog that didn't run after the `finish` event. (Sorry this issue got into a kind of broad topic.)
We should also check to ensure that job dependencies wait until the right state before releasing a dependent job. (I can't remember if `afterok:jobid` waits for the `finish` event or until the job is inactive.)
Problem: rabbit jobs are not complete, and users cannot look for their data, until after the dws-jobtap epilog has completed and the `clean` event has been logged, but `flux job attach` and therefore `flux run` (and perhaps other tools, like `flux alloc`) only wait for `finish`. This has already confused and frustrated users. Somehow it would be nice to configure rabbit clusters or rabbit jobs to wait for `clean` by default.

Any thoughts @grondo and @garlick ?