rabbit interface: add `--wait-event=clean` to `flux run` and `attach` by default

jameshcorbett commented 4 months ago

Problem: rabbit jobs are not complete, and users cannot look for their data, until the dws-jobtap epilog has completed and the `clean` event has been logged. However, `flux job attach`, and therefore `flux run` (and perhaps other tools, like `flux alloc`), only wait for `finish`. This has already confused and frustrated users. It would be nice to have some way to configure rabbit clusters or rabbit jobs to wait for `clean` by default.
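To make the gap concrete, here is a minimal sketch in plain Python (it does not use the Flux API). The event sequence is an abbreviated, illustrative eventlog whose ordering is an assumption, and `events_seen_before_return` is a hypothetical helper that only models a watcher stopping at a named event.

```python
# Illustrative, abbreviated eventlog for a job with an epilog.
# The exact set and order of events here is an assumption.
EVENTLOG = ["submit", "alloc", "start", "finish",
            "epilog-start", "epilog-finish", "free", "clean"]


def events_seen_before_return(eventlog, wait_event):
    """Return the events a watcher observes up to and including wait_event."""
    seen = []
    for name in eventlog:
        seen.append(name)
        if name == wait_event:
            return seen
    raise ValueError(f"event {wait_event!r} was never posted")


# A watcher that stops at finish returns before the epilog has completed;
# one that stops at clean does not.
print("epilog-finish" in events_seen_before_return(EVENTLOG, "finish"))
print("epilog-finish" in events_seen_before_return(EVENTLOG, "clean"))
```

In this model the first watcher returns while data movement may still be in progress, which is the behavior users are tripping over.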

Any thoughts @grondo and @garlick ?

grondo commented 4 months ago

Good question. Just thinking out loud here, but configuration could get confusing, since it may or may not apply to subinstances on the same system and need not apply to all jobs either (e.g. jobs that are not requesting dws resources). Maybe instead we need a way to either tell `flux job attach` that a `finish` event is not sufficient to indicate that all job "tasks" have finished (e.g. supply an alternate event to block until), or tell the job manager to delay posting the `finish` event (e.g. with a reference count like we have for the prolog and epilog).

I guess one benefit of the second approach is that it could allow dws to modify the finish status field for a job before the `finish` event is posted. However, this would need some thought because `finish` is a first-class event in the state diagram, and it triggers the epilogs, so it is not going to be simple.
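A rough sketch of the reference-count idea, in plain Python rather than job manager code. `FinishGate` and its methods are hypothetical names, and the sketch deliberately ignores the complication that `finish` also triggers the epilogs.

```python
class FinishGate:
    """Hold back a pending finish event while holds are outstanding.

    Hypothetical model only; not how the job manager is implemented.
    """

    def __init__(self):
        self.holds = 0
        self.finish_pending = False
        self.posted = []

    def hold(self):
        # A subsystem (e.g. data movement) asks for finish to be delayed.
        self.holds += 1

    def release(self):
        if self.holds == 0:
            raise RuntimeError("release without matching hold")
        self.holds -= 1
        self._maybe_post()

    def tasks_exited(self):
        # The execution system reports that all tasks have exited.
        self.finish_pending = True
        self._maybe_post()

    def _maybe_post(self):
        # Post finish only once it is pending and nothing is holding it.
        if self.finish_pending and self.holds == 0:
            self.posted.append("finish")
            self.finish_pending = False


gate = FinishGate()
gate.hold()
gate.tasks_exited()
print(gate.posted)   # finish is still held back
gate.release()
print(gate.posted)   # finish has now been posted
```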

Probably the simplest approach for now would be to define a new event that hints to `flux job attach` that it should wait for a different event than `finish`, but I'm not sure if that's a bit too kludgy (@garlick?). Also, it does bother me that we have jobs that aren't really finished at `finish`.
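As a sketch of the hint-event idea, assuming a hypothetical event named `wait-hint` whose context names the event to block on (no such event exists in Flux today, and `event_to_wait_for` is an illustrative helper, not part of `flux job attach`):

```python
def event_to_wait_for(eventlog, default="finish"):
    """Pick the event a watcher should block on.

    eventlog is a list of dicts with "name" and an optional "context".
    "wait-hint" is a hypothetical event name used for illustration.
    """
    target = default
    for event in eventlog:
        if event["name"] == "wait-hint":
            target = event.get("context", {}).get("event", default)
    return target


plain_job = [{"name": "submit"}, {"name": "start"}]
rabbit_job = [
    {"name": "submit"},
    {"name": "wait-hint", "context": {"event": "clean"}},
    {"name": "start"},
]
print(event_to_wait_for(plain_job))
print(event_to_wait_for(rabbit_job))
```

Jobs that never post the hint keep the current behavior, so only jobs that request dws resources would be affected.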

jameshcorbett commented 4 months ago

Well, the job is complete in a way after finish (i.e. tasks have exited), it's just that data hasn't been moved from the rabbit out to the backing parallel file system.

garlick commented 4 months ago

Maybe a slight modification to @grondo's first idea would be to have an epilog-start event signal (via flag?) that finish should be delayed until the corresponding epilog-finish?

Edit: oh on reread, I guess I was just restating what you already said @grondo - sorry.

garlick commented 4 months ago

> Well, the job is complete in a way after finish (i.e. tasks have exited), it's just that data hasn't been moved from the rabbit out to the backing parallel file system.

Is that an argument for telling users to add `--wait-event=clean` manually when they want to wait for the data movement?

grondo commented 4 months ago

> Edit: oh on reread, I guess I was just restating what you already said @grondo - sorry.

Yeah, unfortunately the `finish` event triggers the epilog, so it's a chicken-and-egg problem.

grondo commented 4 months ago

> Well, the job is complete in a way after finish (i.e. tasks have exited), it's just that data hasn't been moved from the rabbit out to the backing parallel file system.

Yeah, it would be nice though if we could account for all "tasks" of a job: compute tasks, data movement, post processing, whatever else, better than we can now. The finish event notes when the job execution system part of the job is complete, but we should support other "parts", e.g. data movement in this case. Optional other subsystems should be able to tell the job manager and other users of the eventlog: hey, I'm doing something too, and then all watchers would know to wait for that thing in addition to the execution system finish notification.
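One way to picture this, as a plain Python sketch: subsystems announce their work with hypothetical `part-start` and `part-finish` events (these names, and `job_fully_done`, are made up for illustration), and a watcher treats the job as done only when `finish` has been posted and every announced part has completed.

```python
def job_fully_done(eventlog):
    """True once finish is posted and every announced part has completed.

    "part-start" and "part-finish" are hypothetical event names.
    """
    finished = False
    pending = set()
    for event in eventlog:
        name = event["name"]
        part = event.get("context", {}).get("part")
        if name == "finish":
            finished = True
        elif name == "part-start":
            pending.add(part)
        elif name == "part-finish":
            pending.discard(part)
    return finished and not pending


log = [
    {"name": "start"},
    {"name": "part-start", "context": {"part": "data-movement"}},
    {"name": "finish"},
]
print(job_fully_done(log))   # data movement is still outstanding

log.append({"name": "part-finish", "context": {"part": "data-movement"}})
print(job_fully_done(log))   # all parts accounted for
```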

jameshcorbett commented 4 months ago

> Well, the job is complete in a way after finish (i.e. tasks have exited), it's just that data hasn't been moved from the rabbit out to the backing parallel file system.

> Is that an argument for telling users to add `--wait-event=clean` manually when they want to wait for the data movement?

I think that might lead to a lot of angry users. Brian and Marty and some of the people at HPE (who also use the rabbits) expected `flux run` to wait by default until data movement was complete.

grondo commented 4 months ago

A side note is that the epilog is really supposed to be for system administration tasks, i.e. the user's part of the job is done and now the resources are being cleaned up to be returned to the system. It would be nice if data movement, or other work that subsystems do on behalf of the user, were a different operation than an epilog, or an epilog that didn't run after the `finish` event. (Sorry, this issue has turned into a rather broad topic.)

We should also check that job dependencies wait until the right state before releasing a dependent job. (I can't remember if `afterok:jobid` waits for the `finish` event or until the job is inactive.)

flux-framework / flux-coral2

rabbit interface: add `--wait-event=clean` to `flux run` and `attach` by default #137