jameshcorbett opened 4 months ago
"reducing some of the pressure on k8s is one of the things the HPE team is trying to accomplish by having Flux start and stop the daemons"
We want to stop the daemons while the compute job is running, to avoid introducing jitter on the compute node.
Thoughts @grondo ? Would there be performance issues from having every node read the eventlog? (I hope not, because I think some coral2 plugins already do this elsewhere.)
We should get more opinions, but I don't think the performance impact of `wait-event` should be too bad. It would be easy enough to test a worst case scenario.
IIRC, each broker has its own kvs-watch module and local KVS cache, so I think the load will be mostly distributed...
@chu11 @garlick - any concerns?
I think that should be OK! A quick test might be a good idea though. We've been surprised on el cap before :-)
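One cheap way to approximate the worst case is to have every broker rank watch the same eventlog at once with `flux exec` (a sketch only; the throwaway sleep job, event name, and timeout are placeholders):

```sh
# Submit a throwaway job, then have every broker rank wait on one of its
# eventlog events simultaneously, which is roughly what the prolog change
# would do on every compute node.
JOBID=$(flux submit sleep 60)
time flux exec -r all flux job wait-event -t 60 "$JOBID" start
```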
I noticed that if a job is canceled during a prolog, the epilog does not run. This can mean that the nnf daemons are never started again. So the next rabbit job that comes in will end up stuck: it will wait for the file systems to mount, but the daemons that are supposed to handle the mounting are not running.
One solution would be to change flux-core to always run the epilog. See https://github.com/flux-framework/flux-core/issues/6055
Another would be to change the rabbit prolog above to start the daemons just in case, something like
```
# just in case they aren't already running
systemctl start nnf-clientmount
if job_uses_rabbits:
    flux job wait-event dws_environment
    systemctl stop nnf-clientmount
else:
    systemctl stop nnf-clientmount
```
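A more concrete shell rendering of that sketch, assuming the prolog environment provides `FLUX_JOB_ID` and using a grep of the jobspec for DW directives as a stand-in for `job_uses_rabbits` (both of those are assumptions, not the actual production script):

```sh
#!/bin/sh
# Prolog sketch: make sure the daemons are up, wait for the rabbit mounts
# to finish on jobs that use them, then stop the daemons for the job's
# duration.

# just in case they aren't already running
systemctl start nnf-clientmount

# stand-in check for "job uses rabbits": look for DW directives in the jobspec
if flux job info "$FLUX_JOB_ID" jobspec | grep -q '"dw"'; then
    # wait until the file systems are mounted (dws_environment is posted)
    flux job wait-event "$FLUX_JOB_ID" dws_environment
fi
systemctl stop nnf-clientmount
```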
This is now in place and working with flux-core <= 0.63.0. However, the new housekeeping service in 0.64.0 breaks the epilog.

The problem is that the housekeeping service runs after the dws jobtap epilog action, not at the same time as the old epilog infrastructure did. Since the dws jobtap action cannot complete until the `start_nnf_services` housekeeping script runs (which starts the `nnf-clientmount` service that the dws-epilog action indirectly depends on), this leads to job deadlock, with housekeeping waiting on the dws-epilog and vice versa.
@garlick do you see any ways around this? Is there a way we can use the old epilog infrastructure for this one script?
We could configure epilog as it was before to run just that script.
Excellent, any pointers on how I could configure that? Add the `job-manager.epilog` section back to `job-manager.toml`, and add a new directory somewhere with just that script and point the epilog at it?
Yes. The other component is the IMP `[run.epilog]` table that points to the `/etc/flux/system/epilog` script (still in place I think).
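For reference, a couple of quick checks to confirm both pieces are present (a sketch; the IMP config path shown is the usual default and may differ by site):

```sh
# Is the epilog table back in the job-manager config?
flux config get job-manager.epilog

# Does the IMP config still carry a [run.epilog] table pointing at the script?
grep -r -A3 'run.epilog' /etc/flux/imp/conf.d/
ls -l /etc/flux/system/epilog
```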
The epilog/housekeeping scripts in `/etc/flux/system` are a little confusing on our systems because `housekeeping` runs `epilog.real`, as does `epilog`, which is currently unused. `epilog` should be modified to run a new directory of scripts. However, that's going to make things really confusing since we'll have two directories with `epilog` in the name. This might need a little thought by the sys admins about how they want to organize things.
Flux expects that the entry points for prolog, epilog, and housekeeping are the scripts of the same name in `/etc/flux/system`. I would suggest not changing that. If we enable running prolog and epilog under systemd, the unit scripts expect those paths.
The epilog I want to run is just

```sh
systemctl start nnf-clientmount
systemctl stop nnf-dm
```

If `/etc/flux/system/epilog` is currently unused, maybe I could just change the contents to those two lines? Rather than adding a new directory.
James, the other way around (minor nit, to reduce the apiserver load before adding more):

```sh
systemctl stop nnf-dm
systemctl start nnf-clientmount
```

And at the prolog, stop one before starting the other as well.
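In file form, the epilog under discussion would be roughly the following (a sketch only; the prolog side is assumed to mirror it, stopping one unit before starting the other as suggested above):

```sh
#!/bin/sh
# /etc/flux/system/epilog -- the two-line script discussed above.
# Stop one daemon before starting the other so that both are never
# adding load on the k8s apiserver at the same time.
systemctl stop nnf-dm
systemctl start nnf-clientmount
```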
Good point, will fix.
Works for me!
`epilog` is controlled by the sys admins + ansible on our systems so check in with them. Be careful not to activate the epilog in the `job-manager` config before the script's contents are updated though, or we'll have all that slow gunk running twice, and nobody wants that.
This is now configured in ansible on all the rabbit systems and seems to be working.
We had a chat about some aspects of this after the meeting today, and I was wondering why nnf-clientmount needs to be started at the end of a job like this. Can it not be socket-triggered in the systemd unit or otherwise triggered when needed? I'm trying to figure out that and why these need to happen before the housekeeping phase.
At the end of a job that uses the rabbits, nnf-clientmount needs to run to unmount the rabbit file systems. That unmounting needs to happen before the housekeeping phase because there is currently a jobtap epilog added by a jobtap plugin in this repo that is only released when all the rabbit resources have been cleaned up, which includes having the compute nodes unmounted. Housekeeping runs after the epilog completes, which would be too late.
I would be very happy to work to trigger it another way but it needs to be triggered by the `finish` event. How does socket-triggering of systemd units work?
It's one of several trigger methods, but socket triggering is usually used for things like ssh or other servers where you want the daemon to be started when a client connects to a specific port. The thought was that if this is a service that gets a connection from dws, or from somewhere, when it needs to perform an action then we could set it up so it gets launched as a direct result of that connection being made, then shut it down after it's done.
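One way to see that behavior without writing unit files is systemd's helper tool, which holds the listening socket itself and only launches the command when a client connects (a sketch; `cat` and the port are arbitrary):

```sh
# Hold a listening socket on port 2000; nothing is started yet.
systemd-socket-activate -l 2000 --inetd -a cat &

# The first connection triggers an instance of the command.
echo hello | nc localhost 2000
```

In unit-file terms the same idea is a `.socket` unit with `ListenStream=` plus a matching `.service` that systemd starts on demand; whether nnf-clientmount could be driven that way depends on how the daemon is written.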
There are clientmount daemons running on every compute node to handle the mounting and unmounting of rabbit file systems. The daemons produce noise, and there have been some investigations lately into how to reduce it. In theory the daemons only need to be running when there are file systems to mount or unmount, at the beginning and end of jobs.
The HPE rabbit team would like Flux to start the daemons when a job finishes and stop them right before the job starts to run, so that the daemons are never running on a node at the same time as a job. The daemons are to be stopped by executing a `systemctl stop` and started with a `systemctl start`.

The logic to start the daemons could easily go in the administrative epilog. However, the daemons must not be stopped until they have finished mounting their file systems, which will happen some time after the RUN state is reached, so the command to stop them cannot be issued arbitrarily by the administrative prolog. The daemons are only guaranteed to have finished their work when the job's k8s Workflow resource goes to `PreRun: Ready: True`. That corresponds to the `dws-prolog` jobtap prolog action completing and the `dws_environment` event being posted to the job's eventlog.

One solution would be to add a final bit of logic to the administrative prolog, something like
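A sketch of that logic (where `job_uses_rabbits` is pseudocode for an actual check of whether the job requested rabbit storage):

```
if job_uses_rabbits:
    flux job wait-event dws_environment
    systemctl stop nnf-clientmount
else:
    systemctl stop nnf-clientmount
```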
(since if the job doesn't use rabbits, the `dws_environment` event will not be posted.)

Thoughts @grondo ? Would there be performance issues from having every node read the eventlog? (I hope not, because I think some coral2 plugins already do this elsewhere.)

In any event, we cannot check the k8s Workflow resource from the administrative prolog, because reducing some of the pressure on k8s is one of the things the HPE team is trying to accomplish by having Flux start and stop the daemons.