jameshcorbett opened 4 months ago
"reducing some of the pressure on k8s is one of the things the HPE team is trying to accomplish by having Flux start and stop the daemons"
We want to stop the daemons while the compute job is running, to avoid introducing jitter on the compute node.
Thoughts @grondo ? Would there be performance issues from having every node read the eventlog? (I hope not, because I think some coral2 plugins already do this elsewhere.)
We should get more opinions, but I don't think the performance impact of `wait-event` should be too bad. It would be easy enough to test a worst case scenario.
IIRC, each broker has its own kvs-watch module and local KVS cache, so I think the load will be mostly distributed...
@chu11 @garlick - any concerns?
I think that should be OK! A quick test might be a good idea though. We've been surprised on el cap before :-)
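One cheap way to approximate the worst case is to have every broker rank watch the same eventlog at once with `flux exec` (a sketch only; the throwaway sleep job, event name, and timeout are placeholders):

```sh
# Submit a throwaway job, then have every broker rank wait on one of its
# eventlog events simultaneously, which is roughly what the prolog change
# would do on every compute node.
JOBID=$(flux submit sleep 60)
time flux exec -r all flux job wait-event -t 60 "$JOBID" start
```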
I noticed that if a job is canceled during a prolog, the epilog does not run. This can mean that the nnf daemons are never started again. So the next rabbit job that comes in will end up stuck: it will wait for the file systems to mount, but the daemons that are supposed to handle the mounting are not running.
One solution would be to change flux-core to always run the epilog. See https://github.com/flux-framework/flux-core/issues/6055
Another would be to change the rabbit prolog above to start the daemons just in case, something like
```
# just in case they aren't already running
systemctl start nnf-clientmount
if job_uses_rabbits:
    flux job wait-event dws_environment
    systemctl stop nnf-clientmount
else:
    systemctl stop nnf-clientmount
```
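A more concrete shell rendering of that sketch, assuming the prolog environment provides `FLUX_JOB_ID` and using a grep of the jobspec for DW directives as a stand-in for `job_uses_rabbits` (both of those are assumptions, not the actual production script):

```sh
#!/bin/sh
# Prolog sketch: make sure the daemons are up, wait for the rabbit mounts
# to finish on jobs that use them, then stop the daemons for the job's
# duration.

# just in case they aren't already running
systemctl start nnf-clientmount

# stand-in check for "job uses rabbits": look for DW directives in the jobspec
if flux job info "$FLUX_JOB_ID" jobspec | grep -q '"dw"'; then
    # wait until the file systems are mounted (dws_environment is posted)
    flux job wait-event "$FLUX_JOB_ID" dws_environment
fi
systemctl stop nnf-clientmount
```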
This is now in place and working with flux-core <= 0.63.0. However, the new housekeeping service in 0.64.0 breaks the epilog.

The problem is that the housekeeping service runs after the dws jobtap epilog action, not at the same time as the old epilog infrastructure did. Since the dws jobtap action cannot complete until the `start_nnf_services` housekeeping script runs (which starts the `nnf-clientmount` service that the dws-epilog action indirectly depends on), this leads to job deadlock, with housekeeping waiting on the dws-epilog and vice versa.
@garlick do you see any ways around this? Is there a way we can use the old epilog infrastructure for this one script?
We could configure epilog as it was before to run just that script.
Excellent, any pointers on how I could configure that? Add the `job-manager.epilog` section back to `job-manager.toml`, and add a new directory somewhere with just that script and point the epilog at it?
Yes. The other component is the IMP `[run.epilog]` table that points to the `/etc/flux/system/epilog` script (still in place I think).
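For reference, a couple of quick checks to confirm both pieces are present (a sketch; the IMP config path shown is the usual default and may differ by site):

```sh
# Is the epilog table back in the job-manager config?
flux config get job-manager.epilog

# Does the IMP config still carry a [run.epilog] table pointing at the script?
grep -r -A3 'run.epilog' /etc/flux/imp/conf.d/
ls -l /etc/flux/system/epilog
```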
The epilog/housekeeping scripts in `/etc/flux/system` are a little confusing on our systems because `housekeeping` runs `epilog.real`, as does `epilog`, which is currently unused. `epilog` should be modified to run a new directory of scripts. However, that's going to make things really confusing since we'll have two directories with `epilog` in the name. This might need a little thought by the sys admins about how they want to organize things.
Flux expects that the entry points for prolog, epilog, and housekeeping are the scripts of the same name in `/etc/flux/system`. I would suggest not changing that. If we enable running prolog and epilog under systemd, the unit scripts expect those paths.
The epilog I want to run is just

```sh
systemctl start nnf-clientmount
systemctl stop nnf-dm
```

If `/etc/flux/system/epilog` is currently unused, maybe I could just change the contents to those two lines? Rather than adding a new directory.
James, the other way around (minor nit, to reduce the apiserver load before adding more):

```sh
systemctl stop nnf-dm
systemctl start nnf-clientmount
```

And at the prolog, stop one before starting the other as well.
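In file form, the epilog under discussion would be roughly the following (a sketch only; the prolog side is assumed to mirror it, stopping one unit before starting the other as suggested above):

```sh
#!/bin/sh
# /etc/flux/system/epilog -- the two-line script discussed above.
# Stop one daemon before starting the other so that both are never
# adding load on the k8s apiserver at the same time.
systemctl stop nnf-dm
systemctl start nnf-clientmount
```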
Good point, will fix.
Works for me!
`epilog` is controlled by the sys admins + ansible on our systems so check in with them. Be careful not to activate the epilog in the `job-manager` config before the script's contents are updated though, or we'll have all that slow gunk running twice, and nobody wants that.
This is now configured in ansible on all the rabbit systems and seems to be working.
We had a chat about some aspects of this after the meeting today, and I was wondering why nnf-clientmount needs to be started at the end of a job like this. Can it not be socket-triggered in the systemd unit or otherwise triggered when needed? I'm trying to figure out that and why these need to happen before the housekeeping phase.
At the end of a job that uses the rabbits, nnf-clientmount needs to run to unmount the rabbit file systems. That unmounting needs to happen before the housekeeping phase because there is currently a jobtap epilog added by a jobtap plugin in this repo that is only released when all the rabbit resources have been cleaned up, which includes having the compute nodes unmounted. Housekeeping runs after the epilog completes, which would be too late.
I would be very happy to work to trigger it another way but it needs to be triggered by the `finish` event. How does socket-triggering of systemd units work?
It's one of several trigger methods, but socket triggering is usually used for things like ssh or other servers where you want the daemon to be started when a client connects to a specific port. The thought was that if this is a service that gets a connection from dws, or from somewhere, when it needs to perform an action then we could set it up so it gets launched as a direct result of that connection being made, then shut it down after it's done.
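One way to see that behavior without writing unit files is systemd's helper tool, which holds the listening socket itself and only launches the command when a client connects (a sketch; `cat` and the port are arbitrary):

```sh
# Hold a listening socket on port 2000; nothing is started yet.
systemd-socket-activate -l 2000 --inetd -a cat &

# The first connection triggers an instance of the command.
echo hello | nc localhost 2000
```

In unit-file terms the same idea is a `.socket` unit with `ListenStream=` plus a matching `.service` that systemd starts on demand; whether nnf-clientmount could be driven that way depends on how the daemon is written.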
There are clientmount daemons running on every compute node to handle the mounting and unmounting of rabbit file systems. The daemons produce noise, and there have been some investigations lately into how to reduce it. In theory the daemons only need to be running when there are file systems to mount or unmount, at the beginning and end of jobs.
The HPE rabbit team would like Flux to start the daemons when a job finishes and stop them right before the job starts to run, so that the daemons are never running on a node at the same time as a job. The daemons are to be stopped by executing a `systemctl stop` and started with a `systemctl start`.

The logic to start the daemons could easily go in the administrative epilog. However, the daemons must not be stopped until they have finished mounting their file systems, which will happen some time after the RUN state is reached, so the command to stop them cannot be issued arbitrarily by the administrative prolog. The daemons are only guaranteed to have finished their work when the job's k8s Workflow resource goes to `PreRun: Ready: True`. That corresponds to the `dws-prolog` jobtap prolog action completing and the `dws_environment` event being posted to the job's eventlog.

One solution would be to add a final bit of logic to the administrative prolog, something like
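A sketch of that logic (where `job_uses_rabbits` is pseudocode for an actual check of whether the job requested rabbit storage):

```
if job_uses_rabbits:
    flux job wait-event dws_environment
    systemctl stop nnf-clientmount
else:
    systemctl stop nnf-clientmount
```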
(since if the job doesn't use rabbits, the `dws_environment` event will not be posted.)

Thoughts @grondo ? Would there be performance issues from having every node read the eventlog? (I hope not, because I think some coral2 plugins already do this elsewhere.)

In any event, we cannot check the k8s Workflow resource from the administrative prolog, because reducing some of the pressure on k8s is one of the things the HPE team is trying to accomplish by having Flux start and stop the daemons.