charmed-hpc / slurm-snap

Snap package for Slurm. Slurm is a highly scalable cluster management and job scheduling system for large and small Linux clusters ⚖️🐧
https://slurm.schedmd.com
Apache License 2.0

Setting munge key does not restart the munged service #15

Open jedel1043 opened 1 month ago

jedel1043 commented 1 month ago

Related to #14.

Running `snap set slurm munge.key=<KEY>` does not automatically restart the `munged` service. The user has to manually run `snap restart slurm.munged` in order for `munged` to pick up the new key.
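
For example, on a single host (assuming the snap is installed and that `munge.key` takes a base64-encoded value; the encoding here is just an assumption for illustration):

```shell
# Set a new munge key; base64 encoding of the value is an assumption.
sudo snap set slurm munge.key="$(head -c 128 /dev/urandom | base64 -w0)"

# munged is still running with the old key at this point.
sudo snap services slurm.munged

# An explicit restart is required before munged picks up the new key.
sudo snap restart slurm.munged
```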

NucciTheBoss commented 4 weeks ago

Hmm... so this one is intentional. Keeping this key synchronized across the cluster is crucial to Slurm's functionality: if your munge key gets out of sync during a refresh, your entire cluster will collapse. Slurm also does not emit a failing exit code when the keys do not match; the Slurm daemons will still be marked as active even though they cannot communicate with each other. My concern here is with being able to do controlled refreshes of the key. This is the typical flow I've seen for refreshing the munge key in a Slurm cluster (a rough shell sketch follows the list):

  1. Update the munge key on the main Slurm controller (`slurmctld`)
  2. Propagate the key out to the other Slurm daemons (`slurmd`, `slurmdbd`, `slurmrestd`)
  3. Restart `munged` cluster-wide
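
For illustration, here's roughly how an operator might drive that flow by hand. The hostnames are placeholders, and it assumes `mungekey` from the munge package is available on the controller, passwordless SSH to the other nodes, and that `munge.key` takes a base64-encoded value:

```shell
# 1. Create a new key on the controller node running slurmctld.
mungekey --create --keyfile=/tmp/munge.key --force
key="$(base64 -w0 /tmp/munge.key)"

# 2. Propagate the key out to the hosts running the other Slurm daemons.
sudo snap set slurm munge.key="$key"
for host in compute-1 compute-2 dbd-1; do
    ssh "$host" sudo snap set slurm munge.key="$key"
done

# 3. Restart munged cluster-wide once every node holds the new key.
sudo snap restart slurm.munged
for host in compute-1 compute-2 dbd-1; do
    ssh "$host" sudo snap restart slurm.munged
done
```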

I think having the user explicitly restart `munged` when they're ready, after all the keys have been set into position, is better than doing it automatically in the `configure` hook, since we can't necessarily guarantee how the user will go about setting the new key if they're just using the snap.

What if we included visual feedback in the shell indicating that the `munged` service needs to be restarted after setting a new key? There's already a message sent to the hooks log in `$SNAP_COMMON`:

```shell
$ snap set slurm munge.key=<key>
INFO: service `slurm.munged` must be restarted for latest key to take effect
```
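
As a minimal sketch of where that message could come from, assuming a shell-based `configure` hook and a hypothetical key path under `$SNAP_COMMON` (neither is necessarily how the snap is actually laid out):

```shell
#!/bin/sh
# Hypothetical configure-hook excerpt: store the new key, but leave the
# restart to the operator and emit a notice instead of bouncing munged.
key="$(snapctl get munge.key)"
if [ -n "$key" ]; then
    echo "$key" | base64 -d > "$SNAP_COMMON/etc/munge/munge.key"
    chmod 0600 "$SNAP_COMMON/etc/munge/munge.key"
    echo "INFO: service \`slurm.munged\` must be restarted for latest key to take effect"
fi
```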

This way we make it clear to the user that they need to restart `munged` for their latest changes to take effect, and we give them more control over the refresh. There's also less chance of us eating their cluster unintentionally. Note that we can set our own refresh policy within the Slurm charms, so it's relatively inexpensive for us to set the new key and restart the service from charm code when we're ready.

NucciTheBoss commented 4 weeks ago

Also, if we go ahead with the enhancement proposed in https://github.com/charmed-hpc/slurm-snap/issues/14, I will likely remove the option to configure the munge key using `snap set ...` and `snap get ...`, since it could introduce coherency issues.