jedel1043 opened 1 month ago
Hmm... so this one is intentional. This key being synchronized across the cluster is crucial to Slurm's functionality: if your munge key gets out of sync during a refresh, your entire cluster will collapse. Slurm also emits no error code when the keys do not match; the Slurm daemons are still marked as active even though they cannot communicate with each other. My concern here is with being able to do controlled refreshes of the key. This is the typical flow I've seen for refreshing the munge key in a Slurm cluster:
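The concrete steps didn't make it into the quote above, but a controlled refresh on a classic (non-snap) deployment usually looks something like the sketch below. The hostnames (`head`, `compute01`), systemd unit names, and paths are illustrative assumptions, not part of the original comment:

```shell
# Hypothetical sketch of a controlled munge key refresh on a two-node cluster.

# 1. Stop the Slurm daemons so no RPCs are in flight during the key swap.
ssh head 'systemctl stop slurmctld'
ssh compute01 'systemctl stop slurmd'

# 2. Generate a new key on the head node (1024 random bytes is the
#    conventional size) and lock down its permissions.
ssh head "dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key \
  && chown munge:munge /etc/munge/munge.key \
  && chmod 400 /etc/munge/munge.key"

# 3. Copy the key to every other node (ownership/permissions must match there too).
scp head:/etc/munge/munge.key compute01:/etc/munge/munge.key

# 4. Restart munged everywhere so the new key takes effect.
ssh head 'systemctl restart munge'
ssh compute01 'systemctl restart munge'

# 5. Bring the Slurm daemons back up.
ssh head 'systemctl start slurmctld'
ssh compute01 'systemctl start slurmd'
```

The important property is that munged is only restarted after the key is in place on every node, which is exactly the control the comment argues for.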
I think it's better to have the user explicitly restart munged once all the keys have been put in place, rather than restarting automatically in the configure hook, since we can't necessarily guarantee how the user will go about setting the new key if they're just using the snap.
What if we included visual feedback in the shell indicating that the munged service needs to be restarted after setting a new key? There's already a message sent to the hooks log in `$SNAP_COMMON`:

```shell
$ snap set slurm munge.key=<key>
INFO: service `slurm.munged` must be restarted for latest key to take effect
```
This way we make it clear to the user that they need to restart munged for their latest changes to take effect, and we give them more control over the refresh. It also leaves less chance of us eating their cluster unintentionally. Note that we can set our own refresh policy within the Slurm charms, so it's relatively inexpensive for us to set the new key and restart the service from charm code when we're ready.
Also, if we go ahead with the enhancement proposed in https://github.com/charmed-hpc/slurm-snap/issues/14, I will likely remove the option to configure the munge key using `snap set ...` and `snap get ...`, since it could introduce coherency issues.
Related to #14.
Running `snap set slurm munge.key=<KEY>` does not automatically restart the munged service. The user has to manually run `snap restart slurm.munged` in order for munged to pick up the new key.
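Putting the two commands together, the full manual flow with the snap might look like the sketch below. The `munge.key` option name and the `slurm.munged` service name come from this issue; the base64-encoding step is an assumption about how a raw binary key would be passed through `snap set`:

```shell
# Hypothetical end-to-end key refresh using the slurm snap.

# Generate a new 1024-byte random key and base64-encode it so it can be
# passed as a snap option value (encoding step is an assumption).
NEW_KEY=$(dd if=/dev/urandom bs=1 count=1024 2>/dev/null | base64 -w0)

# Stage the new key. munged keeps running with the old key at this point,
# which is what lets you coordinate the rollout across all nodes first.
sudo snap set slurm munge.key="$NEW_KEY"

# Once every node in the cluster has the new key staged, restart munged
# explicitly so it takes effect.
sudo snap restart slurm.munged
```

Run the `snap set`/`snap restart` pair on every node; only restart munged after the key has been set everywhere, otherwise nodes with mismatched keys will silently stop authenticating to each other.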