Scripts/Cronjobs for Galaxy cluster

sandertyu commented 3 years ago

Currently there are 2 cronjobs running on the gravity management node: one scrubs the file server blackhole each month and reports back its findings, and the other emails us every 4 months to upgrade the cluster (view these with sudo crontab -e on gravity). Some suggested changes:

[ ] I feel like the current suggestion to upgrade the router only when we upgrade the rest of the cluster is too infrequent. When rooster was our gateway we upgraded it each week, though granted, rooster was doing more work than our current routers are. With a highly available setup we might be able to upgrade more frequently, doing one router at a time to ensure there is no downtime. This should probably be done manually, but we could have a notification script.
[ ] If we don't want to do the above, the cluster upgrade email script should also mention upgrading the firewall/router.
[ ] Potentially create a script that notifies us when the nodes need to be restarted to apply certain automatic Ubuntu updates. The reboot process could even be automated, but that sounds dangerous. The cluster is highly available, so we could apply these updates by rebooting more often than the cluster upgrade reminder suggests, but we could also just do it all at the same time. More info here

rkevin-arch commented 3 years ago

> potentially create a script which notifies us when the nodes need to be restarted to apply certain automatic Ubuntu updates.

If you want to, you can determine whether a node needs a reboot by checking whether the file /var/run/reboot-required exists; the /var/run/reboot-required.pkgs file lists the packages responsible for it.
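A minimal sketch of such a check, in case it helps whoever picks this up. The function names (reboot_status, format_report) and the run_dir parameter are made up for illustration, and how the message actually gets emailed is left out; it only prints, on the assumption that cron's own mail handling or a wrapper would deliver the output.

```python
import socket
from pathlib import Path


def reboot_status(run_dir="/var/run"):
    """Return (needs_reboot, packages) from Ubuntu's reboot-required marker files.

    run_dir is a hypothetical parameter so the check can be pointed at a
    test directory instead of the real /var/run.
    """
    run = Path(run_dir)
    if not (run / "reboot-required").exists():
        return False, []

    pkgs_file = run / "reboot-required.pkgs"
    packages = []
    if pkgs_file.exists():
        # The file can list the same package more than once; dedupe and sort.
        lines = pkgs_file.read_text().splitlines()
        packages = sorted({line.strip() for line in lines if line.strip()})
    return True, packages


def format_report(hostname, needs_reboot, packages):
    """Build a one-line, human-readable summary for a notification."""
    if not needs_reboot:
        return f"{hostname}: no reboot required"
    pkg_list = ", ".join(packages) if packages else "unknown packages"
    return f"{hostname}: reboot required by {pkg_list}"


if __name__ == "__main__":
    needs, pkgs = reboot_status()
    print(format_report(socket.gethostname(), needs, pkgs))
```

Run on each node from cron, this would only report; it deliberately never reboots anything.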

> The cluster is highly available so we could apply these updates by rebooting more often.

I don't think this is a good idea, because even though the cluster itself is HA, JupyterHub is not. If you restart the node running the hub pod, the hub will be inaccessible until the pod is respawned. Also, if we kill user pods, they won't respawn automatically and the user could genuinely lose data. I think bundling this with cluster upgrades would be better; otherwise we have to come up with a fairly complicated system to determine which nodes are safe to cordon and only update those.
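To give a rough idea of what the "which nodes are safe" part could look like, here is a sketch under some assumptions: the pod list is in the shape of the items array from kubectl get pods --all-namespaces -o json, and the hub and user pods carry the labels component=hub and component=singleuser-server (what the JupyterHub Helm chart uses as far as I know, but worth verifying on our cluster). The function name and node names are hypothetical, and a real system would also have to handle pods that get scheduled between the check and the cordon.

```python
def nodes_safe_to_reboot(pods, all_nodes,
                         blocking_components=("hub", "singleuser-server")):
    """Return the nodes that host no hub pod and no user pod.

    pods: list of pod objects (dicts) shaped like the "items" array of
          `kubectl get pods --all-namespaces -o json`.
    all_nodes: iterable of node names in the cluster.
    blocking_components: values of the "component" label that make a node
          unsafe to restart (assumed label values, check the real ones).
    """
    blocked = set()
    for pod in pods:
        labels = pod.get("metadata", {}).get("labels") or {}
        node = pod.get("spec", {}).get("nodeName")
        # Pods that are not scheduled yet have no nodeName; skip them.
        if node and labels.get("component") in blocking_components:
            blocked.add(node)
    return sorted(set(all_nodes) - blocked)


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    example_pods = [
        {"metadata": {"labels": {"component": "hub"}},
         "spec": {"nodeName": "node-a"}},
        {"metadata": {"labels": {"component": "singleuser-server"}},
         "spec": {"nodeName": "node-b"}},
        {"metadata": {"labels": {"component": "proxy"}},
         "spec": {"nodeName": "node-c"}},
    ]
    print(nodes_safe_to_reboot(example_pods, ["node-a", "node-b", "node-c"]))
```

Even with this, only the nodes it returns could be cordoned and rebooted; the others would still have to wait for a scheduled cluster upgrade.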

pmackle commented 3 years ago

I'll try tackling this

sandertyu commented 3 years ago

We've got a satisfactory number of automated scripts, and they are all set up through puppet. Closing.

LibreTexts / metalc

Scripts/Cronjobs for Galaxy cluster #224