SUSE / DeepSea

A collection of Salt files for deploying, managing and automating Ceph.
GNU General Public License v3.0
160 stars 75 forks source link

runners/upgrade: enable automated upgrade if cluster is on nautilus #1777

Closed tserong closed 4 years ago

tserong commented 4 years ago

This re-allows ceph.maintenance.upgrade, provided all nodes are running Nautilus, so that function can be used to apply minor updates again.

Signed-off-by: Tim Serong tserong@suse.com


Checklist:

tserong commented 4 years ago

I've labelled this DNM, as I haven't actually tested ceph.maintenance.upgrade yet on Nautilus, and I suspect nobody else has either. I mostly just want to get feedback on two things:

1) Is this check sensible?

2) What's up with this? https://github.com/SUSE/DeepSea/blob/079ad63d6593a9db6cf55a60987a2472aa18351d/srv/modules/runners/upgrade.py#L125-L126 It looks like the colocated_services check was never enabled, but the comment at https://github.com/SUSE/DeepSea/blob/079ad63d6593a9db6cf55a60987a2472aa18351d/srv/modules/runners/upgrade.py#L17-L26 seems to suggest it should be...

swiftgist commented 4 years ago

I think the check is fine, but should we have more specific error messages for multiple OS or ceph versions? A single minion that is down would also give the "upgrade is not supported".

tserong commented 4 years ago

Good point. Also I just realised that multiple Ceph versions are fine, provided they're within the same major release (imagine someone somehow gets half way through applying minor updates, then something breaks, then they need to run the update again). I'll rework this a bit.

tserong commented 4 years ago

While poking around further I discovered that f2e85228 changed the logic in /srv/salt/ceph/maintenance/upgrade/master/default.sls. Previously, the upgrade would only run if minions.ready, upgrade.check and validate.setup all returned True. Now, since that commit, if minions.ready returns False, the upgrade proceeds anyway, and does not call upgrade.check or validate.setup. @jschmid1 (or anyone else), can you recall the reason for this change?

tserong commented 4 years ago

OK, that's a better check and error messages.

Still need to figure out what's up with f2e8522. I don't think the behaviour introduced there is correct. Possibly the intention was to wait a while if some nodes are down in case they come back, but then to proceed with the upgrade anyway? If that's the case, surely it should still run upgrade.check and validate.setup?

jschmid1 commented 4 years ago

Closing.. Reopen if needed