SUSE / DeepSea

A collection of Salt files for deploying, managing and automating Ceph.
GNU General Public License v3.0
161 stars 75 forks source link

Add status function to advise on upgrade order #1730

Closed tserong closed 5 years ago

tserong commented 5 years ago

This is similar to status.report, but specifically intended to advise the user on what order to perform an upgrade in. The underlying assumption is that the admin node has already been upgraded to the latest possible version. Then, when you run salt-run upgrade.status, you'll see similar to one of the following:

1) If everything is the same version:

All nodes are running:
  ceph: ceph version 14.2.1-468-g994fd9e0cc (994fd9e0ccc50c2f3a55a3b7a3d4e0ba74786d50) nautilus (stable)
  os: SUSE Linux Enterprise Server 15 SP1

2) If there's a mix of versions:

The newest installed software versions are:
  ceph: ceph version 14.2.1-468-g994fd9e0cc (994fd9e0ccc50c2f3a55a3b7a3d4e0ba74786d50) nautilus (stable)
  os: SUSE Linux Enterprise Server 15 SP1

Nodes running these software versions:
  admin.ceph (roles: master)
  mon2.ceph (roles: admin, mon, mgr)

Nodes running older software versions must be upgraded in the following order:
   1: mon1.ceph (roles: admin, mon, mgr)
   2: mon3.ceph (roles: admin, mon, mgr)
   3: data1.ceph (roles: storage)
   4: data2.ceph (roles: storage)
   5: data3.ceph (roles: storage)
   6: data4.ceph (roles: storage)
   7: data5.ceph (roles: storage)

3) If there's a mix of versions and some node(s) are down (which could be becuase they're being upgraded at the time):

The newest installed software versions are:
  ceph: ceph version 14.2.1-468-g994fd9e0cc (994fd9e0ccc50c2f3a55a3b7a3d4e0ba74786d50) nautilus (stable)
  os: SUSE Linux Enterprise Server 15 SP1

Nodes running these software versions:
  admin.ceph (roles: master)
  mon2.ceph (roles: admin, mon, mgr)

Nodes running older software versions must be upgraded in the following order:
   1: mon3.ceph (roles: admin, mon, mgr)
   2: data1.ceph (roles: storage)
   3: data2.ceph (roles: storage)
   4: data3.ceph (roles: storage)
   5: data4.ceph (roles: storage)
   6: data5.ceph (roles: storage)

Unable to contact these nodes (node down or Salt minion inactive?):
  mon1.ceph

This PR notably does not yet include any unit tests, because I wanted to get feedback on the output and the behaviour first in case it turns out I have to change everything :-) and also on whether I should have just folded this into the existing status.report function somehow.


Checklist:

tserong commented 5 years ago

(yeah, I'll fix that make lint stuff too...)

jschmid1 commented 5 years ago

Code looks good overall, minus some linter kinks here and there.

This will definitely improve the upgrade process by a lot.

swiftgist commented 5 years ago

I think the code looks fine and the report is useful.

Take this next part with a grain of salt. :) I am thinking out loud and trying to imagine those admins with the least Ceph and Salt experience. Will there be any confusion with

   1: mon1.ceph (roles: admin, mon, mgr)
   2: mon3.ceph (roles: admin, mon, mgr)

that the horizontal list should be executed on mon1 prior to the same list on mon3? Do we have any Sales Engineers that will ask "is it okay to do the admin role on both prior to the other roles?"

While I imagine we cannot prevent everything, maybe the report should list each role individually. While the numbering would imply "read this from top to bottom", explicitly listing

  1: mon1.ceph - admin
  2:                      mon
  3:                      mgr

I don't know how well this communicates shared roles where a minion is both a storage and rgw role and upgrading ceph-common impacts the latter. And that's likely not the best formatting. So, just food for thought.

tserong commented 5 years ago

Will there be any confusion [...] that the horizontal list should be executed on mon1 prior to the same list on mon3?

Hrm. Good point. I listed the roles intending to be informative (trying to indicate that "this node has these roles assigned"), rather than suggesting that particular roles could be upgraded independently of each other on a single node (which they can't right now for SES5->6; independent role upgrades would need containerization)

tserong commented 5 years ago

Fixed the linter kinks, also changed the output to say "assigned roles: (...)" rather than "roles" in an attempt to indicate that this is just telling you what the node does, and not indicating that one should attempt to update roles on one node independently of each other.

jschmid1 commented 5 years ago

looks good!