cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

ui: display node upgrade progress in DB Console #67330

Closed thtruo closed 2 years ago

thtruo commented 3 years ago

Observations A customer encountered problems with their large cluster while upgrading their CRDB version. In this scenario, it would have been useful to show the status of the upgrade directly in DB Console.

Desired behavior Fortunately, the DB Console overview page already includes a breakdown of every node and its version in the node list. We should compute the percentage of upgraded nodes and display it more prominently on this overview page so that users monitoring their cluster have a quick read on how the upgrade is going.
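
A minimal sketch of the percentage computation, for illustration only. It assumes node versions arrive as plain strings such as `v21.1.6` and that the newest version seen in the cluster is the upgrade target; the function names, input shape, and sample versions are hypothetical and not taken from the DB Console code.

```python
def parse_version(v: str) -> tuple[int, ...]:
    """Turn 'v21.1.6' into (21, 1, 6) so versions compare numerically.

    Assumption: purely numeric dotted versions (no '-beta' style suffixes).
    """
    return tuple(int(part) for part in v.lstrip("v").split("."))


def upgrade_progress(node_versions: dict[int, str]) -> tuple[int, int, float]:
    """Return (upgraded, total, percent) for a {node_id: version} mapping.

    Assumption: the newest version present is the one being rolled out.
    """
    if not node_versions:
        return 0, 0, 0.0
    target = max(parse_version(v) for v in node_versions.values())
    upgraded = sum(1 for v in node_versions.values() if parse_version(v) == target)
    total = len(node_versions)
    return upgraded, total, 100.0 * upgraded / total


# Hypothetical three-node cluster, two nodes already on the newer binary.
versions = {1: "v21.1.6", 2: "v21.1.6", 3: "v20.2.13"}
upgraded, total, pct = upgrade_progress(versions)
print(f"{upgraded}/{total} nodes upgraded ({pct:.0f}%)")  # 2/3 nodes upgraded (67%)
```

Returning both the raw counts and the percentage leaves the choice between a fraction and a percentage to the display layer.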

Epic CRDB-10792

Jira issue: CRDB-8488

thtruo commented 3 years ago

cc @Annebirzin FYI this came up in internal conversations because it impacted a customer. Do you have any thoughts on where this status can go in the DB Console?

thtruo commented 3 years ago

Assigning to @nkodali for triage

Annebirzin commented 3 years ago

@thtruo currently on the overview page node list, we have a status column that shows 'live, suspect, dead' node status. Maybe we could include a status type on the node list for 'upgrading'?

thtruo commented 3 years ago

@Annebirzin would it be something along these lines?

Image 2021-07-12 at 3 50 51 PM

FWIW putting it under node status seems like a good home at the moment. I think we just need to be clear about the states, e.g.:

Annebirzin commented 3 years ago

@thtruo ah sorry, I was actually referring to the node list below that module where we list out the nodes and their statuses (right-most column)

I was thinking we could include an orange status badge for 'upgrading' (along with the current live, suspect, dead, decommissioning statuses)

That way you can see which specific nodes are in the process of upgrading. Thoughts?

Screen Shot 2021-07-12 at 5 58 46 PM

thtruo commented 3 years ago

@Annebirzin Gotcha, that makes sense! I think one downside of this design is that for large clusters (say 80+ or 100+ nodes), keeping a pulse on the upgrade status across the cluster is going to be a real challenge. A more prominent summary of an upgrade status would still be needed to avoid a scenario where a user scrolls through a long node list.

(Having said that, this badge still makes sense to include as a status in the list for each node!)

Any other thoughts around where we can more prominently display the overall upgrade status outside of "node status"? What about utilizing the space here? 🤔

Image 2021-07-12 at 6 42 51 PM

Annebirzin commented 3 years ago

@thtruo Good point about the 100+ nodes issue. I agree with including both the summary status (i.e., 67% upgraded) and the badge status in the node list.

Feels like the most logical place for the summary status is in the top module under 'node status' as you had in your first screen. I think we'll need to consider layout/responsiveness of this module as we continue to update/add metrics. (another status that came to mind for the top summary module was 'decommissioning nodes')

I can explore the layout and follow up with some ideas.

thtruo commented 3 years ago

Heads up @Annebirzin https://github.com/cockroachdb/cockroach/issues/67667 and https://github.com/cockroachdb/cockroach/issues/67665 seem like they'd inform how we present "progress" across other surface areas within the DB Console UI. Perhaps it's worth us adopting fractions to portray node update progress, either alongside or in lieu of using a %. Thoughts?

Annebirzin commented 3 years ago

@thtruo got it, yep that makes sense. I've put together some designs for adding the 'upgrading node' status to the top summary module and node list here: https://www.figma.com/file/OzoUuhNk05nFuaZB6UNVBx/21.2_obsrv-node-list?node-id=679%3A3917

I also include designs for how this module should behave at different screen sizes.

One question about how we display the node counts. In version a, I show 100 live nodes and 43/100 upgrading. But if a node is upgrading then I'm guessing it can't be live? In that case, I think we would use version b, showing 57 live nodes and 43 upgrading nodes.

Let me know any thoughts

thtruo commented 3 years ago

Thanks Anne!

> One question about how we display the node counts. In version a, I show 100 live nodes and 43/100 upgrading. But if a node is upgrading then I'm guessing it can't be live? In that case, I think we would use version b, showing 57 live nodes and 43 upgrading nodes.

I'd be curious how @florence-crl and @mikeczabator think between those versions

thtruo commented 3 years ago

@Annebirzin during a rolling upgrade, while a node is updating from one version to another, its status will temporarily be SUSPECT before becoming LIVE again. I learned that there's a waiting period of around 5 minutes in which a SUSPECT node will be marked as either LIVE or DEAD. In a smooth rolling upgrade, a node should never go from SUSPECT to DEAD to LIVE; it should go from SUSPECT straight to LIVE.
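
The expected status sequence can be written down as a small check. This is only a sketch: the function name and the list-of-status-strings input are hypothetical, and it does not model the waiting period itself.

```python
def is_smooth_restart(history: list[str]) -> bool:
    """Return True if a node's observed statuses match a smooth rolling restart.

    Smooth means the node ends up LIVE again and was never marked DEAD
    along the way (e.g. LIVE -> SUSPECT -> LIVE).
    """
    if not history:
        return False
    return history[-1] == "LIVE" and "DEAD" not in history


print(is_smooth_restart(["LIVE", "SUSPECT", "LIVE"]))          # True
print(is_smooth_restart(["LIVE", "SUSPECT", "DEAD", "LIVE"]))  # False
```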

While a cluster is upgrading, it's actually in a mixed version state. An operator would want to know how much of their cluster has safely upgraded. Knowing this, I have additional feedback following this:

> One question about how we display the node counts. In version a, I show 100 live nodes and 43/100 upgrading. But if a node is upgrading then I'm guessing it can't be live? In that case, I think we would use version b, showing 57 live nodes and 43 upgrading nodes.

IIUC during a rolling upgrade, only one node is being upgraded at a time; the next node will only start to upgrade once the current one is done. That means we'd most likely be reporting 99 live nodes throughout the process. Perhaps a tweaked version a is sufficient; we show 99 live nodes and 43/100 upgrading (this gives signal into how much of the cluster is still in a mixed version state). We won't ever have a situation like version b where we have <99 live nodes.

Annebirzin commented 3 years ago

ah I see, that all makes sense. I've removed version B and updated version A to reflect the correct upgrading counts. I updated the colors from orange to green to show nodes that have successfully upgraded and included a tooltip (copy tbd). Also included a state for when there is no upgrade in progress, showing that we would hide the count.
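
A rough sketch of how the summary text for this design could be derived, assuming per-node status and version maps and an explicitly supplied target version. All names and sample values here are hypothetical; the real DB Console data model may differ.

```python
def node_summary(
    statuses: dict[int, str],
    versions: dict[int, str],
    target_version: str,
) -> str:
    """Build the summary line: live count, plus upgraded fraction when relevant."""
    live = sum(1 for s in statuses.values() if s == "LIVE")
    summary = f"{live} live nodes"
    # Only show the upgraded count while the cluster is in a mixed-version
    # state; with a single version everywhere, no upgrade is in progress.
    if len(set(versions.values())) > 1:
        upgraded = sum(1 for v in versions.values() if v == target_version)
        summary += f", {upgraded}/{len(versions)} upgraded"
    return summary


# Hypothetical 100-node cluster: 43 nodes upgraded, node 44 restarting.
statuses = {n: "LIVE" for n in range(1, 101)}
statuses[44] = "SUSPECT"
versions = {n: ("v21.2.0" if n <= 43 else "v21.1.6") for n in range(1, 101)}
print(node_summary(statuses, versions, "v21.2.0"))  # 99 live nodes, 43/100 upgraded
```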

Also good to know that only one node upgrades at a time. In that case, it seems like we don't need to introduce the 'upgrading' status in the node list since that upgrading node will be listed as suspect.

Thoughts?

thtruo commented 3 years ago

Nice, thanks @Annebirzin this looks good to me 👍

> Also good to know that only one node upgrades at a time. In that case, it seems like we don't need to introduce the 'upgrading' status in the node list since that upgrading node will be listed as suspect.

+1. Good point, no need to add a new status in the node list in this case.

ajwerner commented 2 years ago

I feel like this just needs stricter definitions. We have a wealth of data sources to tell what binary versions are running and when they were deployed (like, for example, the system.eventlog).

Percent of nodes running a newer version also seems like a very easy thing for us to surface. I feel like the decision to try to couple this to the actual user story of their upgrade automation is fraught and is bringing in more complexity than we need. Maybe if we wanted to give the user more insight into how it's going, we could make some detailed visualization of when different nodes started at different versions. Like, using your imagination, what if we had some chart that was like time in the x axis and each row was a node and then colors represented the version they were running?

n1 ----------   .....................
n2 -------------  ..................

I'm no designer (obviously)
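
A text-only sketch of that kind of chart, to make the idea concrete. The input shape (per-node lists of `(time, version)` start events) is invented for illustration and is not the schema of system.eventlog or any other real source; downtime gaps are not modeled.

```python
def render_timeline(
    starts: dict[int, list[tuple[int, str]]],
    t_end: int,
    width: int = 20,
) -> str:
    """Draw one row per node; each column shows the version running at that time.

    starts maps node_id -> [(start_time, version), ...] sorted by time.
    """
    t_start = min(t for events in starts.values() for t, _ in events)
    span = t_end - t_start

    # Assign one symbol per version, in order of first appearance in time.
    symbol_for: dict[str, str] = {}
    for _, version in sorted(ev for events in starts.values() for ev in events):
        if version not in symbol_for:
            symbol_for[version] = "-.=*#"[len(symbol_for) % 5]

    lines = []
    for node_id in sorted(starts):
        row = []
        for col in range(width):
            t = t_start + span * col / width
            current = " "  # blank until the node's first start event
            for ev_t, version in starts[node_id]:
                if ev_t <= t:
                    current = symbol_for[version]
            row.append(current)
        lines.append(f"n{node_id} " + "".join(row))
    return "\n".join(lines)


# Hypothetical data: n1 restarts on the new version at t=10, n2 at t=14.
events = {
    1: [(0, "v21.1"), (10, "v21.2")],
    2: [(0, "v21.1"), (14, "v21.2")],
}
print(render_timeline(events, t_end=20))
```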

Annebirzin commented 2 years ago

@Santamaura @tommy As discussed in our weekly, I've added a screen for showing a 'Mixed Version' alert banner on the overview page. Let me know any thoughts: https://www.figma.com/file/OzoUuhNk05nFuaZB6UNVBx/21.2_obsrv-dbconsole-overview?node-id=1185%3A12568

Screen Shot 2022-02-22 at 3 36 42 PM