cockroachdb / docs

CockroachDB user documentation
https://cockroachlabs.com/docs
Creative Commons Attribution 4.0 International

Clarify hardware checks pre-rolling upgrade #6111

Closed by jseldess 1 month ago

jseldess commented 4 years ago

Jesse Seldess commented:

In the last bullet of step 3 here, we say the following:

Make sure capacity and memory usage are reasonable for each node. Nodes must be able to tolerate some increase in case the new version uses more resources for your workload. Also go to Metrics > Dashboard: Hardware and make sure CPU percent is reasonable across the cluster. If there's not enough headroom on any of these metrics, consider adding nodes to your cluster before beginning your upgrade.

We need to clarify this guidance. There are two considerations. First, each node needs enough CPU and memory to handle a possible increase in resource usage with the new version. Second, the cluster as a whole needs to tolerate one node being down at a time during the rolling upgrade, which can increase CPU usage on the remaining nodes.

More insight from @bdarnell:

So there are two things here. You need enough headroom to survive the largest loss that's part of your fault tolerance model; this one is relatively easy to quantify. And then there's the "healthy margin" around upgrades, which is a hedge in case the new version is less efficient than the old version. It's hard to provide universal guidance here (we never expect the new version to be less efficient than the old, but sometimes it happens) because it comes down to the operator's personal level of caution. For the upgrade margin, I'd personally use something like 10-15% (that is, if I'm within that distance of my cluster's red line, I'd add capacity before upgrading).
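
To make the two checks concrete, here is a minimal Python sketch of the arithmetic. It is not a CockroachDB tool or API. The function name, the 80% red line, and the assumption that a lost node's load spreads evenly across the surviving nodes are all illustrative assumptions. Only the 10-15% upgrade margin comes from the comment above, read here as percentage points below the red line.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class HeadroomResult:
    projected_utilization: float  # per-node utilization after losing nodes
    limit: float                  # red line minus the upgrade margin
    ok: bool                      # True if projected utilization fits under the limit


def check_upgrade_headroom(
    node_utilizations: Sequence[float],
    red_line: float = 0.80,        # hypothetical red line; choose your own
    upgrade_margin: float = 0.15,  # upper end of the 10-15% suggested above
    nodes_lost: int = 1,           # rolling upgrade takes one node down at a time
) -> HeadroomResult:
    """Estimate whether a cluster has enough headroom to start an upgrade.

    Utilizations are fractions in [0, 1] for one metric (for example CPU).
    Assumes, as a simplification, that the load of the lost nodes is spread
    evenly across the survivors.
    """
    n = len(node_utilizations)
    if nodes_lost < 0 or n <= nodes_lost:
        raise ValueError("need more nodes than the number assumed lost")

    total_load = sum(node_utilizations)
    projected = total_load / (n - nodes_lost)
    limit = red_line - upgrade_margin
    return HeadroomResult(projected, limit, projected <= limit)


# Three nodes at 50% CPU: losing one pushes the survivors toward 75%.
three_nodes = check_upgrade_headroom([0.5, 0.5, 0.5])

# Five nodes at 40% CPU: losing one pushes the survivors toward 50%.
five_nodes = check_upgrade_headroom([0.4, 0.4, 0.4, 0.4, 0.4])
```

Under these assumptions, the three-node cluster fails the check and the five-node cluster passes. That matches the original guidance to consider adding nodes before upgrading when headroom is short.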

Jira Issue: DOC-387

jseldess commented 4 years ago

cc: @roncrdb.