cockroachdb / docs

CockroachDB user documentation
https://cockroachlabs.com/docs
Creative Commons Attribution 4.0 International

Rolling upgrade in a multi-region cluster #7733

Open jseldess opened 4 years ago

jseldess commented 4 years ago

Jesse Seldess commented:

In Slack, @holtrdan asked for the best practice when upgrading a multi-region cluster. Which is best?

  1. Upgrade all nodes in a region before moving on to the other regions.
  2. Distribute the upgrade across regions evenly (node in region 1 > node in region 2 > node in region 3 > node in region 1 > node in region 2 > etc.)
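To make the two orderings concrete, here is a minimal Python sketch that generates the node sequence for each option. The cluster layout and node names (`region-1`, `r1-n1`, and so on) are hypothetical and only illustrate the ordering; this is not an upgrade tool.

```python
from itertools import chain, zip_longest

def region_by_region(nodes_by_region):
    """Option 1: finish every node in a region before starting the next region."""
    return list(chain.from_iterable(nodes_by_region.values()))

def interleaved(nodes_by_region):
    """Option 2: take one node from each region in turn, then repeat."""
    rounds = zip_longest(*nodes_by_region.values())
    # zip_longest pads shorter regions with None; skip the padding.
    return [node for rnd in rounds for node in rnd if node is not None]

# Hypothetical three-region cluster, three nodes per region.
cluster = {
    "region-1": ["r1-n1", "r1-n2", "r1-n3"],
    "region-2": ["r2-n1", "r2-n2", "r2-n3"],
    "region-3": ["r3-n1", "r3-n2", "r3-n3"],
}

option_1 = region_by_region(cluster)
option_2 = interleaved(cluster)
```

With this layout, option 1 starts `r1-n1, r1-n2, r1-n3, r2-n1` and option 2 starts `r1-n1, r2-n1, r3-n1, r1-n2`.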

@BramGruneir suggested option 1. We should validate and add a note to our upgrade docs, e.g., here.

We should also define how application traffic is a part of this. For option 1, for example, @dbist mentioned that a customer he's working with has no application traffic in a region while upgrading that region.

Jira Issue: DOC-591

Epic DOC-11047

jseldess commented 4 years ago

@taroface and @johnrk for triage and prioritization.

Also cc @joshimhoff: How do we handle CC multi-region cluster upgrades?

@bdarnell, any opinions?

bdarnell commented 4 years ago

I would also tend to go region-by-region, although I don't think it makes a lot of difference and I'm mainly basing this on the fact that orchestration tooling is more likely to facilitate region-by-region upgrades instead of spreading it out more evenly.

> For option 1, for example, @dbist mentioned that a customer he's working with has no application traffic in a region while upgrading that region.

If you can drain all traffic from a region while upgrading, that's probably a good idea. But if you're using geo-partitioning, that won't really be an option since that region will still need to serve traffic for ranges that are pinned there.

dbist commented 4 years ago

> If you can drain all traffic from a region while upgrading, that's probably a good idea. But if you're using geo-partitioning, that won't really be an option since that region will still need to serve traffic for ranges that are pinned there.

That's a good point. This particular customer does not use geo-partitioning because some of their clusters are on the core version.

joshimhoff commented 4 years ago

> I would also tend to go region-by-region, although I don't think it makes a lot of difference and I'm mainly basing this on the fact that orchestration tooling is more likely to facilitate region-by-region upgrades instead of spreading it out more evenly.

We go one node at a time, starting with region 1, then on to region 2, etc.

> If you can drain all traffic from a region while upgrading, that's probably a good idea. But if you're using geo-partitioning, that won't really be an option since that region will still need to serve traffic for ranges that are pinned there.

We don't do this on CC. We upgrade one node at a time; other nodes in the region keep serving.
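As a rough illustration of "one node at a time, region by region, without draining the region", here is a minimal Python sketch. The `upgrade_node` and `is_healthy` callbacks are hypothetical placeholders for whatever your orchestration tooling provides, and waiting for a health check before moving to the next node is an assumption of this sketch, not a description of how CC actually implements upgrades.

```python
import time

def rolling_upgrade(nodes_by_region, upgrade_node, is_healthy,
                    poll_seconds=0.0, max_polls=10):
    """Upgrade one node at a time, finishing each region before the next.

    upgrade_node and is_healthy are caller-supplied (hypothetical) hooks.
    Halts on the first node that does not report healthy (an assumption).
    """
    done = []
    for region, nodes in nodes_by_region.items():
        for node in nodes:
            upgrade_node(node)  # only this node is down; its peers keep serving
            for _ in range(max_polls):
                if is_healthy(node):
                    break
                time.sleep(poll_seconds)
            else:
                raise RuntimeError(
                    f"{node} in {region} did not become healthy; halting")
            done.append(node)
    return done

# Dry run with stand-in hooks: record the order, report every node healthy.
upgrade_log = []
order = rolling_upgrade(
    {"region-1": ["a", "b"], "region-2": ["c"]},
    upgrade_node=upgrade_log.append,
    is_healthy=lambda node: True,
)
```

Passing the hooks in as parameters keeps the ordering logic separate from any particular deployment tooling, so the loop can be dry-run as shown before wiring it to real commands.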

taroface commented 4 years ago

Relates to #5780.

exalate-issue-sync[bot] commented 1 year ago

linville (mdlinville) commented: It sounds like the recommendation here is not to drain traffic from the region, but to upgrade region by region and, within each region, node by node. If it’s working for CC, it seems like a safe recommendation. Is that correct? If so, I can get this recommendation into the docs.

exalate-issue-sync[bot] commented 1 year ago

linville (mdlinville) commented: Bram Gruneir, I'm coming back to this to check whether the situation is still the same as in the description, whether it has been validated, etc. Any pointers?