cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

sql/schemachanger: Need Size Estimation & Warnings for Schema Changes #125861

Open rafiss opened 4 months ago

rafiss commented 4 months ago

Copying over a request from @kevinkokomani:

Twice in the last week, we have had customers run into near-outages or actual P1 incidents when running ALTER TABLE ADD COLUMN on a large table in their production clusters.

Under certain conditions, running this will rewrite the entire table. If the customer does not have enough disk capacity, the cluster can quickly run out of space and halt their workload or, worse, start bringing down nodes and causing range unavailability.

This schema change can look fairly benign and simple to a customer that does not know better.

I think we need to give customers warnings and size estimations when they run jobs like these. If the size estimation is more than their disk capacity, then we should warn them and make them confirm before they proceed with the job, or block the job entirely until they increase their disk capacity, at least temporarily.

I could see customers complaining about needing to increase disk capacity for these ADD COLUMN jobs, but that seems like a separate issue to chase down.

Jira issue: CRDB-39640

Epic CRDB-40071

arulajmani commented 4 months ago

This came up in the KV on-call meeting as well when discussing https://github.com/cockroachlabs/support/issues/2987. It's hard to account for all cases, but a best-effort check that compares free disk capacity across the cluster with the size of the table being rewritten should be helpful here.
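For illustration only, a best-effort check along these lines might look like the sketch below. It is not CockroachDB code: the function name `check_rewrite_capacity`, the `Verdict` values, the default replication factor of 3, and the 20% headroom are all assumptions made for the example. It also assumes the rewrite temporarily needs one more fully replicated copy of the table, which will not hold in every case.

```python
from enum import Enum

GIB = 1 << 30


class Verdict(Enum):
    PROCEED = "proceed"  # enough room, no prompt needed
    WARN = "warn"        # ask the user to confirm first
    BLOCK = "block"      # refuse until capacity is added


def check_rewrite_capacity(table_bytes, cluster_free_bytes,
                           replication_factor=3, headroom=0.2):
    """Hypothetical pre-flight check for a table-rewriting schema change.

    table_bytes:        logical size of one copy of the table (assumed input)
    cluster_free_bytes: free disk summed over all stores (assumed input)
    """
    # Assumption: the rewrite writes a second full copy, replicated N ways.
    extra_bytes = table_bytes * replication_factor
    # Keep some free space in reserve for the running workload.
    safe_bytes = cluster_free_bytes * (1 - headroom)
    if extra_bytes > cluster_free_bytes:
        return Verdict.BLOCK
    if extra_bytes > safe_bytes:
        return Verdict.WARN
    return Verdict.PROCEED
```

For example, under these assumptions a 100 GiB table on a cluster with 350 GiB free would get a warning rather than a hard block, since the estimated 300 GiB of extra writes fits on disk but eats into the reserved headroom.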

rafiss commented 4 months ago

There was a related question from this escalation: https://github.com/cockroachlabs/support/issues/2994

There's also a lot of discussion about this in this docs issue: https://cockroachlabs.atlassian.net/browse/DOC-10412

A big challenge is that data may not be evenly distributed across all the stores, but we can likely make the warnings from CRDB more prominent in the case of really large tables.
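To make the uneven-distribution concern concrete, here is a rough sketch of a per-store variant of the check. Everything in it is hypothetical: `StoreStats`, `stores_at_risk`, and the 20% headroom are illustrative names and values, and it naively assumes each store receives rewritten data in proportion to the table data it already holds, ignoring any rebalancing that happens during the job.

```python
from dataclasses import dataclass


@dataclass
class StoreStats:
    free_bytes: int   # free disk on this store
    table_bytes: int  # replica bytes of the target table on this store


def stores_at_risk(stores, headroom=0.2):
    """Return IDs of stores whose share of the rewrite may not fit.

    Assumption: each store must temporarily hold a second copy of the
    table data it already stores.
    """
    at_risk = []
    for store_id, s in sorted(stores.items()):
        if s.table_bytes > s.free_bytes * (1 - headroom):
            at_risk.append(store_id)
    return at_risk


# Made-up numbers: the cluster as a whole has plenty of room
# (300 units to write, 1100 free), but store 3 does not.
stores = {
    1: StoreStats(free_bytes=500, table_bytes=50),
    2: StoreStats(free_bytes=500, table_bytes=50),
    3: StoreStats(free_bytes=100, table_bytes=200),
}
print(stores_at_risk(stores))
```

With these made-up numbers, a cluster-wide comparison would pass while store 3 is flagged, which is the case a purely aggregate warning would miss.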

rimadeodhar commented 2 months ago

Copying over a similar issue faced on our DRT cluster recently: https://cockroachlabs.slack.com/archives/C05FHJJ0MD0/p1724667313762119