cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

storage,kvserver: reject writes on low disk space #79210

Open erikgrinaker opened 2 years ago

erikgrinaker commented 2 years ago

As outlined in #74104, we need a mechanism to prevent nodes from running out of disk, since this can crash nodes (making reads unavailable too) and is difficult to recover from.

One option would be for admission control to monitor store capacity and reject incoming writes when it's nearly full. Admission control currently monitors store health (read amp/L0) and throttles/prioritizes incoming writes when overloaded -- monitoring store capacity and rejecting incoming writes seems like a related concern. This could also allow us to differentiate between types of writes, e.g. reject bulk writes at a lower threshold, but allow transactional writes up to a higher threshold, and always allow admin writes to system ranges.
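To make the per-write-class idea concrete, here is a minimal sketch of what such an admission check could look like. The class names and threshold values are illustrative assumptions, not CockroachDB's actual admission control code:

```go
package main

import "fmt"

// WriteClass distinguishes write sources so each can be cut off at a
// different store-fullness threshold. Names are illustrative only.
type WriteClass int

const (
	BulkWrite   WriteClass = iota // e.g. AddSSTable ingestion
	TxnWrite                      // regular transactional writes
	SystemWrite                   // writes to system ranges (lease moves etc.)
)

// maxUsedFraction is the store fullness above which each class is
// rejected. Bulk writes are rejected first; system writes are kept
// admissible almost to the end. Values are hypothetical.
var maxUsedFraction = map[WriteClass]float64{
	BulkWrite:   0.90,
	TxnWrite:    0.95,
	SystemWrite: 0.99,
}

// admitWrite returns nil if a write of the given class may proceed on a
// store with the given used/total capacity, or an error otherwise.
func admitWrite(class WriteClass, usedBytes, totalBytes int64) error {
	used := float64(usedBytes) / float64(totalBytes)
	if used >= maxUsedFraction[class] {
		return fmt.Errorf("store %.0f%% full: rejecting write class %d", used*100, class)
	}
	return nil
}

func main() {
	// At 92% full, bulk ingestion is rejected but transactional
	// writes are still admitted.
	fmt.Println(admitWrite(BulkWrite, 92, 100))
	fmt.Println(admitWrite(TxnWrite, 92, 100))
}
```

A real implementation would also have to account for writes already in flight and for follower replication, as discussed below.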

We would also have to take follower writes into account, such that we don't run out of disk because of writes that were accepted on the separate leaseholder node (a similar problem exists for overload protection). The allocator also comes into play here, since it should ideally balance disk usage across nodes.

Jira issue: CRDB-14638

erikgrinaker commented 2 years ago

This has already been implemented specifically for AddSSTable in #78541. This proposal is a generalization of that.

jbowens commented 2 years ago

Would doing the limiting in admission control allow for rejecting writes to a single zone configuration? I think the problem of different zone configurations with different constraints is interesting: one zone may exhaust its available disk space, but you might not want that to affect the availability of ranges in other zones. If the constraints allow, I think replicas that don't need to be on the out-of-disk nodes would ideally already have been shed before those nodes fill up, assuming we're able to shed them fast enough.

erikgrinaker commented 2 years ago

I think that'd be a question for @irfansharif and @cockroachdb/kv-distribution.

What if multiple zone configs apply to a store? Would that imply that we allocate disk space quotas across zones that share stores? Or only set the relative prioritization of zones on a store when it's near full? Using zone configs would give us the system range prioritization/isolation, and more flexibility in general. It could possibly also address tenant isolation. Quotas, however, seem like a harder problem to solve than simply rejecting writes when the store fills up.

andrewbaptist commented 2 years ago

We should assume that the replicate_queue will keep stores "as even as possible", and that stores should reject writes once they drop below 5% free space. If a store does drop below 5% free, rejecting writes is the right thing to do (this doesn't have to be admission control, since it's not performance related). There should be lots of red flags for admins well before they get into this situation. The replicate_queue and operations like decommissioning and scatter also need to take this into account (@AlexTalks), so that they don't try to move snapshots to a store that is already getting low on space.

erikgrinaker commented 2 years ago

I think we'll at the very least need to differentiate between system writes and user writes here, since we need to be able to make system writes in order to e.g. move leases around.

andrewbaptist commented 2 years ago

Really good point. Yes, it would definitely be best to differentiate, although there may be a lower threshold (1%) at which we reject even system writes, to prevent system instability (and no chance of recovery). Possibly in that case we shut down the store, or at least stop heartbeating, so that leases are forcefully transferred off.
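The tiered scheme described in the last few comments could be sketched as a small state function over free-space fraction. The 5% and 1% thresholds come from the discussion above; the type and action names are hypothetical:

```go
package main

import "fmt"

// Action describes how a store responds as free space shrinks.
// Names and structure are illustrative, not CockroachDB code.
type Action int

const (
	AdmitAll         Action = iota // plenty of free space
	RejectUserWrites               // <5% free: reject user writes, still allow system writes
	RejectAllWrites                // <1% free: reject even system writes; shed leases
)

// diskAction picks an action from a store's free-space fraction,
// using the 5% and 1% thresholds from the discussion.
func diskAction(freeBytes, totalBytes int64) Action {
	free := float64(freeBytes) / float64(totalBytes)
	switch {
	case free < 0.01:
		// e.g. stop heartbeating so leases are forcefully transferred off
		return RejectAllWrites
	case free < 0.05:
		return RejectUserWrites
	default:
		return AdmitAll
	}
}

func main() {
	fmt.Println(diskAction(10, 100)) // well above both thresholds
	fmt.Println(diskAction(3, 100))  // between 1% and 5% free
}
```

Keeping system writes admissible in the middle tier is what preserves the ability to move leases and replicas off the filling store before the hard cutoff.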