derkan closed this issue 2 years ago.
We had a chat on Gitter with @mberhault. He kindly explained how CockroachDB clusters work across multiple datacenters. @mberhault suggested this doc and this doc, which were helpful.
But unfortunately, with 2 datacenters, if one DC is lost, the cluster in the other DC will not be available to our services, which means no data is accessible afterwards. (The same applies to a 3-DC cluster if 2 of the 3 DCs are lost.)
I think CockroachDB needs a mechanism to promote the last available DC to stay online when the others are down, whether manually or automatically. I hit issue #32524 on a single-DC, 3-node cluster and still cannot recover some rows from some tables; SQL statements on those tables wait forever. This forever-wait behaviour also should not exist; instead, an exception should be raised saying that some ranges are not accessible, because applications are blocked by these waits.
@tim-o please follow this discussion
Hey @derkan - it sounds like there are two requests here: 1) Raise an exception if ranges are not accessible, and 2) Support a use case where a cluster can continue running after losing half of its replicas.
Is that correct? If so, the first certainly seems reasonable but the second would be a very large change. We do have work ongoing to allow single replicas to return reads if quorum can't be reached ("follower reads"), but this would not allow writes.
Let me know if that would suit your use case and if I've captured your feedback above.
Thanks!
Thanks @tim-o
The first definitely seems uncontroversial to me. @awoods187 can you comment on the second request above? Does this fall under an existing project or is it net new / unspecced?
@derkan we definitely do want to be fault tolerant, but I'd be concerned that there's no way to break a 'tie' in a world where we allow a cluster to operate with consensus from only 50% of the nodes. I'll leave it to someone more familiar with core to comment on that though.
Falling back to some kind of "inconsistent reads" mode when quorum is lost is very difficult to implement correctly. For starters, none of the internal invariants of the SQL layer are expected to hold in that situation.
I have requalified the issue to become a feature request for a timeout on requests to unavailable ranges.
Unfortunately at this time our internal work backlog is a bit congested, but we'll be on the lookout for this UX improvement. Thank you for suggesting.
@awoods187 @piyush-singh this request pertains to the UX of partially unavailable clusters. Maybe this fits into a larger roadmap picture you already have?
FWIW, this is straightforward to implement. Tentatively putting this on my plate for 2.2, it would also aid our debugging immensely.
@nvanbenschoten also points out that with parallel commits in a single phase, even txns whose first write goes to an unavailable range may not realize they should stop there, due to pipelining. From that point of view it also makes sense to fail fast, so that these txns don't keep their unrelated intents alive. Also, our deadlock detection won't work if txns are anchored on unavailable ranges.
@lunevalex is planning to prototype around this issue. We are considering adding a circuit breaker on Replica that uses suitable signals (TBD) to determine whether the range is unavailable (as in, commands proposed don't get applied in a timely manner). While transactions that are anchored on, or parallel-committed with intents on, an unavailable range will forcibly remain open until the range is recovered, we see value in preventing any new operations from blocking on the range, to limit the blast radius of the unavailability. This is motivated by a recent real-world outage in which a single range's unavailability caused transactions accessing it to block and, in turn, keep their intents elsewhere open indefinitely; the resulting cascading contention was difficult to get under control even after the user had taken steps to remove writes to the table from the app.
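For illustration only (none of this is the actual CockroachDB implementation; the `replicaBreaker` type, the timeout, and the probing scheme are made up), a per-replica breaker along these lines could look roughly like this in Go:

```go
// Conceptual sketch only: a per-replica circuit breaker that trips once
// proposals stop applying in time, fails new requests fast, and probes for
// recovery. Names and thresholds are illustrative, not CockroachDB code.
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

var errUnavailable = errors.New("replica circuit breaker tripped: range unavailable")

type replicaBreaker struct {
	tripped int32 // 1 while the breaker is open (range considered unavailable)
}

// propose runs fn (standing in for a Raft proposal) unless the breaker is
// open, in which case it fails fast instead of blocking indefinitely. If fn
// does not complete within timeout, the breaker trips.
func (b *replicaBreaker) propose(ctx context.Context, timeout time.Duration, fn func(context.Context) error) error {
	if atomic.LoadInt32(&b.tripped) == 1 {
		return errUnavailable // fail fast; don't queue more work on the range
	}
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	done := make(chan error, 1)
	go func() { done <- fn(ctx) }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		atomic.StoreInt32(&b.tripped, 1) // proposal didn't apply in time
		return errUnavailable
	}
}

// probe periodically runs a cheap health check and closes the breaker again
// once the range recovers.
func (b *replicaBreaker) probe(ctx context.Context, every time.Duration, healthy func(context.Context) error) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if atomic.LoadInt32(&b.tripped) == 1 && healthy(ctx) == nil {
				atomic.StoreInt32(&b.tripped, 0)
			}
		}
	}
}

func main() {
	b := &replicaBreaker{}
	// Simulate a proposal that never applies: the breaker trips, and every
	// subsequent call fails fast instead of blocking.
	err := b.propose(context.Background(), 100*time.Millisecond, func(context.Context) error { select {} })
	fmt.Println(err)
	fmt.Println(b.propose(context.Background(), 100*time.Millisecond, func(context.Context) error { return nil }))
}
```

The point of the sketch is the behaviour described above: once tripped, new operations stop blocking on the unavailable range, and only a lightweight probe keeps checking for recovery.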
Now that we have a draft prototype the next step is to create a more robust test that covers the scenario that motivated this work.
From: @tbg
The setup could be done via testcluster, and basically it's:

- create two tables, accounts and account_history
- workload does insert into accounts and insert into account_history (maybe we can find a way to make this less synthetic, i.e. motivate why we'd write to both in that order)
- account_history goes down (just use manual replication and stop a node, or maybe even better, stall the apply loop as you're doing in the PR right now)
- should see the failfast error starting to pop up from the workload
- use SQL to switch out the table account_history (rename the old table, recreate the new table)
- workload should make progress again.

There's an advanced version of this, where the txn first writes to account_history (and is thus anchored there); in that case the test can't pass reliably, as a txn with intents on both tables won't be abortable, so it will block access to its row on accounts as well.
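As a rough sketch of the workload half of that scenario (assuming a cluster is already running; the connection string, column definitions, pacing, and the `lib/pq` driver choice are illustrative, not part of the actual test):

```go
// Sketch of the workload in the scenario above: one txn per iteration that
// writes accounts first, then account_history, so the txn is anchored on
// accounts while account_history's range is the one that becomes unavailable.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	mustExec(db, `CREATE TABLE IF NOT EXISTS accounts (id INT PRIMARY KEY, balance INT)`)
	mustExec(db, `CREATE TABLE IF NOT EXISTS account_history (id SERIAL PRIMARY KEY, account_id INT, delta INT)`)

	for i := 0; ; i++ {
		err := func() error {
			tx, err := db.Begin()
			if err != nil {
				return err
			}
			defer func() { _ = tx.Rollback() }() // no-op after a successful Commit
			if _, err := tx.Exec(`UPSERT INTO accounts (id, balance) VALUES ($1, $2)`, i%10, i); err != nil {
				return err
			}
			if _, err := tx.Exec(`INSERT INTO account_history (account_id, delta) VALUES ($1, 1)`, i%10); err != nil {
				return err
			}
			return tx.Commit()
		}()
		if err != nil {
			// Once account_history's range loses quorum, this should surface
			// quickly as a fail-fast error instead of hanging forever.
			fmt.Println("workload error:", err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}

func mustExec(db *sql.DB, stmt string) {
	if _, err := db.Exec(stmt); err != nil {
		log.Fatal(err)
	}
}
```

The recovery step is then plain SQL, e.g. renaming `account_history` out of the way and recreating it, after which the workload should start making progress again.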
This issue should also address #60617 (now closed and folded into this one):
We currently repropose inflight entries if they haven't popped out of the raft log within a "reasonable amount of time" (X ticks). Under overload conditions, this can lead to adding more work to an already overloaded system.
When we circuit-break, we should enforce emptiness of the proposals map (other than probes we send to check for recovery).
https://github.com/cockroachlabs/support/issues/939 is an example where internal circuit breakers around downed replicas would have prevented an outage. We hadn't lost quorum there; the follower replicas requesting leases took upwards of minutes to acquire one (something now made better by https://github.com/cockroachdb/cockroach/pull/55148, which fixed https://github.com/cockroachdb/cockroach/issues/37906). Because the inserts could now take minutes to run (the customer didn't make use of statement timeouts), it presented as an outage.
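For reference, the client-side guard the customer wasn't using is the `statement_timeout` session variable. A minimal sketch (the connection string and the 10s value are arbitrary; since `SET` is per session and `*sql.DB` is a pool, the sketch pins a single connection):

```go
// Minimal sketch of a per-session statement timeout, so statements against a
// slow or unavailable range return an error instead of stalling for minutes.
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	ctx := context.Background()
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Pin one connection so the SET applies to the statements that follow.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	if _, err := conn.ExecContext(ctx, `SET statement_timeout = '10s'`); err != nil {
		log.Fatal(err)
	}

	var now time.Time
	if err := conn.QueryRowContext(ctx, `SELECT now()`).Scan(&now); err != nil {
		fmt.Println("statement failed (possibly timed out):", err)
		return
	}
	fmt.Println("statement ok at", now)
}
```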
From one of our customers on a recent call:
"Ultimately, if a query will fail because of a range is unavailable, they believe this is information that should be passed up to the app so the app can make the right decision. There is a difference between a statement timing out because it took too long (i.e. contention) and should be tried again, vs a range is unavailable and no amount of re-tries will succeed. So they are objecting to bundling these two different failures into the same timeout. And also that queries should wait for a range that the cluster knows is not available."
Specifically, the error that the application receives should make the cause clear. As an application developer, if I get a timeout it's important that I know whether it means "timeout and range XYZ is unavailable" or "timeout and nodes 1, 2, 3 are down".
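To make that distinction concrete from the application's side, here is a hedged sketch of how a Go client can classify errors today: only the documented `40001` retry code is treated as retryable, and everything else lands in one undifferentiated bucket, which is exactly the gap this feedback is about (driver choice, connection string, and table are illustrative):

```go
// Sketch: classify errors from a CockroachDB statement on the client side.
// Today, contention timeouts and unavailable ranges are hard to tell apart;
// the request above is to make the latter an explicit, clearly-worded error.
package main

import (
	"database/sql"
	"errors"
	"fmt"
	"log"

	"github.com/lib/pq"
)

func classify(err error) string {
	var pqErr *pq.Error
	if errors.As(err, &pqErr) {
		switch pqErr.Code {
		case "40001":
			return "retryable: serialization conflict, retry the transaction"
		default:
			// This bucket currently mixes "took too long, retry later" with
			// "range is unavailable, no retry will help".
			return fmt.Sprintf("non-retryable (code %s): %s", pqErr.Code, pqErr.Message)
		}
	}
	return "driver/network error: " + err.Error()
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec(`UPSERT INTO accounts (id, balance) VALUES (1, 100)`); err != nil {
		fmt.Println(classify(err))
	}
}
```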
(moved to new comment at the bottom)
@mwang1026 I think you wanted to reach out to the home teams re:
edit: also this one https://github.com/cockroachdb/cockroach/issues/75068
Moving earlier checklist comment here, for better visibility.
22.1:
Other issues, related either to per-replica circuit breakers or to the broader problem area of handling "slow"/unservable requests across the stack, but not currently in scope:
Replica circuit breakers are enabled on master and release-22.1. There are follow-up issues filed for future extensions, see above.
Describe the problem
I'm testing a multi-region deployment. I followed the performance tuning docs and installed a multi-datacenter cluster. The DC names are `DCPRI` and `DCSEC`, with 3 virtual machines in each. When I shut down the systems in `DCSEC`, the cluster in `DCPRI` becomes unresponsive: no SQL statements work, they just wait, and the admin GUI times out. Is this caused by my configuration, is it a bug, or is it expected behaviour?
Here are the start parameters of the cockroach instances:
Logs from the nodes in `DCPRI`:

DCPRI/Node1:
DCPRI/Node4:
DCPRI/Node3:
Epic: CRDB-2553
Jira issue: CRDB-6349