cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kvserver: circuit-break requests to unavailable ranges #33007

Closed derkan closed 2 years ago

derkan commented 5 years ago

Describe the problem

I'm testing a multi-region deployment. Following the performance tuning docs, I installed a multi-datacenter cluster. The datacenter names are DCPRI and DCSEC, with 3 virtual machines in each.
When I shut down the systems in DCSEC, the cluster in DCPRI becomes unresponsive: no SQL statements work, they just wait. The admin GUI also times out.

Is this caused by my configuration, is it a bug, or is it expected behaviour?

Here are the start parameters of the cockroach instances:

# DCPRI/Node 1
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcpri --listen-addr=10.35.14.101 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
# DCPRI/Node 2
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcpri --listen-addr=10.35.14.102 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
# DCPRI/Node 3
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcpri --listen-addr=10.35.14.103 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
# DCSEC/Node 4
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcsec --listen-addr=10.35.14.104 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
# DCSEC/Node 5
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcsec --listen-addr=10.35.14.105 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
# DCSEC/Node 6
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcsec --listen-addr=10.35.14.106 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
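For context on why a symmetric two-DC cluster stalls when one DC goes down: Raft requires a majority of each range's replicas to make progress. A minimal sketch of the arithmetic (illustrative only, not CockroachDB code):

```go
package main

import "fmt"

// quorum returns how many replicas of a Raft group must be reachable
// for the group to make progress: a strict majority.
func quorum(replicas int) int {
	return replicas/2 + 1
}

func main() {
	// With the default replication factor of 3 spread over two
	// datacenters, every range keeps either 1 or 2 of its 3 replicas
	// in DCSEC. Ranges with 2 replicas in DCSEC lose quorum (2 of 3)
	// when DCSEC goes down, so requests to them stall.
	fmt.Println(quorum(3)) // 2
	fmt.Println(quorum(5)) // 3
}
```

This is why two symmetric datacenters cannot survive the loss of either one: roughly half the ranges will have their majority in the failed DC.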

Logs from the nodes in DCPRI.

DCPRI/Node 1:

W181210 15:30:45.781750 136672 vendor/google.golang.org/grpc/clientconn.go:942  Failed to dial 10.10.14.105:26257: context canceled; please retry.
W181210 15:30:45.781790 139 storage/store.go:3910  [n1,s1] handle raft ready: 1.3s [processed=0]
I181210 15:30:46.795946 136580 server/authentication.go:374  Web session error: http: named cookie not present
I181210 15:30:46.799351 136228 server/authentication.go:374  Web session error: http: named cookie not present
W181210 15:30:46.909608 136729 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.104:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.10.14.104:26257: connect: no route to host". Reconnecting...
W181210 15:30:46.909690 136729 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.104:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W181210 15:30:46.909704 136729 vendor/google.golang.org/grpc/clientconn.go:942  Failed to dial 10.10.14.104:26257: grpc: the connection is closing; please retry.
W181210 15:30:46.909709 142 storage/store.go:3910  [n1,s1] handle raft ready: 1.8s [processed=0]

DCPRI/Node4:

W181210 15:31:22.968999 71750 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.105:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.10.14.105:26257: connect: no route to host". Reconnecting...
I181210 15:31:22.969066 136 rpc/nodedialer/nodedialer.go:91  [n4] unable to connect to n6: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.10.14.105:26257: connect: no route to host"
W181210 15:31:22.969068 71750 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.105:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W181210 15:31:22.969084 136 storage/store.go:3910  [n4,s4] handle raft ready: 1.1s [processed=0]
W181210 15:31:22.969095 71750 vendor/google.golang.org/grpc/clientconn.go:942  Failed to dial 10.10.14.105:26257: context canceled; please retry.
W181210 15:31:23.772921 71770 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.106:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.10.14.106:26257: connect: no route to host". Reconnecting...
W181210 15:31:23.772980 71770 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.106:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W181210 15:31:23.772988 71770 vendor/google.golang.org/grpc/clientconn.go:942  Failed to dial 10.10.14.106:26257: grpc: the connection is closing; please retry.
W181210 15:31:23.773029 112 storage/store.go:3910  [n4,s4] handle raft ready: 1.3s [processed=0]

DCPRI/Node3:

I181210 15:31:50.649944 139 server/status/runtime.go:465  [n3] runtime stats: 253 MiB RSS, 224 goroutines, 140 MiB/18 MiB/175 MiB GO alloc/idle/total, 56 MiB/72 MiB CGO alloc/total, 24.1 CGO/sec, 0.7/0.3 %(u/s)time, 0.0 %gc (0x), 108 KiB/80 KiB (r/w)net
W181210 15:31:50.941579 69897 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.106:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.10.14.106:26257: connect: no route to host". Reconnecting...
W181210 15:31:50.941670 69897 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.106:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W181210 15:31:50.941683 69897 vendor/google.golang.org/grpc/clientconn.go:942  Failed to dial 10.10.14.106:26257: grpc: the connection is closing; please retry.
W181210 15:31:51.217542 69854 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.105:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.10.14.105:26257: connect: no route to host". Reconnecting...
W181210 15:31:51.217615 31 storage/store.go:3910  [n3,s3] handle raft ready: 1.2s [processed=0]
W181210 15:31:51.217622 69854 vendor/google.golang.org/grpc/clientconn.go:1293  grpc: addrConn.createTransport failed to connect to {10.10.14.105:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W181210 15:31:51.217648 69854 vendor/google.golang.org/grpc/clientconn.go:942  Failed to dial 10.10.14.105:26257: context canceled; please retry.

Epic: CRDB-2553

Jira issue: CRDB-6349

derkan commented 5 years ago

We had a chat on Gitter with @mberhault. He kindly explained how a CockroachDB cluster works across multiple datacenters. @mberhault suggested this doc and this doc, which were helpful.

But unfortunately, with 2 datacenters, if one DC is lost, the database cluster in the other DC will not be available to our services, which means no data is available afterwards. (The same applies to a 3-DC cluster if 2 of the 3 DCs are lost.)

I think CockroachDB needs a mechanism to promote the last available DC to stay online when the others are down, whether manual or automatic. I hit issue #32524 in a single-DC, 3-node cluster, and I still cannot recover some rows from some tables: SQL statements on those tables wait forever. This forever-wait behaviour should not exist either; instead, an exception should be raised saying that some ranges are not accessible, because applications are blocked by these waits.

knz commented 5 years ago

@tim-o please follow this discussion

tim-o commented 5 years ago

Hey @derkan - it sounds like there are two requests here: 1) Raise an exception if ranges are not accessible, and 2) Support a use case where a cluster can continue running after losing half of its replicas.

Is that correct? If so, the first certainly seems reasonable, but the second would be a very large change. We do have ongoing work to allow single replicas to serve reads when quorum can't be reached ("follower reads"), but this would not allow writes.

Let me know if that would suit your use case and if I've captured your feedback above.

Thanks!

derkan commented 5 years ago

Thanks @tim-o

tim-o commented 5 years ago

The first definitely seems uncontroversial to me. @awoods187 can you comment on the second request above? Does this fall under an existing project or is it net new / unspecced?

@derkan we definitely do want to be fault tolerant, but I'd be concerned that there's no way to break a 'tie' in a world where we allow a cluster to operate with consensus from only 50% of the nodes. I'll leave it to someone more familiar with core to comment on that though.

tbg commented 5 years ago

Falling back to some kind of "inconsistent reads" mode when quorum is lost is very difficult to implement correctly. For starters, none of the internal invariants of the SQL layer are expected to hold in that situation.

knz commented 5 years ago

I have requalified the issue to become a feature request for a timeout on requests to unavailable ranges.

Unfortunately, our internal work backlog is a bit congested at the moment, but we'll keep this UX improvement on our radar. Thank you for the suggestion.

@awoods187 @piyush-singh this request pertains to the UX of partially unavailable clusters. Maybe this fits into a larger roadmap picture you already have?

tbg commented 5 years ago

FWIW, this is straightforward to implement. Tentatively putting this on my plate for 2.2, it would also aid our debugging immensely.

tbg commented 3 years ago

@nvanbenschoten also points out that with parallel commits in a single phase, even txns that have their first write to an unavailable range may not realize they should stop there. This is due to pipelining. From that pov it also makes sense to fail-fast so that these txns don't keep their unrelated intents alive. Also, our deadlock detection won't work if txns are anchored on unavailable ranges.

tbg commented 3 years ago

@lunevalex is planning to prototype around this issue. We are considering adding a circuit breaker on Replica that uses suitable signals (TBD) to determine whether the range is unavailable (as in, proposed commands don't get applied in a timely manner). Transactions that are anchored on, or parallel-committed with intents on, an unavailable range will forcibly remain open until the range is recovered. Even so, we see value in preventing any new operations from blocking on the range, to limit the blast radius of the unavailability. This is motivated by a recent real-world outage in which a single range's unavailability caused transactions accessing it to block and, in turn, keep their intents elsewhere open indefinitely. The resulting cascading contention was difficult to get under control even after the user had removed writes to the table from the app.
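The shape of such a breaker can be sketched as follows. All names here are hypothetical illustrations of the idea, not CockroachDB's actual implementation: the breaker trips when a proposal stalls past a deadline, new requests then fail fast, and a background probe resets it on recovery.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errRangeUnavailable is returned to callers while the breaker is tripped.
var errRangeUnavailable = errors.New("range unavailable: circuit breaker tripped")

// replicaBreaker fails new requests fast once the range stops applying
// proposals, instead of letting them block indefinitely.
type replicaBreaker struct {
	mu      sync.Mutex
	tripped bool
}

// onProposalStalled trips the breaker when a proposal has not applied
// within a deadline, e.g. because the range lost quorum.
func (b *replicaBreaker) onProposalStalled() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tripped = true
}

// onProbeSucceeded resets the breaker once a probe command applies again,
// i.e. the range has recovered.
func (b *replicaBreaker) onProbeSucceeded() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tripped = false
}

// check is consulted before a new request is allowed to block on the range.
func (b *replicaBreaker) check() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tripped {
		return errRangeUnavailable
	}
	return nil
}

func main() {
	var b replicaBreaker
	b.onProposalStalled()
	fmt.Println(b.check()) // fails fast while the range is unavailable
	b.onProbeSucceeded()
	fmt.Println(b.check() == nil) // true once recovered
}
```

Note that this only limits new damage: requests already blocked, and transactions anchored on the range, still wait for recovery.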

lunevalex commented 3 years ago

Now that we have a draft prototype the next step is to create a more robust test that covers the scenario that motivated this work.

From: @tbg

The setup could be done via testcluster, and basically it's:

- create two tables, accounts and account_history
- workload does insert into accounts and insert into account_history (maybe we can find a way to make this less synthetic, i.e. motivate why we'd write to both in that order)
- account_history goes down (just use manual replication and stop a node, or maybe even better, stall the apply loop as you're doing in the PR right now)
- should see the fail-fast error starting to pop up from the workload
- use SQL to switch out the account_history table (rename the old table, recreate the new table)
- workload should make progress again

There's an advanced version of this, where the txn first writes to account_history (and is thus anchored there). In that case the test can't pass reliably: a txn with intents on both tables won't be abortable, so it will block access to its row in accounts as well.

tbg commented 3 years ago

This issue should also address #60617 (now closed and folded into this one):

We currently repropose inflight entries if they haven't popped out of the raft log within a "reasonable amount of time" (X ticks). Under overload conditions, this can lead to adding more work to an already overloaded system.

When we circuit-break, we should enforce emptiness of the proposals map (other than probes we send to check for recovery).

irfansharif commented 3 years ago

https://github.com/cockroachlabs/support/issues/939 is an example where internal circuit breakers to downed replicas would have prevented an outage. We hadn't lost quorum there; the follower replicas requesting leases took upwards of minutes to acquire one (something now improved by https://github.com/cockroachdb/cockroach/pull/55148, which fixed https://github.com/cockroachdb/cockroach/issues/37906). Because the inserts could then take minutes to run (the customer didn't use statement timeouts), it presented as an outage.

mwang1026 commented 3 years ago

From one of our customers on a recent call:

"Ultimately, if a query will fail because a range is unavailable, they believe this information should be passed up to the app so the app can make the right decision. There is a difference between a statement timing out because it took too long (i.e. contention) and should be tried again, versus a range being unavailable where no amount of retries will succeed. So they object to bundling these two different failures into the same timeout, and also to queries waiting on a range that the cluster knows is not available."

Specifically, the error the application receives should make the cause clear. As an application developer, if I get a timeout, I need to know whether it means "timeout because range XYZ is unavailable" or "timeout because nodes 1, 2, and 3 are down".
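The client-side distinction the customer is asking for might look like this sketch. The sentinel errors and messages are hypothetical, invented for illustration; CockroachDB's real error types and codes differ:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors, for illustration only.
var (
	errStatementTimeout = errors.New("query timed out: contention, safe to retry")
	errRangeUnavailable = errors.New("range unavailable: replica circuit breaker tripped")
)

// shouldRetry encodes the requested distinction: a contention timeout is
// worth retrying, while a tripped circuit breaker means no number of
// retries will succeed and the app should fail over or alert instead.
func shouldRetry(err error) bool {
	return !errors.Is(err, errRangeUnavailable)
}

func main() {
	fmt.Println(shouldRetry(errStatementTimeout)) // true
	fmt.Println(shouldRetry(errRangeUnavailable)) // false
}
```

The point is that this decision can only be made in the app if the server surfaces a distinct, identifiable error rather than folding both cases into one generic timeout.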

tbg commented 2 years ago

(moved to new comment at the bottom)

tbg commented 2 years ago

@mwang1026 I think you wanted to reach out to the home teams re:

- #74713

- #74502

edit: also this one https://github.com/cockroachdb/cockroach/issues/75068

tbg commented 2 years ago

Moving earlier checklist comment here, for better visibility.

22.1:

Other issues, either related to per-replica circuit breakers or to the problem area of handling "slow"/unservable requests across the stack, but they're not currently in scope:

tbg commented 2 years ago

Replica circuit breakers are enabled on master and release-22.1. There are follow-up issues filed for future extensions, see above.