
roachtest: splits/largerange/size=32GiB,nodes=6 failed [raft snaps; needs #106813] #104588

Closed cockroach-teamcity closed 1 year ago

cockroach-teamcity commented 1 year ago

roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on master @ c7a7e423506c88f655c08d6ba6acd6e907e1dae5:

test artifacts and logs in: /artifacts/splits/largerange/size=32GiB_nodes=6/run_1
(sql_runner.go:203).Query: error executing 'SHOW RANGES FROM TABLE bank.bank': pq: replica unavailable: (n3,s3):5 unable to serve request to r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n6,s6):4, (n3,s3):5, next=7, gen=10]: closed timestamp: 1686231735.000116914,0 (2023-06-08 13:42:15); raft status: {"id":"5","term":8,"vote":"5","commit":6636,"lead":"5","raftState":"StateLeader","applied":6636,"progress":{"2":{"match":6636,"next":6637,"state":"StateReplicate"},"3":{"match":6636,"next":6637,"state":"StateReplicate"},"4":{"match":6636,"next":6637,"state":"StateReplicate"},"5":{"match":6636,"next":6637,"state":"StateReplicate"},"1":{"match":6636,"next":6637,"state":"StateReplicate"}},"leadtransferee":"0"}: encountered poisoned latch /Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"@0,0
(monitor.go:137).Wait: monitor failure: monitor task failed: t.Fatal() was called

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)
See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

/cc @cockroachdb/kv-triage


Jira issue: CRDB-28616

pav-kv commented 1 year ago
encountered poisoned latch /Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"@0,0

while querying r1

pav-kv commented 1 year ago

Could be an instance of #102936

pav-kv commented 1 year ago

The first error mentioning r548 is the circuit breaker tripping while n3's replica is still in a demoting state:

13:41:43.916584 kv/kvserver/replica_circuit_breaker.go:175 ⋮ [T1,n3,s3,r548/2:‹/Table/106/1/119{1187…-3768…}›] breaker: tripped with error: replica unavailable: (n3,s3):2VOTER_DEMOTING_LEARNER unable to serve request to r548:‹/Table/106/1/119{118742-376853}› [(n1,s1):1, (n3,s3):2VOTER_DEMOTING_LEARNER, (n4,s4):3, (n2,s2):4VOTER_INCOMING, next=5, gen=475]: closed timestamp: 1686231639.296513274,0 (2023-06-08 13:40:39); raft status: {"id":"2","term":7,"vote":"2","commit":29,"lead":"2","raftState":"StateLeader","applied":29,"progress":{"1":{"match":0,"next":1,"state":"StateSnapshot"},"2":{"match":50,"next":51,"state":"StateReplicate"},"3":{"match":50,"next":52,"state":"StateReplicate"},"4":{"match":0,"next":1,"state":"StateSnapshot"}},"leadtransferee":"0"}: have been waiting 62.00s for slow proposal ConditionalPut [/Local/Range/Table/106/1/‹119118742›/‹RangeDescriptor›,/Min), [txn: 97e8d7ec]

Exactly the same poisoned latch (key ‹/Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"›@0,0) pops up in logs a few times shortly after:

13:41:43.930897 kv/kvserver/reports/reporter.go:159 ⋮ [T1,n3,replication-reporter] failed to generate replication reports: failed to compute constraint conformance report: <same latch>
13:41:43.931478 kv/txn.go:878 ⋮ [T1,n4] async rollback failed: <same latch>: "change-replica" meta={id=97e8d7ec key=/Local/Range/Table/106/1/‹119118742›/‹RangeDescriptor› iso=Serializable pri=0.03170581 epo=0 ts=1686231642.299152484,0 min=1686231642.299152484,0 seq=1} lock=true stat=PENDING rts=1686231642.299152484,0 wto=false gul=1686231642.799152484,0

Between 13:41:44-13:42:15, there are a few "poisoned latch" errors with the same key logged on nodes 1, 2, 4, 5 (from line kv/kvserver/queue.go:1146):

[T1,n4,replicaGC,s4,r155/3:‹/Table/106/1/21{294673-552784}›]
[T1,n5,replicaGC,s5,r155/5:‹/Table/106/1/21{294673-552784}›]
[T1,n1,replicaGC,s1,r3/1:‹/System/{NodeLive…-tsd}›]
[T1,n2,replicaGC,s2,r514/4:‹/Table/106/1/110{3429…-6010…}›]
[T1,n5,replicaGC,s5,r163/5:‹/Table/106/1/23{359561-617672}›]
[T1,n2,replicaGC,s2,r502/4:‹/Table/106/1/107{2456…-5037…}›]
[T1,n1,replicaGC,s1,r182/1:‹/Table/106/1/28{263670-521781}›]
[T1,n5,replicaGC,s5,r85/5:‹/Table/106/1/3{121855-381967}›]
[T1,n2,replicaGC,s2,r199/5:‹/Table/106/1/32{651557-909668}›]
tbg commented 1 year ago

Salvaged the artifacts, https://drive.google.com/file/d/16bkGUEoTaTMg9XeAXOjX8pfsHebxFS7B/view?usp=share_link

erikgrinaker commented 1 year ago

@tbg says:

It's weird to me that we're seeing someone take out that latch for r548 but while on r1. Maybe some weird interaction where r548 got merged into r1 while the latch was held, but how would that work (merges acquire that latch too)?

@erikgrinaker says:

What if it was synchronously resolving an intent it found on r1, which was written by txn 97e8d7ec anchored on r548? If the synchronous intent resolution hit a poisoned latch on the abort span on r548 (which it would, because that txn was the one that tripped the breaker and got aborted), the poisoned latch error might have been propagated back up to the caller and reported as a poisoned latch on r1.
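
One way to picture that hypothesis (hypothetical function names, not CockroachDB's actual call chain): the error from the txn's range propagates unchanged through the request that triggered the resolution, so it surfaces under the wrong range.

```go
// Illustrative sketch (hypothetical names, not CockroachDB's APIs) of the
// hypothesis: a request on r1 synchronously resolves an intent whose txn
// record lives on r548; the poisoned-latch error returned for r548's abort
// span is wrapped and returned by the r1 request, so the report points at r1.
package main

import (
	"errors"
	"fmt"
)

var errPoisonedLatch = errors.New(`encountered poisoned latch /Local/RangeID/548/r/AbortSpan/"97e8d7ec..."`)

// resolveIntentOnTxnRange stands in for the ResolveIntent sent to the range
// holding the txn's abort span (r548), whose circuit breaker has tripped.
func resolveIntentOnTxnRange() error { return errPoisonedLatch }

// serveRequestOnR1 stands in for evaluating the SHOW RANGES scan on r1: it
// finds an intent and resolves it synchronously before returning.
func serveRequestOnR1() error {
	if err := resolveIntentOnTxnRange(); err != nil {
		return fmt.Errorf("(n3,s3):5 unable to serve request to r1: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(serveRequestOnR1()) // a poisoned latch from r548, reported under r1
}
```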

pav-kv commented 1 year ago

Contention events log has something to say about the aborted transaction 97e8d7ec:

debug/nodes/1/crdb_internal.node_contention_events.json:

{
   "count" : "1",
   "cumulative_contention_time" : "00:00:00.250195",
   "index_id" : "106",
   "key" : "\\x03f289f9071d8bd5",
   "num_contention_events" : "2",
   "table_id" : "4294967163",
   "txn_id" : "97e8d7ec-2a38-4880-9df6-3b51abc9086c"
}

debug/crdb_internal.transaction_contention_events.json:

{
   "blocking_txn_fingerprint_id" : "\\x0000000000000000",
   "blocking_txn_id" : "97e8d7ec-2a38-4880-9df6-3b51abc9086c",
   "collection_ts" : "2023-06-08 13:42:19.554013+00",
   "contending_key" : "\\x03f289f9071d8bd5",
   "contending_pretty_key" : "/Meta2/Table/106/1/119376853",
   "contention_duration" : "00:00:00.052876",
   "database_name" : "",
   "index_name" : "",
   "schema_name" : "",
   "table_name" : "",
   "waiting_stmt_fingerprint_id" : "\\x66de6ed8999bbf0e",
   "waiting_stmt_id" : "1766b2d540926e3c0000000000000001",
   "waiting_txn_fingerprint_id" : "\\x0000000000000000",
   "waiting_txn_id" : "55eed750-2e00-4680-b85b-d045eb9d45fb"
}

Execution insights on the contending transaction 55eed750 (looks like a failed consistency check):

{
   "app_name" : "",
   "causes" : "{}",
   "contention" : "00:00:00",
   "cpu_sql_nanos" : "6072049",
   "database_name" : "defaultdb",
   "end_time" : "2023-06-08 13:42:19.556586",
   "error_code" : "XXUUU",
   "exec_node_ids" : "{1}",
   "full_scan" : "f",
   "implicit_txn" : "t",
   "index_recommendations" : "{}",
   "last_retry_reason" : "NULL",
   "plan_gist" : "AgICABoMAwUIBgg=",
   "priority" : "normal",
   "problem" : "FailedExecution",
   "query" : "SELECT t.range_id, t.start_key_pretty, t.status, t.detail FROM ROWS FROM (crdb_internal.check_consistency(_, '_', '_')) AS t WHERE t.status NOT IN ('_', '_')",
   "retries" : "0",
   "rows_read" : "0",
   "rows_written" : "0",
   "session_id" : "1766b2d53cfd91a80000000000000001",
   "start_time" : "2023-06-08 13:42:19.480564",
   "status" : "Failed",
   "stmt_fingerprint_id" : "\\x66de6ed8999bbf0e",
   "stmt_id" : "1766b2d540926e3c0000000000000001",
   "txn_fingerprint_id" : "\\x3ff3476808ae6c64",
   "txn_id" : "55eed750-2e00-4680-b85b-d045eb9d45fb",
   "user_name" : "root"
}
pav-kv commented 1 year ago

r548 circuit breaker tripped with the following error:

"circuit_breaker_error": "replica unavailable: (n3,s3):2VOTER_DEMOTING_LEARNER unable to serve request to r548:/Table/106/1/119{118742-376853} [(n1,s1):1, (n3,s3):2VOTER_DEMOTING_LEARNER, (n4,s4):3, (n2,s2):4VOTER_INCOMING, next=5, gen=475]: closed timestamp: 1686231639.296513274,0 (2023-06-08 13:40:39); raft status: {\"id\":\"2\",\"term\":7,\"vote\":\"2\",\"commit\":29,\"lead\":\"2\",\"raftState\":\"StateLeader\",\"applied\":29,\"progress\":{\"1\":{\"match\":0,\"next\":1,\"state\":\"StateSnapshot\"},\"2\":{\"match\":109,\"next\":110,\"state\":\"StateReplicate\"},\"3\":{\"match\":109,\"next\":113,\"state\":\"StateReplicate\"},\"4\":{\"match\":0,\"next\":1,\"state\":\"StateSnapshot\"}},\"leadtransferee\":\"0\"}: operation \"probe\" timed out after 1m0.002s (given timeout 1m0s): result is ambiguous: after 60.00s of attempting command: context deadline exceeded"`

More details on r548 on n3:

Details ```json "span": { "start_key": "/Table/106/1/119118742", "end_key": "/Table/106/1/119376853" }, "raft_state": { "replica_id": 2, "hard_state": { "term": 7, "vote": 2, "commit": 29 }, "lead": 2, "state": "StateLeader", "applied": 29, "progress": { "1": { "next": 1, "state": "StateSnapshot", "paused": true, "pending_snapshot": 29 }, "2": { "match": 161, "next": 162, "state": "StateReplicate" }, "3": { "match": 161, "next": 162, "state": "StateReplicate" }, "4": { "next": 1, "state": "StateSnapshot", "paused": true, "pending_snapshot": 26 } } }, "state": { "state": { "raft_applied_index": 29, "lease_applied_index": 9, "desc": { "range_id": 548, "start_key": "8on5Bxmblg==", "end_key": "8on5Bx2L1Q==", "internal_replicas": [ { "node_id": 1, "store_id": 1, "replica_id": 1, "type": 0 }, { "node_id": 3, "store_id": 3, "replica_id": 2, "type": 4 }, { "node_id": 4, "store_id": 4, "replica_id": 3, "type": 0 }, { "node_id": 2, "store_id": 2, "replica_id": 4, "type": 2 } ], "next_replica_id": 5, "generation": 475, "sticky_bit": {} }, "lease": { "start": { "wall_time": 1686231642153434301 }, "expiration": { "wall_time": 1686231648153352643 }, "replica": { "node_id": 3, "store_id": 3, "replica_id": 2, "type": 0 }, "proposed_ts": { "wall_time": 1686231642153352643 }, "sequence": 5, "acquisition_type": 1 }, "truncated_state": { "index": 10, "term": 5 }, "gc_threshold": {}, "stats": { "contains_estimates": 0, "last_update_nanos": 1686231615586276112, "intent_age": 0, "gc_bytes_age": 0, "live_bytes": 33554430, "live_count": 258111, "key_bytes": 5420331, "key_count": 258111, "val_bytes": 28134099, "val_count": 258111, "intent_bytes": 0, "intent_count": 0, "separated_intent_count": 0, "range_key_count": 0, "range_key_bytes": 0, "range_val_count": 0, "range_val_bytes": 0, "sys_bytes": 866, "sys_count": 8, "abort_span_bytes": 0 }, "version": { "major": 23, "minor": 1, "patch": 0, "internal": 8 }, "raft_closed_timestamp": { "wall_time": 1686231639296513274 }, "raft_applied_index_term": 7, "gc_hint": { "latest_range_delete_timestamp": {} } }, "last_index": 161, "num_pending": 4, "raft_log_size": 43070, "approximate_proposal_quota": 8388183, "proposal_quota_base_index": 29, "range_max_bytes": 67108864, "active_closed_timestamp": { "wall_time": 1686231639296513274 }, "tenant_id": 1, "closed_timestamp_sidetransport_info": { "replica_closed": { "wall_time": 1686231639279698235 }, "replica_lai": 7, "central_closed": {} }, "circuit_breaker_error": "replica unavailable: (n3,s3):2VOTER_DEMOTING_LEARNER unable to serve request to r548:/Table/106/1/119{118742-376853} [(n1,s1):1, (n3,s3):2VOTER_DEMOTING_LEARNER, (n4,s4):3, (n2,s2):4VOTER_INCOMING, next=5, gen=475]: closed timestamp: 1686231639.296513274,0 (2023-06-08 13:40:39); raft status: {\"id\":\"2\",\"term\":7,\"vote\":\"2\",\"commit\":29,\"lead\":\"2\",\"raftState\":\"StateLeader\",\"applied\":29,\"progress\":{\"1\":{\"match\":0,\"next\":1,\"state\":\"StateSnapshot\"},\"2\":{\"match\":109,\"next\":110,\"state\":\"StateReplicate\"},\"3\":{\"match\":109,\"next\":113,\"state\":\"StateReplicate\"},\"4\":{\"match\":0,\"next\":1,\"state\":\"StateSnapshot\"}},\"leadtransferee\":\"0\"}: operation \"probe\" timed out after 1m0.002s (given timeout 1m0s): result is ambiguous: after 60.00s of attempting command: context deadline exceeded" } ```
| Replica | Type | State | Node |
| --- | --- | --- | --- |
| 1 | `VOTER_FULL` | `StateSnapshot` @ 29 | 1 |
| 2 (leader) | `VOTER_DEMOTING_LEARNER` | `StateReplicate` @ 161 | 3 |
| 3 | `VOTER_FULL` | `StateReplicate` @ 161 | 4 |
| 4 | `VOTER_INCOMING` | `StateSnapshot` @ 26 | 2 |

We see that the incoming configuration (replicas 1, 3, 4) has two replicas in StateSnapshot, so we cannot append to or commit in it until those snapshots go through. Notably, this is a "large range" test, so the snapshots can be big and slow to get through.
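
To make the quorum math explicit, here is a minimal, self-contained sketch (not CockroachDB code) of the joint-configuration rule: an entry commits only if it is acknowledged by a majority of both the outgoing and the incoming voter sets, and a follower stuck in StateSnapshot cannot acknowledge new entries.

```go
// Minimal sketch of joint-configuration commit for the replica table above.
package main

import "fmt"

func hasQuorum(voters []int, canAck map[int]bool) bool {
	acked := 0
	for _, id := range voters {
		if canAck[id] {
			acked++
		}
	}
	return acked > len(voters)/2
}

func main() {
	// Replica IDs from the table above; only replicas 2 and 3 are in
	// StateReplicate, replicas 1 and 4 are waiting for snapshots.
	canAck := map[int]bool{1: false, 2: true, 3: true, 4: false}
	outgoing := []int{1, 2, 3} // replicas 1, 2, 3 = stores s1, s3, s4
	incoming := []int{1, 3, 4} // replicas 1, 3, 4 = stores s1, s4, s2

	fmt.Println("outgoing quorum:", hasQuorum(outgoing, canAck)) // true (2 of 3)
	fmt.Println("incoming quorum:", hasQuorum(incoming, canAck)) // false (1 of 3)
	fmt.Println("joint commit possible:",
		hasQuorum(outgoing, canAck) && hasQuorum(incoming, canAck)) // false
}
```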

pav-kv commented 1 year ago
# INITIAL snapshot at index 25 arrived at replica 4 from replica 2
I230608 13:40:42.240333 34335 13@kv/kvserver/store_snapshot.go:1591 ⋮ [T1,n4,s4,r548/3:‹/Table/106/1/119{1187…-3768…}›] 739  streamed INITIAL snapshot 75ebb1f1 at applied index 25 to (n2,s2):4LEARNER with 32 MiB in 0.39s @ 81 MiB/s: kvs=258120 rangeKVs=0, rate-limit: 512 MiB/s, queued: 0.00s
# A config change promoting replica 4 to voter is being proposed
# likely at index 26
I230608 13:40:42.288917 34370 13@kv/kvserver/replica_command.go:2349 ⋮ [T1,n4,replicate,s4,r548/3:‹/Table/106/1/119{1187…-3768…}›] 740  change replicas (add [(n2,s2):4VOTER_INCOMING] remove [(n3,s3):2VOTER_DEMOTING_LEARNER]): existing descriptor r548:‹/Table/106/1/119{118742-376853}› [(n1,s1):1, (n3,s3):2, (n4,s4):3, (n2,s2):4LEARNER, next=5, gen=474]
I230608 13:40:42.300814 34370 13@kv/kvserver/replica_command.go:2349 ⋮ [T1,n4,replicate,s4,r548/3:‹/Table/106/1/119{1187…-3768…}›] 741  change replicas (add [] remove []): existing descriptor r548:‹/Table/106/1/119{118742-376853}› [(n1,s1):1, (n3,s3):2VOTER_DEMOTING_LEARNER, (n4,s4):3, (n2,s2):4VOTER_INCOMING, next=5, gen=475]

# could not acquire latches
W230608 13:40:57.310779 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire read latch ‹/Local/Range/Table/106/1/119118742/RangeDescriptor›@0,0, held by write latch ‹/Local/Range/Table/106/1/119118742/RangeDescriptor›@0,0
W230608 13:40:57.741162 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire write latch ‹/{Local-Meta1/}›@0,0, held by write latch ‹/Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"›@0,0
W230608 13:40:57.741162 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire write latch ‹/Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"›@0,0, held by read latch ‹/Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"›@0,0
W230608 13:40:57.802914 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire write latch ‹/Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"›@0,0, held by write latch ‹/{Local-Meta1/}›@0,0
W230608 13:40:57.852312 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire read latch ‹/Local/Range/Table/106/1/119118742/RangeDescriptor›@0,0, held by write latch ‹/{Local-Meta1/}›@0,0
W230608 13:40:58.303021 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire read latch ‹/Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"›@0,0, held by write latch ‹/{Local-Meta1/}›@0,0
W230608 13:40:58.309024 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire read latch ‹/Local/Range/Table/106/1/119118742/RangeDescriptor›@0,0, held by write latch ‹/{Local-Meta1/}›@0,0
W230608 13:41:21.518763 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire write latch ‹/Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"›@0,0, held by write latch ‹/{Local-Meta1/}›@0,0
W230608 13:41:27.741000 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire write latch ‹/{Local-Meta1/}›@0,0, held by write latch ‹/Local/Range/Table/106/1/119118742/RangeDescriptor›@0,0

# 1 minute later the proposal reports a timeout
I230608 13:41:41.838878 34370 13@kv/kvserver/replicate_queue.go:751 ⋮ [T1,n4,replicate,s4,r548/3:‹/Table/106/1/119{1187…-3768…}›] 778  error processing replica: change replicas of r548 failed: txn exec: context deadline exceeded
E230608 13:41:41.838985 34370 kv/kvserver/queue.go:1146 ⋮ [T1,n4,replicate,s4,r548/3:‹/Table/106/1/119{1187…-3768…}›] 423  operation ‹"replicate queue process replica 548"› timed out after 1m0.002s (given timeout 1m0s): change replicas of r548 failed: txn exec: context deadline exceeded

# 2 seconds later the poisoned latch is discovered
I230608 13:41:43.931478 38260 kv/txn.go:878 ⋮ [T1,n4] 426  async rollback failed: replica unavailable: (n3,s3):2VOTER_DEMOTING_LEARNER unable to serve request to r548:‹/Table/106/1/119{118742-376853}› [(n1,s1):1, (n3,s3):2VOTER_DEMOTING_LEARNER, (n4,s4):3, (n2,s2):4VOTER_INCOMING, next=5, gen=475]: closed timestamp: 1686231639.296513274,0 (2023-06-08 13:40:39); raft status: {"id":"2","term":7,"vote":"2","commit":29,"lead":"2","raftState":"StateLeader","applied":29,"progress":{"1":{"match":0,"next":1,"state":"StateSnapshot"},"2":{"match":52,"next":53,"state":"StateReplicate"},"3":{"match":52,"next":53,"state":"StateReplicate"},"4":{"match":0,"next":1,"state":"StateSnapshot"}},"leadtransferee":"0"}: encountered poisoned latch ‹/Local/Range/Table/106/1/119118742/RangeDescriptor›@0,0: "change-replica" meta={id=97e8d7ec key=/Local/Range/Table/106/1/‹119118742›/‹RangeDescriptor› iso=Serializable pri=0.03170581 epo=0 ts=1686231642.299152484,0 min=1686231642.299152484,0 seq=1} lock=true stat=PENDING rts=1686231642.299152484,0 wto=false gul=1686231642.799152484,0
pav-kv commented 1 year ago

There is a build-up in the snapshot queue on n3. The second snapshot (at index 26) could be blocked behind other snapshots being sent, because the number of pending snapshots at 13:41:30 is still high and the receiver node processes them one at a time.
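
As a toy illustration of that one-at-a-time behavior (a sketch, not the actual snapshot reservation code): with a single receive slot, one long transfer holds up every snapshot queued behind it.

```go
// Toy model of a receiver that admits one snapshot at a time: the big
// transfer holds the single slot, so the small one finishes late even though
// its own transfer is nearly instant.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	slot := make(chan struct{}, 1) // one in-flight snapshot per receiver
	var wg sync.WaitGroup
	apply := func(name string, transfer time.Duration) {
		defer wg.Done()
		start := time.Now()
		slot <- struct{}{} // reserve the receive slot (queue here)
		time.Sleep(transfer)
		<-slot
		fmt.Printf("%s: transfer %v, total wait+transfer %v\n",
			name, transfer, time.Since(start).Round(10*time.Millisecond))
	}
	wg.Add(2)
	go apply("big snapshot", 300*time.Millisecond)
	time.Sleep(10 * time.Millisecond) // ensure the big one grabs the slot first
	go apply("small snapshot", 5*time.Millisecond)
	wg.Wait()
}
```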

(Two screenshots attached, 2023-06-29 and 2023-06-30.)
pav-kv commented 1 year ago
$ cockroach debug merge-logs --filter="initiating a split" --format=crdb-v1 *.unredacted/*log
...

# The range is initially 16 GiB; at 13:37:24 it got reconfigured to a smaller max
# range size and started splitting. The splits happen sequentially, chopping 64 MiB
# off the left each time; the new RHS is then split again, etc.
I230608 13:37:24.016304 21504 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r64/2:‹/{Table/106-Max}›] 162  initiating a split of this range at key /Table/106/1/‹260623› [r74] (‹16 GiB above threshold size 64 MiB›)‹; r64/6 is being probed (may or may not need a Raft snapshot)›
I230608 13:37:24.484801 21598 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r74/2:‹/{Table/106/1/…-Max}›] 166  initiating a split of this range at key /Table/106/1/‹520735› [r75] (‹16 GiB above threshold size 64 MiB›)‹; r74/4 is waiting for a Raft snapshot›
I230608 13:37:24.938527 21659 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r75/2:‹/{Table/106/1/…-Max}›] 171  initiating a split of this range at key /Table/106/1/‹780847› [r76] (‹16 GiB above threshold size 64 MiB›)‹; r75/4 is waiting for a Raft snapshot›
I230608 13:37:25.365611 21859 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r76/2:‹/{Table/106/1/…-Max}›] 174  initiating a split of this range at key /Table/106/1/‹1040959› [r77] (‹16 GiB above threshold size 64 MiB›)‹; r76/4 is waiting for a Raft snapshot›

# The above splits print a "hint" saying something like
#  "r76/4 is waiting for a Raft snapshot"
# After a while, the hint started printing two replicas:
#  "r128/4 is waiting for a Raft snapshot; r128/1 is being probed (may or may not need a Raft snapshot"

I230608 13:37:41.251934 23611 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r128/2:‹/{Table/106/1/…-Max}›] 231  initiating a split of this range at key /Table/106/1/‹14566783› [r129] (‹14 GiB above threshold size 64 MiB›)‹; r128/4 is waiting for a Raft snapshot; r128/1 is being probed (may or may not need a Raft snapshot)›
I230608 13:37:41.562954 23687 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r129/2:‹/{Table/106/1/…-Max}›] 232  initiating a split of this range at key /Table/106/1/‹14826895› [r130] (‹14 GiB above threshold size 64 MiB›)‹; r129/1 is being probed (may or may not need a Raft snapshot); r129/4 is waiting for a Raft snapshot›
I230608 13:37:41.853105 23740 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r130/2:‹/{Table/106/1/…-Max}›] 233  initiating a split of this range at key /Table/106/1/‹15087007› [r131] (‹14 GiB above threshold size 64 MiB›)‹; r130/1 is being probed (may or may not need a Raft snapshot); r130/4 is waiting for a Raft snapshot›
I230608 13:37:42.137711 23729 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r131/2:‹/{Table/106/1/…-Max}›] 234  initiating a split of this range at key /Table/106/1/‹15347119› [r132] (‹14 GiB above threshold size 64 MiB›)‹; r131/1 is being probed (may or may not need a Raft snapshot); r131/4 is waiting for a Raft snapshot›

# There are also occurrences of
#  "r463/4 is being probed (may or may not need a Raft snapshot); r463/1 is being probed (may or may not need a Raft snapshot)"
# and
#  "r470/1 is waiting for a Raft snapshot; r470/4 is waiting for a Raft snapshot"

# At 13:39:53, splits started happening on node 4 (before that they all were on node 3).
I230608 13:39:53.548165 26443 kv/kvserver/replica_command.go:436 ⋮ [T1,n4,split,s4,r470/3:‹/{Table/106/1/…-Max}›] 177  initiating a split of this range at key /Table/106/1/‹102857749› [r484] (‹3.6 GiB above threshold size 64 MiB›)‹; r470/1 is waiting for a Raft snapshot; r470/4 is waiting for a Raft snapshot›
I230608 13:39:53.999167 26432 kv/kvserver/replica_command.go:436 ⋮ [T1,n4,split,s4,r484/3:‹/{Table/106/1/…-Max}›] 179  initiating a split of this range at key /Table/106/1/‹103115860› [r485] (‹3.5 GiB above threshold size 64 MiB›)‹; r484/1 is waiting for a Raft snapshot›
I230608 13:39:54.162182 26457 kv/kvserver/replica_command.go:436 ⋮ [T1,n4,split,s4,r484/3:‹/{Table/106/1/…-Max}›] 181  initiating a split of this range at key /Table/106/1/‹103115860› [r486] (‹3.5 GiB above threshold size 64 MiB›)‹; r484/1 is waiting for a Raft snapshot›
I230608 13:39:54.577882 26549 kv/kvserver/replica_command.go:436 ⋮ [T1,n4,split,s4,r486/3:‹/{Table/106/1/…-Max}›] 182  initiating a split of this range at key /Table/106/1/‹103373971› [r487] (‹3.5 GiB above threshold size 64 MiB›)‹; r486/1 is waiting for a Raft snapshot›
I230608 13:39:54.951504 26604 kv/kvserver/replica_command.go:436 ⋮ [T1,n4,split,s4,r487/3:‹/{Table/106/1/…-Max}›] 184  initiating a split of this range at key /Table/106/1/‹103632082› [r488] (‹3.5 GiB above threshold size 64 MiB›)‹; r487/1 is waiting for a Raft snapshot›

# All the way until the last split at 13:40:27.
I230608 13:40:27.662935 32236 kv/kvserver/replica_command.go:436 ⋮ [T1,n4,split,s4,r596/3:‹/{Table/106/1/…-Max}›] 373  initiating a split of this range at key /Table/106/1/‹131766181› [r597] (‹80 MiB above threshold size 64 MiB›)‹; r596/1 is waiting for a Raft snapshot›
pav-kv commented 1 year ago

There is a large 3.6 GiB snapshot for r470 that node 4 was streaming to node 1 for 38.62s (from ~13:39:47 till 13:40:25):

I230608 13:40:25.460219 25275 13@kv/kvserver/store_snapshot.go:1591 ⋮ [T1,n4,s4,r470/3:‹/Table/106/1/102{5996…-8577…}›] 471  streamed VIA_SNAPSHOT_QUEUE snapshot 6a8ead93 at applied index 10 to (n1,s1):1 with 3.6 GiB in 38.62s @ 95 MiB/s: kvs=29553207 rangeKVs=0, rate-limit: 512 MiB/s, queued: 0.40s
I230608 13:40:28.088780 357748 13@kv/kvserver/replica_raftstorage.go:503 ⋮ [T1,n1,s1,r470/1:{-}] 3218  applying VIA_SNAPSHOT_QUEUE snapshot 6a8ead93 from (n4,s4):3 at applied index 10
I230608 13:40:28.103633 357748 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n1,s1,r470/1:‹/{Table/106/1/…-Max}›] 1060  applied VIA_SNAPSHOT_QUEUE snapshot 6a8ead93 from (n4,s4):3 at applied index 10 (total=15ms data=3.2 GiB ingestion=6@13ms)

This was blocking all other snapshots incoming to node 1, which we see in the logs by filtering for "applied .* snapshot" messages:

# steady pace
I230608 13:39:46.429279 357687 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n1,s1,r213/1:‹/Table/106/1/36{265111-523222}›] 1054  applied VIA_SNAPSHOT_QUEUE snapshot 7fc2365c from (n3,s3):2 at applied index 48 (total=4ms data=29 MiB ingestion=6@2ms)
I230608 13:39:46.834671 357741 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n1,s1,r214/1:‹/Table/106/1/36{523222-781333}›] 1055  applied VIA_SNAPSHOT_QUEUE snapshot 3edf8a9a from (n3,s3):2 at applied index 47 (total=7ms data=29 MiB ingestion=6@4ms)

# big gap in timestamps
I230608 13:40:28.103633 357748 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n1,s1,r470/1:‹/{Table/106/1/…-Max}›] 1060  applied VIA_SNAPSHOT_QUEUE snapshot 6a8ead93 from (n4,s4):3 at applied index 10 (total=15ms data=3.2 GiB ingestion=6@13ms)

# steady pace again
I230608 13:40:28.570132 360929 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n1,s1,r216/1:‹/Table/106/1/37{039444-297555}›] 1064  applied VIA_SNAPSHOT_QUEUE snapshot 66a15d5d from (n3,s3):2 at applied index 55 (total=32ms data=29 MiB ingestion=6@17ms)
I230608 13:40:28.953381 362898 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n1,s1,r217/1:‹/Table/106/1/37{297555-555666}›] 1070  applied VIA_SNAPSHOT_QUEUE snapshot a5f938e9 from (n3,s3):2 at applied index 60 (total=6ms data=29 MiB ingestion=6@4ms)
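
A quick back-of-the-envelope check of the gap above (a sketch using the sizes and the ~95 MiB/s stream rate from the logs, not CockroachDB code): the 3.6 GiB snapshot alone occupies the receiver for roughly 39s, and everything queued behind it inherits that delay.

```go
// Rough timing model for a receiver that applies snapshots one at a time,
// using the sizes and stream rate observed in the logs above.
package main

import (
	"fmt"
	"time"
)

func main() {
	const rateMiBps = 95.0
	queue := []struct {
		name string
		miB  float64
	}{
		{"r470 (3.6 GiB)", 3.6 * 1024},
		{"r216 (29 MiB)", 29},
		{"r217 (29 MiB)", 29},
	}
	var elapsed time.Duration
	for _, s := range queue {
		elapsed += time.Duration(s.miB / rateMiBps * float64(time.Second))
		fmt.Printf("%-16s applied after ~%v\n", s.name, elapsed.Round(time.Second))
	}
	// r216 and r217 need well under a second each, but finish ~39s late
	// because they queued behind r470 - matching the gap between 13:39:46
	// and 13:40:28 in the log.
}
```
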
pav-kv commented 1 year ago

This r470 range is the first one that node 4 split (see the log above). Before that, the splits were driven by node 3; likely there was a shift in leadership/leaseholder.

pav-kv commented 1 year ago

We're likely seeing the cascading effect described in https://github.com/cockroachdb/cockroach/blob/87d6547dba9451e021e5a8872773fb821a5ea7cc/pkg/kv/kvserver/split_delay_helper.go#L109-L148

Notably, though, splits driven by the split queue are not "delayable" (see here and the other 2 invocations below it), i.e. this workaround is not active.

We should probably enable these delays.
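
For illustration, a rough sketch of what making these splits delayable would mean (hypothetical helper, not the real split_delay_helper API): before proposing a split, wait a bounded amount of time for all followers to leave StateProbe/StateSnapshot, so the split doesn't immediately mint another snapshot-needing right-hand side.

```go
// Rough sketch of a "delayable" split: wait a bounded time for followers to
// be in StateReplicate before splitting, then give up and split anyway.
package main

import (
	"context"
	"fmt"
	"time"
)

type progressState int

const (
	stateReplicate progressState = iota
	stateProbe
	stateSnapshot
)

// followerStates would come from the range's raft status in the real system.
func maybeDelaySplit(ctx context.Context, followerStates func() []progressState, maxDelay time.Duration) {
	deadline := time.Now().Add(maxDelay)
	for time.Now().Before(deadline) {
		healthy := true
		for _, s := range followerStates() {
			if s != stateReplicate {
				healthy = false
				break
			}
		}
		if healthy {
			return // safe to split now
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Second):
		}
	}
	// Give up after maxDelay (~20s in the real helper) and split anyway.
}

func main() {
	states := []progressState{stateReplicate, stateSnapshot}
	maybeDelaySplit(context.Background(), func() []progressState { return states }, 3*time.Second)
	fmt.Println("proceeding with split")
}
```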

tbg commented 1 year ago

Synopsis:

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication

tbg commented 1 year ago

Going through the logs again just to button things up.

The initial large range is r64.

First, r64 gets hit by the randomized backup in the roachtest. Oops! The range is 16 GiB at this point. We apply a 5min timeout to this export somewhere. So if you ever have a large range like this, backups are not going to work.[^1]

[^1]: filed https://github.com/cockroachdb/cockroach/issues/107519.

Details

```
W230608 13:24:12.697833 142988 kv/kvclient/kvcoord/dist_sender.go:1710 ⋮ [T1,n1,job=872152117891235841] 614 slow range RPC: have been waiting 533.30s (1 attempts) for RPC Export [/Table/106/1,/Table/106/2), [wait-policy: Error] to r64:‹/{Table/106-Max}› [(n1,s1):1, next=2, gen=1]; resp: ‹(err: context deadline exceeded)›
W230608 13:24:12.698705 142988 kv/kvclient/kvcoord/dist_sender.go:1720 ⋮ [T1,n1,job=872152117891235841] 615 slow RPC response: slow RPC finished after 533.36s (1 attempts)
E230608 13:24:12.698774 142988 ccl/backupccl/backup_processor.go:498 ⋮ [T1,n1,job=872152117891235841] 616 failed export request for span ‹/Table/106/{1-2}›
E230608 13:24:12.698774 142988 ccl/backupccl/backup_processor.go:498 ⋮ [T1,n1,job=872152117891235841] 616 + trace:
E230608 13:24:12.698774 142988 ccl/backupccl/backup_processor.go:498 ⋮ [T1,n1,job=872152117891235841] 616 + 0.000ms 0.000ms === operation:backupccl.ExportRequest
E230608 13:24:12.698774 142988 ccl/backupccl/backup_processor.go:498 ⋮ [T1,n1,job=872152117891235841] 616 + 0.000ms 0.000ms [dist sender send: {count: 1, duration 8m53s}]
E230608 13:24:12.698774 142988 ccl/backupccl/backup_processor.go:498 ⋮ [T1,n1,job=872152117891235841] 616 + 0.000ms 0.000ms [/cockroach.roachpb.Internal/Batch: {count: 1, duration 8m53s}]
E230608 13:24:12.698774 142988 ccl/backupccl/backup_processor.go:498 ⋮ [T1,n1,job=872152117891235841] 616 + 0.000ms 0.000ms [evalExport: {count: 1, duration 8m53s}]
E230608 13:24:12.698774 142988 ccl/backupccl/backup_processor.go:498 ⋮ [T1,n1,job=872152117891235841] 616 + 0.000ms 0.000ms [storage.MVCCExportToSST: {count: 1, duration 8m53s}]
W230608 13:24:12.706125 142984 ccl/backupccl/backup_processor.go:372 ⋮ [T1,n1,job=872152117891235841] 617 failed to flush SST sink: context canceled
W230608 13:24:12.722142 142984 ccl/backupccl/backup_processor.go:662 ⋮ [T1,n1,job=872152117891235841] 618 failed to close SST sink: ‹context canceled›
E230608 13:24:12.779877 142171 jobs/registry.go:1599 ⋮ [T1,n1] 619 BACKUP job 872152117891235841: stepping through state reverting with error: failed to run backup: exporting 16 ranges: export request timeout: operation ‹"ExportRequest for span /Table/106/{1-2}"› timed out after 8m53.36s (given timeout 5m0s): context deadline exceeded
```

Then r64 successfully adds a replica on n3 (the snapshot takes ~260s and streams at ~63 MB/s, i.e. right at the max configured rate). Next, n1 adds a learner, but that learner then gets removed by n3, which apparently now holds the lease. From that we can infer that there's likely still a LEARNER snapshot inflight from n1->n2 in what follows below. n3 then tries to add its own LEARNER on n2 instead (replID 4):

I230608 13:32:39.491927 11658 13@kv/kvserver/replica_raft.go:387 ⋮ [T1,n3,s3,r64/2:‹/{Table/106-Max}›] 26 proposing SIMPLE(l4) [(n2,s2):4LEARNER]: after=[(n1,s1):1 (n3,s3):2 (n2,s2):4LEARNER] next=5

We see in the logs that this overlaps with another attempt at exporting r64 because ~2 minutes later, we see this log line:

W230608 13:34:00.041299 275528 kv/kvclient/kvcoord/dist_sender.go:1710 ⋮ [T1,n1,f‹2866f611›,job=872155068781232130,distsql.gateway=2,distsql.appname=‹$ internal-resume-job-872155068781232130›] 792 slow range RPC: have been waiting 230.44s (1 attempts) for RPC Export [/Table/106/1,/Table/106/2), [wait-policy: Error] to r64:‹/{Table/106-Max}› [(n1,s1):1, (n3,s3):2LEARNER, next=3, gen=2]; resp: ‹(err: ), *kvpb.ExportRespo nse›

Less than a minute later, at around 13:34:43, we see n3 roll back the learner because sending the snapshot failed:

giving up during snapshot reservation due to "kv.snapshot_receiver.reservation_queue_timeout_fraction": context deadline exceeded

This is congruent with our earlier observation that there might still be a snapshot n1->n2 inflight - that snapshot would delay n3's.

Finally, at 13:36:49, n1's snapshot fails with "Raft group deleted"; makes sense since that Replica is long stale.

There is some more upreplication, and it looks like the lease transfers (?) again from n3 to n1, and they step on each other's toes more:

I230608 13:39:43.315732 20329 kv/kvserver/replica_command.go:1109 ⋮ [T1,n3,replicate,s3,r64/2:‹/Table/106{-/1/2606…}›] 586 could not promote [n5,s5] to voter, rolling back: change replicas of r64 failed: descriptor changed: [expected] r64:‹/{Table/106-Max}› [(n1,s1):1, (n3,s3):2, (n4,s4):5, (n5,s5):6LEARNER, next=7, gen=10] != [actual] r64:‹/Table/106{-/1/260623}› [(n1,s1):1, (n3,s3):2, (n4,s4):5, (n5,s5):6LEARNER, next=7, gen=11]

In particular, we hit this helpful logging:

I230608 13:37:24.016304 21504 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r64/2:‹/{Table/106-Max}›] 162 initiating a split of this range at key /Table/106/1/‹260623› [r74] (‹16 GiB above threshold size 64 MiB›)‹; r64/6 is being probed (may or may not need a Raft snapshot)›
...
I230608 13:39:43.453866 20487 kv/kvserver/replica_gc_queue.go:307 ⋮ [T1,n5,replicaGC,s5,r64/6:‹/{Table/106-Max}›] 189 removing replica with pending split; will incur Raft snapshot for right hand side

Thanks to 2018 Tobias, we have a helpful comment[^2]. Basically, if we split a range and then remove a replica that hasn't applied the split yet, the removal wipes out the would-be right hand side of the split. In other words, there's now a replica in the RHS that needs a snapshot - in our case, r74.

The problem is now that we're carrying out tons of splits due to range size, and they're not delayable:

https://github.com/cockroachdb/cockroach/blob/2b91c3829270eb512c5380201c26a3d838fc567a/pkg/kv/kvserver/split_queue.go#L261-L274

Details

The way splits work is that they carve chunks off the LHS. So r74 will be split again, its RHS will be split again, etc., and each such split causes another snapshot. It would have helped here if these splits had been delayable, but since they delay by at most ~20s and the snapshots take "minutes", it likely would've slowed the problem down but not eliminated it.[^3][^4]

One might try to argue that aggressively replicaGC'ing r64/6 was the issue. But note that r64/6 had been removed from the range, so it wouldn't necessarily even be caught up to the split in the first place, yet it would block snapshots to its keyspace. So it does need to get removed ASAP.

What could have been delayed is removing r64/6 from the descriptor; but how was the replicate queue on another node supposed to know that that replica was still pending a split? It could in theory have known - the split and the removal of the learner are both driven by the leaseholder, so, similarly to how we coordinate snapshots between the replicate and raft snapshot queues, the replicate queue could back off if there's an ongoing split attempt (and the split queue would have to wait until all followers have applied the split). I am not advocating for actually doing this, since all of these dependencies between queues are a problem in and of themselves. But there is a compelling argument that the independence of these queues is a problem.

At this point we're flooding the raft snapshot queue with requests that it will be very slow to process. They also follow a pathological pattern due to the sequence of splits: each inflight snapshot will, by the time it arrives, create a replica that requires a ton more splits in order to be up to date, and which will block all but the snapshot for the very first split in that wide range. This means that most of the time the raft snapshot queue will just cycle through replicas, getting rejections for all but ~1 replica, loop around, and so on, each time wasting the 10min scanner interval. In other words, the backlog will clear out very slowly.
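
To put a rough lower bound on that backlog (a sketch using the 16 GiB starting size and 64 MiB threshold quoted above; the logs show splits running from r74 up past r596, i.e. even more ranges):

```go
// Back-of-the-envelope count of the split cascade (assumed figures: 16 GiB
// starting size, 64 MiB target size, as described above).
package main

import "fmt"

func main() {
	const rangeSizeMiB = 16 * 1024
	const targetSizeMiB = 64
	fmt.Println("sequential splits needed:", rangeSizeMiB/targetSizeMiB) // 256
	// Each of these splits can leave another RHS replica that is only
	// recoverable via a Raft snapshot on any store whose replica missed the
	// split, so the raft snapshot queue ends up with hundreds of candidates
	// while accepting roughly one per pass of the 10min scanner interval.
}
```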

But we haven't lost quorum at this point. This happens only later, when we make this ill-fated conf change:

I230608 13:40:42.296384 72105 13@kv/kvserver/replica_raft.go:387 ⋮ [T1,n3,s3,r548/2:‹/Table/106/1/119{1187…-3768…}›] 1651 proposing ENTER_JOINT(r2 l2 v4) [(n2,s2):4VOTER_INCOMING], [(n3,s3):2VOTER_DEMOTING_LEARNER ]: after=[(n1,s1):1 (n3,s3):2VOTER_DEMOTING_LEARNER (n4,s4):3 (n2,s2):4VOTER_INCOMING] next=5

The config is (s1, s3, s4) && (s1, s2, s4) but when the replica circuit breaker trips a minute later, we see that the replicas on s1 and s2 both need a snapshot. s1's need for a snapshot was likely created in the way described above.

E230608 13:41:43.916584 90026 kv/kvserver/replica_circuit_breaker.go:175 ⋮ [T1,n3,s3,r548/2:‹/Table/106/1/119{1187…-3768…}›] 841  breaker: tripped with error: replica unavailable: (n3,s3):2VOTER_DEMOTING_LEARNER unable to serve request to r548:‹/Table/106/1/119{118742-376853}› [(n1,s1):1, (n3,s3):2VOTER_DEMOTING_LEARNER, (n4,s4):3, (n2,s2):4VOTER_INCOMING, next=5, gen=475]: closed timestamp: 1686231639.296513274,0 (2023-06-08 13:40:39); raft status: {"id":"2","term":7,"vote":"2","commit":29,"lead":"2","raftState":"StateLeader","applied":29,"progress":{"1":{"match":0,"next":1,"state":"StateSnapshot"},"2":{"match":50,"next":51,"state":"StateReplicate"},"3":{"match":50,"next":52,"state":"StateReplicate"},"4":{"match":0,"next":1,"state":"StateSnapshot"}},"leadtransferee":"0"}: have been waiting 62.00s for slow proposal ConditionalPut [/Local/Range/Table/106/1/‹119118742›/‹RangeDescriptor›,/Min), [txn: 97e8d7ec]

s2 is just being promoted, and this is what causes the loss of quorum. But the promotion was carried out by the replicate queue - didn't it just add the learner and send it a snapshot? How is s2 snapshotless now (note the Match of zero above)? It did apply it:

I230608 13:40:42.284060 34141 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n2,s2,r548/4:‹/Table/106/1/119{1187…-3768…}›] 303 applied INITIAL snapshot 75ebb1f1 from (n4,s4):3 at applied index 25 (total=9ms data=29 MiB ingestion=6@3ms)

The debug.zip also shows this replica present on s2.

So this appears to be a failure in the replication layer - the data is there, but Raft is sticking to StateSnapshot even though it could be replicating to the follower.

I've written elsewhere[^6] about the oddities of raft snapshot handling. It's just generally a little brittle and especially assumes much closer coupling between raft and the snapshots than is present in CRDB.

Pure conjecture, but possible the following happened here:

[^4]: besides, if the range is very large that might cause problems in itself, like stalling writes, and it may not be wise to delay splits.

[^6]: I think it would move it back, but need to check.

tbg commented 1 year ago

Pure conjecture, but possible the following happened here:

I think this might be fixed as of https://github.com/cockroachdb/cockroach/pull/106793/files#diff-1e9884dfbcc4eb6d90a152e1023fd2a84dcda68f04a877be8a01c16df0fe0331R2888. Receiving the MsgAppResp on the leader ought to be almost enough, and that PR achieves that. The one remaining edge case is the one fixed by https://github.com/etcd-io/raft/pull/84, so once we have that in as well (tracked in https://github.com/cockroachdb/cockroach/issues/106813), I think we can consider closing this test failure issue.
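
For intuition, here is a rough sketch of the leader-side rule those changes converge on (hypothetical types, not the etcd-io/raft code; as I understand it, the stricter historical rule waited for match >= pendingSnapshot): once a follower tracked as StateSnapshot acks, via MsgAppResp, an index from which the leader can continue appending out of its own log, there is no reason to keep it in StateSnapshot.

```go
// Sketch of the leader-side transition out of StateSnapshot (hypothetical
// types; not the actual etcd-io/raft code). The relaxed rule leaves
// StateSnapshot as soon as the follower's acked index lets the leader append
// from its own log.
package main

import "fmt"

type progress struct {
	state           string
	match           uint64
	pendingSnapshot uint64
}

func onAppResp(pr *progress, ackedIndex, leaderFirstIndex uint64) {
	if ackedIndex > pr.match {
		pr.match = ackedIndex
	}
	// Relaxed rule: the follower's log can be continued by normal appends.
	if pr.state == "StateSnapshot" && pr.match+1 >= leaderFirstIndex {
		pr.state = "StateReplicate"
	}
}

func main() {
	// r548/4 on s2 applied the INITIAL snapshot at index 25, yet the leader
	// still tracked it as StateSnapshot with match=0.
	pr := &progress{state: "StateSnapshot", match: 0, pendingSnapshot: 25}
	// Per the range status above the log was truncated at index 10, so the
	// leader's log starts at 11 and an ack at 25 is enough to resume.
	onAppResp(pr, 25, 11)
	fmt.Println(pr.state, pr.match) // StateReplicate 25
}
```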

tbg commented 1 year ago

Filed https://github.com/cockroachdb/cockroach/issues/107520 to track the general poor interplay between splits and snapshots as triggered by replica movement in this test. It's something we (collectively) know about and this has been "a thing" for years, but it's hard to track since it lives in the spaces in between components. That issue will have to do.

dt commented 1 year ago

We apply a 5min timeout to this export somewhere. So if you ever have a large range like this, backups are not going to work

I don't follow this assertion. Individual ExportRequests, which are what have the 5min timeout, are paginated to a fixed 16 MiB regardless of the underlying range size, so I don't think there is an automatic assumption that a bigger range should expect to hit this timeout. The only mechanism that comes to mind right now would be very few keys in the range above the requested timestamp for an incremental ExportRequest, so the request iterates extensively without finding keys that match the timestamp predicate to return, meaning it doesn't hit the 16 MiB limit. We can discuss that more over in #107519 if we think that's what's up, but I certainly don't think we expect a large range to need a longer per-req timeout.
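
For scale, a quick back-of-the-envelope calculation (a sketch based on the figures quoted in this thread; the 16 MiB pagination and the 5min per-request timeout are taken from the discussion above, not re-verified against the code):

```go
// Rough arithmetic for the discussion above: pagination bounds the work per
// ExportRequest, so range size alone multiplies the number of requests rather
// than the duration of any single one.
package main

import "fmt"

func main() {
	const rangeSizeMiB = 16 * 1024 // the ~16 GiB range from this test
	const pageSizeMiB = 16         // per-request pagination limit (from the comment above)
	fmt.Println("paginated ExportRequests:", rangeSizeMiB/pageSizeMiB) // 1024
	// Each request returns at most ~16 MiB, so a large range should not push
	// an individual request past its 5min timeout; only a request that scans
	// a lot of data without returning much (e.g. an incremental export over
	// mostly-old keys) would.
}
```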

tbg commented 1 year ago

You just know more than I do; feel free to re-title https://github.com/cockroachdb/cockroach/issues/107519 or close it if this doesn't capture anything new! I only skimmed the surface when looking at the export and didn't dig deeper into how it works under the hood.

tbg commented 1 year ago

Closing, per https://github.com/cockroachdb/cockroach/issues/104588#issuecomment-1649435465 and https://github.com/cockroachdb/cockroach/issues/106813, on the assumption that https://github.com/cockroachdb/cockroach/pull/106793 practically removes the erroneous StateSnapshot-despite-having-received-snapshot.