cockroach-teamcity opened this issue 5 days ago
The full trace and some log snippets are below. The logs contain a few messages about unexpected errors during replication, so I'm checking whether KV has any input on whether those are related to the failure.
I'm also attaching the test artifacts here so they aren't lost: 2024_11_10T06_23_47_000Z_pkg_ccl_multiregionccl_multiregionccl_test_test.outputs__outputs.zip
Generally, these errors only crop up when the system is extremely slow and there is a large time delta between starting a replication action (here, a removal) and that action actually taking place. In this case, the action failed earlier, and on a retry the coordinator (the leaseholder) assumed a stale ReplicaID for the coordinating non-voter:
(n8,s8):10NON_VOTER in sender descriptor r69:‹/{Table/108-Max}› [(n2,s2):9, (n3,s3):2, (n1,s1):8, (n7,s7):6NON_VOTER, (n8,s8):11NON_VOTER, next=12, gen=33]
Note how the ReplicaID differs: the sender believes it is replica 10, while the descriptor lists 11. This does appear related to the test failure, as the trace shows the same rangeID.
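As an aside, here is a minimal sketch of the kind of staleness check involved, using simplified hypothetical types rather than CockroachDB's actual roachpb descriptors; it mirrors the mismatch in the trace above (sender thinks 10, descriptor says 11):

```go
package main

import "fmt"

// ReplicaDescriptor is a simplified, hypothetical stand-in for the real
// roachpb.ReplicaDescriptor.
type ReplicaDescriptor struct {
	NodeID, StoreID, ReplicaID int
}

// RangeDescriptor holds the authoritative replica set for a range.
type RangeDescriptor struct {
	RangeID  int
	Replicas []ReplicaDescriptor
}

// checkSenderReplica reports an error when the sender's cached identity no
// longer matches the descriptor, i.e. the sender is coordinating with a
// stale ReplicaID and must refresh its view before retrying.
func checkSenderReplica(sender ReplicaDescriptor, desc RangeDescriptor) error {
	for _, r := range desc.Replicas {
		if r.NodeID == sender.NodeID && r.StoreID == sender.StoreID {
			if r.ReplicaID != sender.ReplicaID {
				return fmt.Errorf("r%d: sender (n%d,s%d):%d has stale ReplicaID, descriptor has %d",
					desc.RangeID, sender.NodeID, sender.StoreID, sender.ReplicaID, r.ReplicaID)
			}
			return nil
		}
	}
	return fmt.Errorf("r%d: sender (n%d,s%d) not in descriptor", desc.RangeID, sender.NodeID, sender.StoreID)
}

func main() {
	desc := RangeDescriptor{RangeID: 69, Replicas: []ReplicaDescriptor{{NodeID: 8, StoreID: 8, ReplicaID: 11}}}
	sender := ReplicaDescriptor{NodeID: 8, StoreID: 8, ReplicaID: 10} // stale view, as in the trace above
	fmt.Println(checkSenderReplica(sender, desc))
}
```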
The normal logs also suggest overload:
W241110 06:31:18.805031 107180 2@rpc/clock_offset.go:286 ⋮ [T1,Vsystem,n3,rnode=2,raddr=‹127.0.0.1:34273›,class=default,rpc] 3672 latency jump (prev avg 34.34ms, current 71.01ms)
W241110 06:31:18.805179 18932 2@rpc/clock_offset.go:286 ⋮ [T1,Vsystem,n9,rnode=1,raddr=‹127.0.0.1:35975›,class=rangefeed,rpc] 3673 latency jump (prev avg 38.22ms, current 69.87ms)
W241110 06:31:18.842337 12457 2@rpc/clock_offset.go:286 ⋮ [T1,Vsystem,n7,rnode=2,raddr=‹127.0.0.1:34273›,class=system,rpc] 3674 latency jump (prev avg 29.20ms, current 98.95ms)
W241110 06:31:18.842394 69560 2@rpc/clock_offset.go:286 ⋮ [T1,Vsystem,n3,rnode=8,raddr=‹127.0.0.1:46149›,class=system,rpc] 3675 latency jump (prev avg 33.54ms, current 98.85ms)
W241110 06:31:18.842443 77407 2@rpc/clock_offset.go:286 ⋮ [T1,Vsystem,n4,rnode=5,raddr=‹127.0.0.1:37669›,class=rangefeed,rpc] 3676 latency jump (prev avg 33.32ms, current 98.71ms)
W241110 06:31:18.842468 9702 2@rpc/clock_offset.go:286 ⋮ [T1,Vsystem,n4,rnode=6,raddr=‹127.0.0.1:36015›,class=system,rpc] 3677 latency jump (prev avg 44.99ms, current 99.24ms)
W241110 06:31:18.864191 10194 2@rpc/clock_offset.go:286 ⋮ [T1,Vsystem,n6,rnode=4,raddr=‹127.0.0.1:33547›,class=rangefeed,rpc] 3678 latency jump (prev avg 33.02ms, current 80.50ms)
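For context, these warnings track a moving average of observed RPC latency and fire when a sample jumps well above it. A rough sketch of the idea (not the actual clock_offset.go code; the 2x threshold and the smoothing factor are assumptions here) looks like this:

```go
package main

import (
	"fmt"
	"time"
)

// latencyTracker keeps an exponentially weighted moving average of RPC
// latency samples and flags samples that jump well above the running
// average, similar in spirit to the warnings above.
type latencyTracker struct {
	avg   time.Duration
	alpha float64 // smoothing factor (assumed value for this sketch)
}

// observe folds a sample into the average and reports whether it jumped
// more than 2x above the previous average.
func (lt *latencyTracker) observe(sample time.Duration) (jump bool, prevAvg time.Duration) {
	prevAvg = lt.avg
	if lt.avg == 0 {
		lt.avg = sample
		return false, prevAvg
	}
	jump = sample > 2*lt.avg
	lt.avg = time.Duration(lt.alpha*float64(sample) + (1-lt.alpha)*float64(lt.avg))
	return jump, prevAvg
}

func main() {
	lt := latencyTracker{alpha: 0.2}
	for _, s := range []time.Duration{34 * time.Millisecond, 35 * time.Millisecond, 71 * time.Millisecond} {
		if jump, prev := lt.observe(s); jump {
			fmt.Printf("latency jump (prev avg %v, current %v)\n", prev, s)
		}
	}
}
```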
I'd suggest lowering the load somehow.
This seems very similar to what we saw in https://github.com/cockroachdb/cockroach/issues/133516#issuecomment-2445212381. @rafiss, as a pointer, it might make sense to extend `wait-for-zone-config-changes` to also ensure that a replica was successfully upreplicated.
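As a rough sketch of what that extension could look like (the helper names `succeedsSoon` and `replicaOnNode` here are hypothetical stand-ins, not the test's actual functions), the datadriven command would poll until the expected replica shows up in the range descriptor:

```go
package main

import (
	"fmt"
	"time"
)

// succeedsSoon retries fn until it returns nil or the timeout expires,
// mirroring the shape of testutils.SucceedsSoon, which the real test
// helper could use directly.
func succeedsSoon(timeout time.Duration, fn func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := fn()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition not met within %v: %w", timeout, err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}

// replicaOnNode is a hypothetical predicate standing in for a real check
// of the range descriptor (e.g. via SHOW RANGES or a KV-level lookup)
// that the expected (non-)voter has been upreplicated.
func replicaOnNode(nodeID int) error {
	return fmt.Errorf("replica not yet present on n%d", nodeID)
}

func main() {
	// wait-for-zone-config-changes could additionally block on this
	// before the test issues queries that expect a local replica.
	if err := succeedsSoon(time.Second, func() error { return replicaOnNode(8) }); err != nil {
		fmt.Println(err)
	}
}
```

Polling like this would make the test robust to slow upreplication under overload, rather than assuming the replica exists as soon as the zone config change is visible.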
Assigning this back to SQL Foundations as the owners of this test, but I'll remove the release-blocker label, given this isn't one.
ccl/multiregionccl.TestMultiRegionDataDriven_regional_by_table failed on release-24.1 @ e5f8e9cc5c6fd09ac54a8cc087f9855e3e6d772d:
Parameters:
- attempt=1
- run=22
- shard=4
See also: How To Investigate a Go Test Failure (internal)
Same failure on other branches
- #132041 ccl/multiregionccl: TestMultiRegionDataDriven_regional_by_table failed [query is not served locally even though it's running on a non-voter with AOST] [C-bug C-test-failure O-robot P-2 T-kv branch-release-24.2]
/cc @cockroachdb/sql-foundations
This test on roachdash | Improve this report!
Jira issue: CRDB-44249