cockroach-teamcity opened this issue 5 days ago
The full trace and some log snippets are below. The logs contain a few messages about unexpected errors during replication, so I'm checking whether KV has any input on whether those are related to the failure.
I'm also attaching the test artifacts here so they aren't lost: 2024_11_10T06_23_47_000Z_pkg_ccl_multiregionccl_multiregionccl_test_test.outputs__outputs.zip
Generally, these errors only crop up when the system is extremely slow and there is a large time delta between starting a replication action (here, a removal) and that action actually taking place. In this case, the action failed earlier, and on a retry the coordinator (the leaseholder) assumed a stale ReplicaID for the coordinating non-voter:
(n8,s8):10NON_VOTER in sender descriptor r69:‹/{Table/108-Max}› [(n2,s2):9, (n3,s3):2, (n1,s1):8, (n7,s7):6NON_VOTER, (n8,s8):11NON_VOTER, next=12, gen=33]
Note how the ReplicaID differs: the sender believes it is replica 10, while the descriptor lists 11. This does appear related to the test failure, as the trace shows the same rangeID.
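As an aside, here is a minimal sketch of the kind of staleness check involved, using simplified hypothetical types rather than CockroachDB's actual roachpb descriptors; it mirrors the mismatch in the trace above (sender thinks 10, descriptor says 11):

```go
package main

import "fmt"

// ReplicaDescriptor is a simplified, hypothetical stand-in for the real
// roachpb.ReplicaDescriptor.
type ReplicaDescriptor struct {
	NodeID, StoreID, ReplicaID int
}

// RangeDescriptor holds the authoritative replica set for a range.
type RangeDescriptor struct {
	RangeID  int
	Replicas []ReplicaDescriptor
}

// checkSenderReplica reports an error when the sender's cached identity no
// longer matches the descriptor, i.e. the sender is coordinating with a
// stale ReplicaID and must refresh its view before retrying.
func checkSenderReplica(sender ReplicaDescriptor, desc RangeDescriptor) error {
	for _, r := range desc.Replicas {
		if r.NodeID == sender.NodeID && r.StoreID == sender.StoreID {
			if r.ReplicaID != sender.ReplicaID {
				return fmt.Errorf("r%d: sender (n%d,s%d):%d has stale ReplicaID, descriptor has %d",
					desc.RangeID, sender.NodeID, sender.StoreID, sender.ReplicaID, r.ReplicaID)
			}
			return nil
		}
	}
	return fmt.Errorf("r%d: sender (n%d,s%d) not in descriptor", desc.RangeID, sender.NodeID, sender.StoreID)
}

func main() {
	desc := RangeDescriptor{RangeID: 69, Replicas: []ReplicaDescriptor{{NodeID: 8, StoreID: 8, ReplicaID: 11}}}
	sender := ReplicaDescriptor{NodeID: 8, StoreID: 8, ReplicaID: 10} // stale view, as in the trace above
	fmt.Println(checkSenderReplica(sender, desc))
}
```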
The normal logs also suggest overload:
W241110 06:31:18.805031 107180 2@rpc/clock_offset.go:286 ⋮ [T1,Vsystem,n3,rnode=2,raddr=‹127.0.0.1:34273›,class=default,rpc] 3672 latency jump (prev avg 34.34ms, current 71.01ms)
W241110 06:31:18.805179 18932 2@rpc/clock_offset.go:286 ⋮ [T1,Vsystem,n9,rnode=1,raddr=‹127.0.0.1:35975›,class=rangefeed,rpc] 3673 latency jump (prev avg 38.22ms, current 69.87ms)
W241110 06:31:18.842337 12457 2@rpc/clock_offset.go:286 ⋮ [T1,Vsystem,n7,rnode=2,raddr=‹127.0.0.1:34273›,class=system,rpc] 3674 latency jump (prev avg 29.20ms, current 98.95ms)
W241110 06:31:18.842394 69560 2@rpc/clock_offset.go:286 ⋮ [T1,Vsystem,n3,rnode=8,raddr=‹127.0.0.1:46149›,class=system,rpc] 3675 latency jump (prev avg 33.54ms, current 98.85ms)
W241110 06:31:18.842443 77407 2@rpc/clock_offset.go:286 ⋮ [T1,Vsystem,n4,rnode=5,raddr=‹127.0.0.1:37669›,class=rangefeed,rpc] 3676 latency jump (prev avg 33.32ms, current 98.71ms)
W241110 06:31:18.842468 9702 2@rpc/clock_offset.go:286 ⋮ [T1,Vsystem,n4,rnode=6,raddr=‹127.0.0.1:36015›,class=system,rpc] 3677 latency jump (prev avg 44.99ms, current 99.24ms)
W241110 06:31:18.864191 10194 2@rpc/clock_offset.go:286 ⋮ [T1,Vsystem,n6,rnode=4,raddr=‹127.0.0.1:33547›,class=rangefeed,rpc] 3678 latency jump (prev avg 33.02ms, current 80.50ms)
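For context, these warnings track a moving average of observed RPC latency and fire when a sample jumps well above it. A rough sketch of the idea (not the actual clock_offset.go code; the 2x threshold and the smoothing factor are assumptions here) looks like this:

```go
package main

import (
	"fmt"
	"time"
)

// latencyTracker keeps an exponentially weighted moving average of RPC
// latency samples and flags samples that jump well above the running
// average, similar in spirit to the warnings above.
type latencyTracker struct {
	avg   time.Duration
	alpha float64 // smoothing factor (assumed value for this sketch)
}

// observe folds a sample into the average and reports whether it jumped
// more than 2x above the previous average.
func (lt *latencyTracker) observe(sample time.Duration) (jump bool, prevAvg time.Duration) {
	prevAvg = lt.avg
	if lt.avg == 0 {
		lt.avg = sample
		return false, prevAvg
	}
	jump = sample > 2*lt.avg
	lt.avg = time.Duration(lt.alpha*float64(sample) + (1-lt.alpha)*float64(lt.avg))
	return jump, prevAvg
}

func main() {
	lt := latencyTracker{alpha: 0.2}
	for _, s := range []time.Duration{34 * time.Millisecond, 35 * time.Millisecond, 71 * time.Millisecond} {
		if jump, prev := lt.observe(s); jump {
			fmt.Printf("latency jump (prev avg %v, current %v)\n", prev, s)
		}
	}
}
```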
I'd suggest lowering the load somehow.
This seems very similar to what we saw in https://github.com/cockroachdb/cockroach/issues/133516#issuecomment-2445212381. @rafiss, as a pointer, it might make sense to extend `wait-for-zone-config-changes` to also ensure that a replica was successfully upreplicated.
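As a rough sketch of what that extension could look like (the helper names `succeedsSoon` and `replicaOnNode` here are hypothetical stand-ins, not the test's actual functions), the datadriven command would poll until the expected replica shows up in the range descriptor:

```go
package main

import (
	"fmt"
	"time"
)

// succeedsSoon retries fn until it returns nil or the timeout expires,
// mirroring the shape of testutils.SucceedsSoon, which the real test
// helper could use directly.
func succeedsSoon(timeout time.Duration, fn func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := fn()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition not met within %v: %w", timeout, err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}

// replicaOnNode is a hypothetical predicate standing in for a real check
// of the range descriptor (e.g. via SHOW RANGES or a KV-level lookup)
// that the expected (non-)voter has been upreplicated.
func replicaOnNode(nodeID int) error {
	return fmt.Errorf("replica not yet present on n%d", nodeID)
}

func main() {
	// wait-for-zone-config-changes could additionally block on this
	// before the test issues queries that expect a local replica.
	if err := succeedsSoon(time.Second, func() error { return replicaOnNode(8) }); err != nil {
		fmt.Println(err)
	}
}
```

Polling like this would make the test robust to slow upreplication under overload, rather than assuming the replica exists as soon as the zone config change is visible.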
Assigning this back to SQL Foundations as the owners of this test, but I'll remove the release-blocker label, given this isn't one.
ccl/multiregionccl.TestMultiRegionDataDriven_regional_by_table failed on release-24.1 @ e5f8e9cc5c6fd09ac54a8cc087f9855e3e6d772d:
Parameters:
- attempt=1
- run=22
- shard=4
See also: How To Investigate a Go Test Failure (internal)
Same failure on other branches
- #132041 ccl/multiregionccl: TestMultiRegionDataDriven_regional_by_table failed [query is not served locally even though it's running on a non-voter with AOST] [C-bug C-test-failure O-robot P-2 T-kv branch-release-24.2]
/cc @cockroachdb/sql-foundations
This test on roachdash | Improve this report!
Jira issue: CRDB-44249