Open cockroach-teamcity opened 5 days ago
roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ 41084720464c4144f64d9ddcb46508b4d762c4e8:
(assertions.go:363).Fail:
Error Trace: github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:733
main/pkg/cmd/roachtest/test_runner.go:1284
src/runtime/asm_amd64.s:1695
Error: Should be true
Test: perturbation/metamorphic/decommission
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/decommission/run_1
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_runtimeAssertionsBuild=false
ROACHTEST_ssd=1
See: roachtest README
See: How To Investigate (internal)
See: Grafana
This appears to be a real issue with a cluster "slowdown" at ~13:19:25, although I'm not sure the full cause. From the grafana logs or this, there is definitely a slowdown at the time. The "most impacted" nodes are n27 and n29. Looking at logs from the time we see this logging which indicates there was slowness on the node, but the CPU is not overly high.
I240924 13:19:25.295978 1603666 1@jobs/structured_log.go:60 ⋮ [T1,Vsystem,n29] 507 ={"Timestamp":1727183965295974608,"EventType":"status_change","JobID":1006349461814018077,"JobType":"AUTO CREATE STATS","Description":"status changed to: succeeded","PreviousStatus":"running","NewStatus":"succeeded","RunNum":1}
I240924 13:19:25.406736 1629760 kv/kvserver/replica_raftstorage.go:601 ⋮ [T1,Vsystem,n29,s58,r190/5:‹/Table/106/1/-1{94827…-29102…}›] 508 applied snapshot 86119064 from (n27,s54):4 at applied index 356857 (total=8ms data=430 MiB excise=true ingestion=9@7ms)
W240924 13:19:25.653903 1630024 kv/kvserver/closedts/sidetransport/receiver.go:140 ⋮ [n29,remote=27] 509 closed timestamps side-transport connection dropped from node: 27 (grpc: ‹context canceled› [code 1/Canceled])
E240924 13:19:25.794056 1429499 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.224:55736,hostssl,user=‹roachprod›] 510 attempting Get [/Table/106/1/‹-1847189113840296792›/‹0›], [txn: 57789301], [can-forward-ts], [max_span_request_keys: 0], [target_bytes: 10485760] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
E240924 13:19:25.796496 1389888 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.229:46532,hostssl,user=‹roachprod›] 511 attempting Put [/Table/106/1/‹-1516357307523256090›/‹0›], EndTxn(parallel commit) [/Table/106/1/‹-1516357307523256090›/‹0›], [txn: 7a2dbb4b], [can-forward-ts] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
E240924 13:19:25.908827 1389888 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.229:46532,hostssl,user=‹roachprod›] 512 attempting Put [/Table/106/1/‹-1516357307523256090›/‹0›], EndTxn(parallel commit) [/Table/106/1/‹-1516357307523256090›/‹0›], [txn: 7a2dbb4b], [can-forward-ts] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
E240924 13:19:26.129494 1462346 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.224:45638,hostssl,user=‹roachprod›] 513 attempting Get [/Table/106/1/‹-1807450774131348699›/‹0›], [txn: 2f638bae], [can-forward-ts], [max_span_request_keys: 0], [target_bytes: 10485760] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
E240924 13:19:26.189593 1528203 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.224:56968,hostssl,user=‹roachprod›] 514 attempting Put [/Table/106/1/‹-1846315717432478801›/‹0›], EndTxn(parallel commit) [/Table/106/1/‹-1846315717432478801›/‹0›], [txn: 1409c872], [can-forward-ts] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
E240924 13:19:26.190686 1590187 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.224:58084,hostssl,user=‹roachprod›] 515 attempting Put [/Table/106/1/‹-1728732515854504912›/‹0›], EndTxn(parallel commit) [/Table/106/1/‹-1728732515854504912›/‹0›], [txn: 812c8fe0], [can-forward-ts] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
E240924 13:19:26.269492 1462368 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.224:45642,hostssl,user=‹roachprod›] 516 attempting Put [/Table/106/1/‹-1293724317752998877›/‹0›], EndTxn(parallel commit) [/Table/106/1/‹-1293724317752998877›/‹0›], [txn: a421893a], [can-forward-ts] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
E240924 13:19:26.270869 1462501 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.224:45692,hostssl,user=‹roachprod›] 517 attempting Put [/Table/106/1/‹-1722843347266742932›/‹0›], EndTxn(parallel commit) [/Table/106/1/‹-1722843347266742932›/‹0›], [txn: 136a2158], [can-forward-ts] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
E240924 13:19:26.500146 1178664 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.229:52946,hostssl,user=‹roachprod›] 518 attempting Put [/Table/106/1/‹-1789407889894350359›/‹0›], EndTxn(parallel commit) [/Table/106/1/‹-1789407889894350359›/‹0›], [txn: a446b3d4], [can-forward-ts] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
W240924 13:19:26.660635 612 2@kv/kvserver/replica_proposal.go:623 ⋮ [T1,Vsystem,n29,s58,r190/5:‹/Table/106/1/-1{94827…-29102…}›,raft] 519 applied lease after ~1.22s replication lag, client traffic may have been delayed [lease=repl=(n29,s58):5VOTER_INCOMING seq=12 start=1727183965.443681227,0 exp=1727183971.442220124,0 pro=1727183965.442220124,0 prev=repl=(n27,s54):4 seq=11 start=1727183844.140593557,0 exp=1727183968.586358653,0 pro=1727183962.586358653,0 acquisition-type=Transfer]
There is CPU AC throttling that occurs around this time due to high runnable goroutines, but I don't know why they are high. From one of the statement bundles 2024-09-24T13_19_26Z-1006349736163442694.zip we see
4.291ms 0.000ms [local proposal: {count: 1, duration 2ms}]
4.316ms 0.024ms event:server/node.go:1601 [n29] node received request: 1 Put, 1 EndTxn
6.027ms 1.711ms === operation:admissionWorkQueueWait _verbose:1 node:29
118.542ms 112.515ms event:wait_duration_nanos:110534565 queue_kind:"kv-regular-cpu-queue" work_priority:"normal-pri"
116.561ms -1.981ms structured:{"@type":"type.googleapis.com/cockroach.util.admission.admissionpb.AdmissionWorkQueueStats","waitDurationNanos":"0.110534565s","queueKind":"kv-regular-cpu-queue","workPriority":"normal-pri"}
roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ 8e6e4090457565a41bc3bd8ea954e437030d1c49:
(assertions.go:363).Fail:
Error Trace: github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:733
main/pkg/cmd/roachtest/test_runner.go:1284
src/runtime/asm_amd64.s:1695
Error: Should be true
Test: perturbation/metamorphic/decommission
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/decommission/run_1
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_runtimeAssertionsBuild=false
ROACHTEST_ssd=2
See: roachtest README
See: How To Investigate (internal)
See: Grafana
roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ a055dec86299452f1951416c16f5938e523f8559:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=4
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_runtimeAssertionsBuild=false
ROACHTEST_ssd=2
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
/cc @cockroachdb/kv-triageThis test on roachdash | Improve this report!
Jira issue: CRDB-42476