cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.95k stars 3.79k forks source link

roachtest: perturbation/metamorphic/decommission failed #131287

Open cockroach-teamcity opened 5 days ago

cockroach-teamcity commented 5 days ago

roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ a055dec86299452f1951416c16f5938e523f8559:

(assertions.go:363).Fail: 
    Error Trace:    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:733
                                main/pkg/cmd/roachtest/test_runner.go:1285
                                src/runtime/asm_amd64.s:1695
    Error:          Should be true
    Test:           perturbation/metamorphic/decommission
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/decommission/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-42476

cockroach-teamcity commented 4 days ago

roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ 41084720464c4144f64d9ddcb46508b4d762c4e8:

(assertions.go:363).Fail: 
    Error Trace:    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:733
                                main/pkg/cmd/roachtest/test_runner.go:1284
                                src/runtime/asm_amd64.s:1695
    Error:          Should be true
    Test:           perturbation/metamorphic/decommission
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/decommission/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

andrewbaptist commented 4 days ago

This appears to be a real issue with a cluster "slowdown" at ~13:19:25, although I'm not sure the full cause. From the grafana logs or this, there is definitely a slowdown at the time. The "most impacted" nodes are n27 and n29. Looking at logs from the time we see this logging which indicates there was slowness on the node, but the CPU is not overly high.

I240924 13:19:25.295978 1603666 1@jobs/structured_log.go:60 ⋮ [T1,Vsystem,n29] 507 ={"Timestamp":1727183965295974608,"EventType":"status_change","JobID":1006349461814018077,"JobType":"AUTO CREATE STATS","Description":"status changed to: succeeded","PreviousStatus":"running","NewStatus":"succeeded","RunNum":1}
I240924 13:19:25.406736 1629760 kv/kvserver/replica_raftstorage.go:601 ⋮ [T1,Vsystem,n29,s58,r190/5:‹/Table/106/1/-1{94827…-29102…}›] 508  applied snapshot 86119064 from (n27,s54):4 at applied index 356857 (total=8ms data=430 MiB excise=true ingestion=9@7ms)
W240924 13:19:25.653903 1630024 kv/kvserver/closedts/sidetransport/receiver.go:140 ⋮ [n29,remote=27] 509  closed timestamps side-transport connection dropped from node: 27 (grpc: ‹context canceled› [code 1/Canceled])
E240924 13:19:25.794056 1429499 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.224:55736,hostssl,user=‹roachprod›] 510  attempting Get [/Table/106/1/‹-1847189113840296792›/‹0›], [txn: 57789301], [can-forward-ts], [max_span_request_keys: 0], [target_bytes: 10485760] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
E240924 13:19:25.796496 1389888 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.229:46532,hostssl,user=‹roachprod›] 511  attempting Put [/Table/106/1/‹-1516357307523256090›/‹0›], EndTxn(parallel commit) [/Table/106/1/‹-1516357307523256090›/‹0›], [txn: 7a2dbb4b], [can-forward-ts] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
E240924 13:19:25.908827 1389888 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.229:46532,hostssl,user=‹roachprod›] 512  attempting Put [/Table/106/1/‹-1516357307523256090›/‹0›], EndTxn(parallel commit) [/Table/106/1/‹-1516357307523256090›/‹0›], [txn: 7a2dbb4b], [can-forward-ts] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
E240924 13:19:26.129494 1462346 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.224:45638,hostssl,user=‹roachprod›] 513  attempting Get [/Table/106/1/‹-1807450774131348699›/‹0›], [txn: 2f638bae], [can-forward-ts], [max_span_request_keys: 0], [target_bytes: 10485760] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
E240924 13:19:26.189593 1528203 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.224:56968,hostssl,user=‹roachprod›] 514  attempting Put [/Table/106/1/‹-1846315717432478801›/‹0›], EndTxn(parallel commit) [/Table/106/1/‹-1846315717432478801›/‹0›], [txn: 1409c872], [can-forward-ts] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
E240924 13:19:26.190686 1590187 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.224:58084,hostssl,user=‹roachprod›] 515  attempting Put [/Table/106/1/‹-1728732515854504912›/‹0›], EndTxn(parallel commit) [/Table/106/1/‹-1728732515854504912›/‹0›], [txn: 812c8fe0], [can-forward-ts] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
E240924 13:19:26.269492 1462368 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.224:45642,hostssl,user=‹roachprod›] 516  attempting Put [/Table/106/1/‹-1293724317752998877›/‹0›], EndTxn(parallel commit) [/Table/106/1/‹-1293724317752998877›/‹0›], [txn: a421893a], [can-forward-ts] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
E240924 13:19:26.270869 1462501 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.224:45692,hostssl,user=‹roachprod›] 517  attempting Put [/Table/106/1/‹-1722843347266742932›/‹0›], EndTxn(parallel commit) [/Table/106/1/‹-1722843347266742932›/‹0›], [txn: 136a2158], [can-forward-ts] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
E240924 13:19:26.500146 1178664 kv/kvclient/kvcoord/dist_sender.go:2705 ⋮ [T1,Vsystem,n29,client=10.142.0.229:52946,hostssl,user=‹roachprod›] 518  attempting Put [/Table/106/1/‹-1789407889894350359›/‹0›], EndTxn(parallel commit) [/Table/106/1/‹-1789407889894350359›/‹0›], [txn: a446b3d4], [can-forward-ts] to route to leaseholder, but the leaseholder is unknown ‹et:desc:r190:/Table/106/1/-1{948279515324011160-291028938338912812} [(n2,s4):1, (n27,s54):4VOTER_DEMOTING_LEARNER, (n12,s23):3, (n29,s58):5VOTER_INCOMING, next=6, gen=54], lease:<empty>›
W240924 13:19:26.660635 612 2@kv/kvserver/replica_proposal.go:623 ⋮ [T1,Vsystem,n29,s58,r190/5:‹/Table/106/1/-1{94827…-29102…}›,raft] 519  applied lease after ~1.22s replication lag, client traffic may have been delayed [lease=repl=(n29,s58):5VOTER_INCOMING seq=12 start=1727183965.443681227,0 exp=1727183971.442220124,0 pro=1727183965.442220124,0 prev=repl=(n27,s54):4 seq=11 start=1727183844.140593557,0 exp=1727183968.586358653,0 pro=1727183962.586358653,0 acquisition-type=Transfer]

There is CPU AC throttling that occurs around this time due to high runnable goroutines, but I don't know why they are high. From one of the statement bundles 2024-09-24T13_19_26Z-1006349736163442694.zip we see

     4.291ms      0.000ms                                [local proposal: {count: 1, duration 2ms}]
     4.316ms      0.024ms                                event:server/node.go:1601 [n29] node received request: 1 Put, 1 EndTxn
     6.027ms      1.711ms                                    === operation:admissionWorkQueueWait _verbose:1 node:29
   118.542ms    112.515ms                                    event:wait_duration_nanos:110534565 queue_kind:"kv-regular-cpu-queue" work_priority:"normal-pri"
   116.561ms     -1.981ms                                    structured:{"@type":"type.googleapis.com/cockroach.util.admission.admissionpb.AdmissionWorkQueueStats","waitDurationNanos":"0.110534565s","queueKind":"kv-regular-cpu-queue","workPriority":"normal-pri"}
cockroach-teamcity commented 4 days ago

roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ 8e6e4090457565a41bc3bd8ea954e437030d1c49:

(assertions.go:363).Fail: 
    Error Trace:    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:733
                                main/pkg/cmd/roachtest/test_runner.go:1284
                                src/runtime/asm_amd64.s:1695
    Error:          Should be true
    Test:           perturbation/metamorphic/decommission
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/decommission/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!