cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: c2c/mixed-version failed #134281

Closed cockroach-teamcity closed 3 weeks ago

cockroach-teamcity commented 3 weeks ago

roachtest.c2c/mixed-version failed with artifacts on master @ 015b2f48cf80a6d8b60d7038c8c3457d934c716a:

(mixedversion.go:732).Run: source: preparing to run step 8: failed to get binary version for node 2 (system): context deadline exceeded
(mixedversion.go:732).Run: dest: cluster.StopE: one or more parallel execution failure(s): context canceled
test artifacts and logs in: /artifacts/c2c/mixed-version/cpu_arch=arm64/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-43993

dt commented 3 weeks ago

n2 is nowhere to be found in the logs. Did it get preempted or something?

blathers-crl[bot] commented 3 weeks ago

cc @cockroachdb/test-eng

DarrylWong commented 3 weeks ago

> Did it get preempted or something?

This is on Azure, which doesn't support spot VMs yet.

srosenberg commented 3 weeks ago

Something happened to n2:

I241105 06:52:11.107020 30981 rpc/heartbeat.go:174 ⋮ [-] 734  failing ping request from node n2
E241105 06:52:11.107518 25897 kv/kvserver/replica_consistency.go:764 ⋮ [T1,Vsystem,n1,s1,r99/4:‹/Tenant/3/Table/11{2/1…-4/1…}›] 735  checksum computation failed: context canceled
I241105 06:52:11.151482 30977 rpc/heartbeat.go:174 ⋮ [-] 736  failing ping request from node n2
W241105 06:52:11.171362 2764 kv/kvserver/closedts/sidetransport/sender.go:838 ⋮ [T1,Vsystem,n1,ctstream=2] 737  failed to send closed timestamp message 601 to n2: send msg error: ‹EOF›
I241105 06:52:11.803854 31033 rpc/heartbeat.go:174 ⋮ [-] 738  failing ping request from node n2
I241105 06:52:13.480847 29676 sql/stats/automatic_stats.go:865 ⋮ [T1,Vsystem,n1] 739  automatically executing ‹"CREATE STATISTICS __auto__ FROM [54] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
E241105 06:52:15.096939 1828 2@rpc/peer.go:668 ⋮ [T1,Vsystem,n1,rnode=2,raddr=‹10.1.0.156:26257›,class=system,rpc] 740  failed connection attempt‹ (last connected 4.001s ago)›: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.1.0.156:26257: i/o timeout"› [code 14/Unavailable]

At first glance, there doesn't appear to be anything wrong with the test infra. Handing over to DR for further triage.
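For anyone repeating this kind of scan over n1's log, here is a minimal sketch that parses entries and flags the ones referring to a given node. The regular expression is inferred only from the excerpt above, not from the CockroachDB log format documentation, and `parseEntry` and `mentionsNode` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// entryRE is an assumption inferred from the excerpt in this thread:
// severity letter, yymmdd date, time, goroutine id, [channel@]file:line,
// the "⋮" marker, bracketed tags, an entry counter, then the message.
var entryRE = regexp.MustCompile(
	`^([IWEF])(\d{6}) (\d{2}:\d{2}:\d{2}\.\d{6}) (\d+) (\S+) ⋮ \[(.*?)\] (\d+)\s+(.*)$`)

// entry holds the fields of one parsed log line.
type entry struct {
	Severity, Date, Time, Source, Tags, Message string
}

// parseEntry returns the parsed entry, or false if the line does not match.
func parseEntry(line string) (entry, bool) {
	m := entryRE.FindStringSubmatch(line)
	if m == nil {
		return entry{}, false
	}
	return entry{Severity: m[1], Date: m[2], Time: m[3], Source: m[5], Tags: m[6], Message: m[8]}, true
}

// mentionsNode reports whether the entry has an rnode=<id> tag or
// contains n<id> as a whole word in its message.
func mentionsNode(e entry, id int) bool {
	want := fmt.Sprintf("rnode=%d", id)
	for _, t := range strings.Split(e.Tags, ",") {
		if t == want {
			return true
		}
	}
	word := regexp.MustCompile(fmt.Sprintf(`\bn%d\b`, id))
	return word.MatchString(e.Message)
}

func main() {
	// Two lines copied from the excerpt above.
	lines := []string{
		"I241105 06:52:11.107020 30981 rpc/heartbeat.go:174 ⋮ [-] 734  failing ping request from node n2",
		"E241105 06:52:11.107518 25897 kv/kvserver/replica_consistency.go:764 ⋮ [T1,Vsystem,n1,s1,r99/4:‹/Tenant/3/Table/11{2/1…-4/1…}›] 735  checksum computation failed: context canceled",
	}
	for _, l := range lines {
		if e, ok := parseEntry(l); ok && mentionsNode(e, 2) {
			fmt.Println(e.Time, e.Severity, e.Source, e.Message)
		}
	}
}
```

Sorting the matching entries by time gives the first moment n1 noticed trouble with n2, which can then be compared against the roachtest step timeline.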

dt commented 3 weeks ago

> there doesn't appear to be anything wrong with the test infra

@srosenberg Where are node 2's logs?

srosenberg commented 3 weeks ago

> > there doesn't appear to be anything wrong with the test infra
>
> @srosenberg Where are node 2's logs?

Since n2 became (and stayed) unreachable during the test, its logs could not be downloaded.

dt commented 3 weeks ago

Is that the cockroach process or the vm that we're saying was unreachable?

srosenberg commented 3 weeks ago

> Is that the cockroach process or the vm that we're saying was unreachable?

Definitely the VM was unreachable. I scanned the other available logs, but nothing really stood out. All things considered, this is likely a transient issue in Azure. Feel free to close it, assuming nothing else sticks out wrt what's being tested.