cockroachdb / cockroach

roachtest: db-console/cypress failed #135143

Open cockroach-teamcity opened 2 days ago

cockroach-teamcity commented 2 days ago

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.db-console/cypress failed with artifacts on master @ 39e43b85ec3b02bc760df10fce1c19d09419d6f2:

(db_console.go:137).seedCluster: dial tcp 20.102.110.21:26257: connect: connection refused
test artifacts and logs in: /artifacts/db-console/cypress/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

/cc @cockroachdb/obs-prs

This test on roachdash | Improve this report!

Jira issue: CRDB-44371

cockroach-teamcity commented 2 days ago

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.db-console/cypress failed with artifacts on master @ 39e43b85ec3b02bc760df10fce1c19d09419d6f2:

(db_console.go:137).seedCluster: dial tcp 3.144.110.101:26257: connect: connection refused
test artifacts and logs in: /artifacts/db-console/cypress/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

cockroach-teamcity commented 2 days ago

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.db-console/cypress failed with artifacts on master @ 39e43b85ec3b02bc760df10fce1c19d09419d6f2:

(db_console.go:137).seedCluster: read tcp 172.17.0.3:34796 -> 35.196.99.252:26257: read: connection reset by peer
test artifacts and logs in: /artifacts/db-console/cypress/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

dhartunian commented 2 days ago

I see this error on node 1 which seems completely unrelated:

E241114 08:20:20.651537 16780 (gostd) net/http/server.go:3416 ⋮ [-] 318  ‹http: TLS handshake error from 10.0.0.3:55430: remote error: tls: bad certificate›
E241114 08:20:28.870261 17115 (gostd) net/http/server.go:3416 ⋮ [-] 319  ‹http: TLS handshake error from 10.0.0.3:49538: remote error: tls: bad certificate›
E241114 08:20:35.652404 17365 (gostd) net/http/server.go:3416 ⋮ [-] 320  ‹http: TLS handshake error from 10.0.0.3:56574: remote error: tls: bad certificate›
E241114 08:20:43.869805 17585 (gostd) net/http/server.go:3416 ⋮ [-] 321  ‹http: TLS handshake error from 10.0.0.3:60280: remote error: tls: bad certificate›
E241114 08:20:50.653134 17801 (gostd) net/http/server.go:3416 ⋮ [-] 322  ‹http: TLS handshake error from 10.0.0.3:60292: remote error: tls: bad certificate›
E241114 08:20:58.869385 18071 (gostd) net/http/server.go:3416 ⋮ [-] 323  ‹http: TLS handshake error from 10.0.0.3:48848: remote error: tls: bad certificate›
E241114 08:21:05.652483 18287 (gostd) net/http/server.go:3416 ⋮ [-] 324  ‹http: TLS handshake error from 10.0.0.3:47858: remote error: tls: bad certificate›
W241114 08:21:09.152322 338 kv/kvserver/liveness/liveness.go:753 ⋮ [T1,Vsystem,n1,liveness-hb] 325  slow heartbeat took 3.002216064s; err=result is ambiguous: context done during DistSender.Send: ba: ‹ConditionalPut [/System/NodeLiveness/1], EndTxn(commit modified-span (node-liveness)) [/System/NodeLiveness/1], [txn: 25ae64bb], [can-forward-ts]› RPC error: grpc: ‹context deadline exceeded› [code 4/DeadlineExceeded]
W241114 08:21:09.152591 338 kv/kvserver/liveness/liveness.go:667 ⋮ [T1,Vsystem,n1,liveness-hb] 326  failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 3.002s (given timeout 3s): result is ambiguous: context done during DistSender.Send: ba: ‹ConditionalPut [/System/NodeLiveness/1], EndTxn(commit modified-span (node-liveness)) [/System/NodeLiveness/1], [txn: 25ae64bb], [can-forward-ts]› RPC error: grpc: ‹context deadline exceeded› [code 4/DeadlineExceeded]
W241114 08:21:09.152591 338 kv/kvserver/liveness/liveness.go:667 ⋮ [T1,Vsystem,n1,liveness-hb] 326 +
W241114 08:21:09.152591 338 kv/kvserver/liveness/liveness.go:667 ⋮ [T1,Vsystem,n1,liveness-hb] 326 +An inability to maintain liveness will prevent a node from participating in a
W241114 08:21:09.152591 338 kv/kvserver/liveness/liveness.go:667 ⋮ [T1,Vsystem,n1,liveness-hb] 326 +cluster. If this problem persists, it may be a sign of resource starvation or
W241114 08:21:09.152591 338 kv/kvserver/liveness/liveness.go:667 ⋮ [T1,Vsystem,n1,liveness-hb] 326 +of network connectivity problems. For help troubleshooting, visit:
W241114 08:21:09.152591 338 kv/kvserver/liveness/liveness.go:667 ⋮ [T1,Vsystem,n1,liveness-hb] 326 +
W241114 08:21:09.152591 338 kv/kvserver/liveness/liveness.go:667 ⋮ [T1,Vsystem,n1,liveness-hb] 326 +    https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#node-liveness-issues
W241114 08:21:11.370231 7992 kv/kvserver/closedts/sidetransport/receiver.go:135 ⋮ [n1,remote=4] 327  closed timestamps side-transport connection dropped from node: 4 (grpc: ‹context canceled› [code 1/Canceled])
W241114 08:21:11.489699 5033 kv/kvserver/raft_transport.go:1067 ⋮ [T1,Vsystem,n1] 328  while processing outgoing Raft queue to node 4: recv msg error: grpc: ‹grpc: the client connection is closing› [code 1/Canceled]:
E241114 08:21:11.489775 3906 2@rpc/peer.go:642 ⋮ [T1,Vsystem,n1,rnode=4,raddr=‹10.142.3.18:26257›,class=system,rpc] 329  disconnected (was healthy for 1m56.104s): operation "conn heartbeat" timed out after 6.001s (given timeout 6s): grpc: ‹context deadline exceeded› [code 4/DeadlineExceeded]

The connection error happens when we try to seed the data:

read tcp 172.17.0.3:34796 -> 35.196.99.252:26257: read: connection reset by peer
(1) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*dbConsoleCypressTest).seedCluster
  |     pkg/cmd/roachtest/tests/db_console.go:137
  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*dbConsoleCypressTest).SetupTest
  |     pkg/cmd/roachtest/tests/db_console.go:99
  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runDbConsoleCypress
  |     pkg/cmd/roachtest/tests/db_console.go:235
  | main.(*testRunner).runTest.func2
  |     pkg/cmd/roachtest/test_runner.go:1305
  | runtime.goexit
  |     src/runtime/asm_amd64.s:1695
Wraps: (2) secondary error attachment
  | read tcp 172.17.0.3:34796 -> 35.196.99.252:26257: read: connection reset by peer
  | (1) read tcp 172.17.0.3:34796 -> 35.196.99.252:26257
  | Wraps: (2) read
  | Wraps: (3) connection reset by peer
  | Error types: (1) *net.OpError (2) *os.SyscallError (3) syscall.Errno
Wraps: (3) read tcp 172.17.0.3:34796 -> 35.196.99.252:26257: read: connection reset by peer
Error types: (1) *withstack.withStack (2) *secondary.withSecondaryError (3) *errutil.leafError

It looks like the issue might be that we're trying to connect to the workload node to run the seed queries. @kyle-a-wong, can you confirm? (link below)

https://github.com/cockroachdb/cockroach/blob/9bf4f80a95abab25bbba7efca81cdbca842cdfa0/pkg/cmd/roachtest/tests/db_console.go#L230-L235