cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.11k stars 3.81k forks source link

retrying liveness update after storage.errRetryLiveness: result is ambiguous (context done during DistSender.Send: context deadline exceeded) #121818

Open chenqunjian opened 7 months ago

chenqunjian commented 7 months ago

Describe the problem

Please describe the issue you observed, and any steps we can take to reproduce it:

To Reproduce

What did you do? Describe in your own words.

If possible, provide steps to reproduce the behavior:

2: 172.17.0.115:6257 (5m8s: infos 567/1673 sent/received, bytes 563290B/1127097B sent/received) gossip server (0/3 cur/max conns, infos 567/1673 sent/received, bytes 563290B/1127097B sent/received) I240405 03:48:33.441362 1409 server/status/runtime.go:500 [n1] runtime stats: 9.0 GiB RSS, 600 goroutines, 346 MiB/537 MiB/1.1 GiB GO alloc/idle/total, 6.7 GiB/8.0 GiB CGO alloc/total, 45048.9 CGO/sec, 366.1/25.1 %(u/s)time, 0.1 %gc (9x), 6.5 MiB/96 MiB (r/w)net I240405 03:48:34.856313 1464 storage/node_liveness.go:775 [n1,hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (context done during DistSender.Send: context deadline exceeded) W240405 03:48:34.856345 1464 storage/node_liveness.go:523 [n1,hb] slow heartbeat took 4.5s W240405 03:48:34.856361 1464 storage/node_liveness.go:463 [n1,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s I240405 03:48:37.856579 404130 internal/client/txn.go:618 [n1] async rollback failed: aborted during DistSender.Send: context deadline exceeded W240405 03:48:39.356597 1464 storage/node_liveness.go:523 [n1,hb] slow heartbeat took 4.5s I240405 03:48:39.356618 1464 storage/node_liveness.go:453 [n1,hb] heartbeat failed on epoch increment; retrying W240405 03:48:39.356661 404044 storage/nodeliveness.go:523 [n1,s1,r4317/3:/Table/346{-/1/"fc…}] slow heartbeat took 1.5s E240405 03:48:39.356671 404044 storage/replica_rangelease.go:292 [n1,s1,r4317/3:/Table/346{-/1/"fc…}] heartbeat failed on epoch increment W240405 03:48:39.356777 404322 storage/node_liveness.go:523 [n1,s1,r164/14:/System/tsd/cr.node.sql.…] slow heartbeat took 1.5s E240405 03:48:39.356792 404322 storage/replica_range_lease.go:292 [n1,s1,r164/14:/System/tsd/cr.node.sql.…] heartbeat failed on epoch increment W240405 03:48:39.357022 404282 storage/node_liveness.go:523 [n1,s1,r27921/1:/System/tsd/cr.node.sql.mem.s…] slow heartbeat took 1.5s E240405 03:48:39.357034 404282 storage/replica_range_lease.go:292 [n1,s1,r27921/1:/System/tsd/cr.node.sql.mem.s…] heartbeat failed on epoch increment W240405 03:48:39.357184 404283 storage/node_liveness.go:523 [n1,s1,r751160/3:/Table/604/1/"20220{301…-228…}] slow heartbeat took 1.5s E240405 03:48:39.357207 404283 storage/replica_range_lease.go:292 [n1,s1,r751160/3:/Table/604/1/"20220{301…-228…}] heartbeat failed on epoch increment W240405 03:48:39.357311 404373 storage/node_liveness.go:523 [n1,s1,r3072/3:/System/tsd/cr.store.t{o…-x…}] slow heartbeat took 1.5s E240405 03:48:39.357319 404373 storage/replica_range_lease.go:292 [n1,s1,r3072/3:/System/tsd/cr.store.t{o…-x…}] heartbeat failed on epoch increment W240405 03:48:39.357376 404045 storage/node_liveness.go:523 [n1,s1,r3121/1:/System/tsd/cr.store.queue.…] slow heartbeat took 1.5s E240405 03:48:39.357386 404045 storage/replica_range_lease.go:292 [n1,s1,r3121/1:/System/tsd/cr.store.queue.…] heartbeat failed on epoch increment W240405 03:48:39.357461 404374 storage/node_liveness.go:523 [n1,s1,r6/1:/Table/{SystemCon…-11}] slow heartbeat took 1.4s E240405 03:48:39.357471 404374 storage/replica_range_lease.go:292 [n1,s1,r6/1:/Table/{SystemCon…-11}] heartbeat failed on epoch increment W240405 03:48:39.357595 404018 storage/node_liveness.go:523 [n1,s1,r3088/3:/System/tsd/cr.node.txn.resta…] slow heartbeat took 1.4s E240405 03:48:39.357608 404018 storage/replica_range_lease.go:292 [n1,s1,r3088/3:/System/tsd/cr.node.txn.resta…] heartbeat failed on epoch increment W240405 03:48:39.357713 404357 storage/node_liveness.go:523 [n1,s1,r190/15:/System/tsd/cr.store.queue.…] slow heartbeat took 1.4s E240405 03:48:39.357722 404357 storage/replica_range_lease.go:292 [n1,s1,r190/15:/System/tsd/cr.store.queue.…] heartbeat failed on epoch increment W240405 03:48:39.357804 404046 storage/node_liveness.go:523 [n1,s1,r3165/3:/System/tsd/cr.node.sql.dists…] slow heartbeat took 1.4s E240405 03:48:39.357814 404046 storage/replica_range_lease.go:292 [n1,s1,r3165/3:/System/tsd/cr.node.sql.dists…] heartbeat failed on epoch increment W240405 03:48:39.357873 404375 storage/node_liveness.go:523 [n1,s1,r6398/3:/Table/36{8-9}] slow heartbeat took 1.4s E240405 03:48:39.357880 404375 storage/replica_range_lease.go:292 [n1,s1,r6398/3:/Table/36{8-9}] heartbeat failed on epoch increment W240405 03:48:39.357944 404483 storage/nodeliveness.go:523 [n1,s1,r4330/3:/Table/359{-/1/"fc…}] slow heartbeat took 1.4s E240405 03:48:39.357952 404483 storage/replica_rangelease.go:292 [n1,s1,r4330/3:/Table/359{-/1/"fc…}] heartbeat failed on epoch increment W240405 03:48:39.358033 404047 storage/node_liveness.go:523 [n1,s1,r187/9:/System/tsd/cr.node.sys.…] slow heartbeat took 1.4s E240405 03:48:39.358045 404047 storage/replica_range_lease.go:292 [n1,s1,r187/9:/System/tsd/cr.node.sys.…] heartbeat failed on epoch increment W240405 03:48:39.358117 404299 storage/node_liveness.go:523 [n1,s1,r3110/3:/System/tsd/cr.store.{ad…-co…}] slow heartbeat took 1.4s E240405 03:48:39.358123 404299 storage/replica_range_lease.go:292 [n1,s1,r3110/3:/System/tsd/cr.store.{ad…-co…}] heartbeat failed on epoch increment W240405 03:48:39.358203 404499 storage/node_liveness.go:523 [n1,s1,r2228/3:/Table/25{3-4}] slow heartbeat took 1.4s E240405 03:48:39.358212 404499 storage/replica_range_lease.go:292 [n1,s1,r2228/3:/Table/25{3-4}] heartbeat failed on epoch increment W240405 03:48:39.965638 211 storage/closedts/provider/provider.go:156 [ct-closer] unable to move closed timestamp forward: not live I240405 03:48:42.357132 404424 internal/client/txn.go:618 [n1] async rollback failed: aborted during DistSender.Send: context deadline exceeded I240405 03:48:43.445241 1409 server/status/runtime.go:500 [n1] runtime stats: 9.0 GiB RSS, 640 goroutines, 520 MiB/409 MiB/1.1 GiB GO alloc/idle/total, 6.7 GiB/8.0 GiB CGO alloc/total, 69073.3 CGO/sec, 265.9/23.2 %(u/s)time, 0.0 %gc (4x), 4.7 MiB/4.5 MiB (r/w)net W240405 03:48:43.607655 1459 server/node.go:880 [n1,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:liveness.heartbeatfailures Value:17 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} I240405 03:48:43.857031 1464 storage/node_liveness.go:775 [n1,hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (context done during DistSender.Send: context deadline exceeded) W240405 03:48:43.857051 1464 storage/node_liveness.go:523 [n1,hb] slow heartbeat took 4.5s W240405 03:48:43.857063 1464 storage/node_liveness.go:463 [n1,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s W240405 03:48:46.839366 404731 storage/node_liveness.go:523 [n1,s1,r3165/3:/System/tsd/cr.node.sql.dists…] slow heartbeat took 4.5s W240405 03:48:46.839482 404836 storage/node_liveness.go:523 [n1,s1,r3088/3:/System/tsd/cr.node.txn.resta…] slow heartbeat took 4.5s W240405 03:48:46.839659 404603 storage/nodeliveness.go:523 [n1,s1,r4317/3:/Table/346{-/1/"fc…}] slow heartbeat took 4.4s W240405 03:48:46.839738 404837 storage/node_liveness.go:523 [n1,s1,r34172/3:/Table/76/1/"2号机1"/"202…] slow heartbeat took 4.4s W240405 03:48:46.839834 404492 storage/node_liveness.go:523 [n1,s1,r164/14:/System/tsd/cr.node.sql.…] slow heartbeat took 4.4s W240405 03:48:46.839949 404838 storage/node_liveness.go:523 [n1,s1,r27921/1:/System/tsd/cr.node.sql.mem.s…] slow heartbeat took 4.4s W240405 03:48:46.840022 404804 storage/node_liveness.go:523 [n1,s1,r3121/1:/System/tsd/cr.store.queue.…] slow heartbeat took 4.4s W240405 03:48:46.840124 404683 storage/node_liveness.go:523 [n1,s1,r6398/3:/Table/36{8-9}] slow heartbeat took 4.4s W240405 03:48:46.840210 404851 storage/node_liveness.go:523 [n1,s1,r6/1:/Table/{SystemCon…-11}] slow heartbeat took 4.4s W240405 03:48:46.840292 404493 storage/node_liveness.go:523 [n1,s1,r227610/3:/Table/1649/1/"2021121{2…-1…}] slow heartbeat took 4.4s W240405 03:48:46.840414 404852 storage/node_liveness.go:523 [n1,s1,r3110/3:/System/tsd/cr.store.{ad…-co…}] slow heartbeat took 4.4s W240405 03:48:46.840484 403856 storage/nodeliveness.go:523 [n1,s1,r4330/3:/Table/359{-/1/"fc…}] slow heartbeat took 4.4s W240405 03:48:46.840576 404573 storage/node_liveness.go:523 [n1,s1,r2228/3:/Table/25{3-4}] slow heartbeat took 4.4s W240405 03:48:46.840646 404574 storage/node_liveness.go:523 [n1,s1,r190/15:/System/tsd/cr.store.queue.…] slow heartbeat took 4.4s W240405 03:48:46.840775 404867 storage/node_liveness.go:523 [n1,s1,r323/18:/Table/13{7-8}] slow heartbeat took 4.3s W240405 03:48:46.840934 404287 storage/node_liveness.go:523 [n1,s1,r3072/3:/System/tsd/cr.store.t{o…-x…}] slow heartbeat took 3.6s W240405 03:48:46.841032 404634 storage/node_liveness.go:523 [n1,s1,r187/9:/System/tsd/cr.node.sys.…] slow heartbeat took 3.6s W240405 03:48:46.841236 404684 storage/node_liveness.go:523 [n1,s1,r751160/3:/Table/604/1/"20220{301…-228…}] slow heartbeat took 3.5s W240405 03:48:46.841393 1464 storage/node_liveness.go:523 [n1,hb] slow heartbeat took 3.0s W240405 03:48:46.841422 404636 storage/node_liveness.go:523 [n1,s1,r759173/3:/Table/1300/5/"{59-Y-…-NC-02…}] slow heartbeat took 1.4s W240405 03:48:46.841522 404699 storage/node_liveness.go:523 [n1,s1,r1775/3:/Table/197{-/1/"dee…}] slow heartbeat took 1.1s I240405 03:48:46.857380 404774 internal/client/txn.go:618 [n1] async rollback failed: aborted during DistSender.Send: context deadline exceeded W240405 03:48:52.374268 151 gossip/gossip.go:1501 [n1] first range unavailable; trying remaining resolvers I240405 03:48:52.374490 409797 gossip/client.go:128 [n1] started gossip client to 172.17.0.116:6257 I240405 03:48:52.856224 1464 storage/node_liveness.go:775 [n1,hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (context done during DistSender.Send: context deadline exceeded) W240405 03:48:52.856242 1464 storage/node_liveness.go:523 [n1,hb] slow heartbeat took 4.5s W240405 03:48:52.856254 1464 storage/node_liveness.go:463 [n1,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s I240405 03:48:53.449009 1409 server/status/runtime.go:500 [n1] runtime stats: 9.1 GiB RSS, 650 goroutines, 379 MiB/509 MiB/1.1 GiB GO alloc/idle/total, 6.8 GiB/8.1 GiB CGO alloc/total, 32676.3 CGO/sec, 369.6/24.9 %(u/s)time, 0.1 %gc (11x), 20 MiB/43 MiB (r/w)net I240405 03:48:55.856629 410135 internal/client/txn.go:618 [n1] async rollback failed: aborted during DistSender.Send: context deadline exceeded I240405 03:48:57.356580 1464 storage/node_liveness.go:775 [n1,hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (context done during DistSender.Send: context deadline exceeded)

Expected behavior A clear and concise description of what you expected to happen.

Additional data / screenshots If the problem is SQL-related, include a copy of the SQL query and the schema of the supporting tables.

If a node in your cluster encountered a fatal error, supply the contents of the log directories (at minimum of the affected node(s), but preferably all nodes).

Note that log files can contain confidential information. Please continue creating this issue, but contact support@cockroachlabs.com to submit the log files in private.

If applicable, add screenshots to help explain your problem.

Environment:

Additional context What was the impact?

Add any other context about the problem here.

Jira issue: CRDB-37566

blathers-crl[bot] commented 7 months ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.