
server: process not terminating after completing tasks during graceful shutdown #101315

Open kvoli opened 1 year ago

kvoli commented 1 year ago

Describe the problem

When running a rolling restart of nodes in a large cluster (v22.2.7), we noticed that nodes being drained did not shut down until 4+ minutes after all leases, connections, and distSQL flows had been shed. The node eventually shut down due to a forceful kill 5 minutes after draining started.

```
I230406 00:48:39.420766 1 1@cli/start.go:752 ⋮ [n42] 11175 received signal 'terminated'
I230406 00:48:39.420853 1 1@cli/start.go:851 ⋮ [n42] 11176 initiating graceful shutdown of server
...
I230406 00:49:04.421111 1760734650 1@sql/pgwire/server.go:619 ⋮ [n42,server drain process] 11196 starting draining SQL connections
...
I230406 00:49:09.420939 1760734651 1@cli/start.go:863 ⋮ [n42] 11256 2887 running tasks
I230406 00:49:09.420995 1760735639 1@server/drain.go:399 ⋮ [n42,server drain process] 11257 number of open connections: 0
I230406 00:49:18.495893 1760734650 1@server/drain.go:285 ⋮ [n42,server drain process] 11262 drain remaining: 695
I230406 00:49:18.495966 1760734650 1@server/drain.go:287 ⋮ [n42,server drain process] 11263 drain details: descriptor leases: 40, liveness record: 1, range lease iterations: 595, SQL clients: 58, distSQL execution flows: 1
I230406 00:49:19.420944 1760734651 1@cli/start.go:863 ⋮ [n42] 11264 1327 running tasks
I230406 00:49:19.773230 1760734650 1@server/drain.go:285 ⋮ [n42,server drain process] 11265 drain remaining: 0
I230406 00:49:24.420935 1760734651 1@cli/start.go:863 ⋮ [n42] 11266 0 running tasks
...
I230406 00:53:39.421139 1760734651 1@cli/start.go:863 ⋮ [n42] 11317 0 running tasks
# Restarts here. Starts back at 00:53:58.
CockroachDB node starting at 2023-04-06 00:53:58.770391738 +0000 UTC m=+4.311766099 (took 4.0s)
```

Other nodes had their RPC connection to the draining node EOF'd once there were 0 running tasks on the draining node:

```
W230406 00:49:18.498855 207579632 kv/kvserver/closedts/sidetransport/receiver.go:139 ⋮ [n44] 140763 closed timestamps side-transport connection dropped from node: 42
W230406 00:49:19.774610 1304307236 kv/kvserver/raft_transport.go:602 ⋮ [n44] 140764 while processing outgoing Raft queue to node 42: ‹EOF›:
W230406 00:49:19.801128 1733318404 kv/kvserver/raft_transport.go:602 ⋮ [n44] 140766 while processing outgoing Raft queue to node 42: recv msg error: ‹rpc error: code = Unknown desc = node unavailable; try another peer›:
```

However, since the process didn't actually stop until 00:53:39, there was a ~260s period during which the process was still running but unreachable for Raft traffic. Due to another issue, there were paused followers on other nodes. The combination of no Raft connectivity and paused followers caused range unavailability, which stalled the workload.

Example request timing out due to unavailability:

```
W230406 00:50:23.895755 1723804188 kv/txn.go:705 ⋮ [n44,client=10.11.3.64:19274,user=‹roach›] 143007 failure aborting transaction: replica unavailable: (n6,s6):3 unable to serve request to r37576:‹/Table/109/2/-484{9033071567352390/"\x00\x00\x00\x00\x00\x00\x00\x00\xbc\xb4\xc6\x1c\xd9\x1bM\xba"-1254572419284788/"\x00\x00\x00\x00\x00\x00\x00\x00\xbc\xd0h\x9d϶\x80\xcc"}› [(n42,s42):1, (n9,s9):2, (n6,s6):3, next=4, gen=59]: closed timestamp: 1680742156.771539906,0 (2023-04-06 00:49:16); raft status: {"id":"3","term":7,"vote":"3","commit":4740189,"lead":"3","raftState":"StateLeader","applied":4740189,"progress":{"1":{"match":4740189,"next":4740396,"state":"StateReplicate"},"2":{"match":0,"next":4651570,"state":"StateSnapshot"},"3":{"match":4755332,"next":4755333,"state":"StateReplicate"}},"leadtransferee":"0"}: encountered poisoned latch ‹/Table/109/2/-4846630336183874321/"\x00\x00\x00\x00\x00\x00\x00\x00\xbc\xbdOcF \xcc\xef"/0›@1680742162.477208011,0: "sql txn" meta={id=184e5d3f key=/Table/109/1/‹"\x00\x00\x00\x00\x00\x00\x00\x00\xbc\xbdOcF \xcc\xef"›/‹0› pri=0.00017057 epo=0 ts=1680742162.477208011,0 min=1680742162.477208011,0 seq=3} lock=true stat=PENDING rts=1680742162.477208011,0 wto=false gul=1680742162.727208011,0; abort caused by: result is ambiguous: replica unavailable: (n6,s6):3 unable to serve request to r37576:‹/Table/109/2/-484{9033071567352390/"\x00\x00\x00\x00\x00\x00\x00\x00\xbc\xb4\xc6\x1c\xd9\x1bM\xba"-1254572419284788/"\x00\x00\x00\x00\x00\x00\x00\x00\xbc\xd0h\x9d϶\x80\xcc"}› [(n42,s42):1, (n9,s9):2, (n6,s6):3, next=4, gen=59]: closed timestamp: 1680742156.771539906,0 (2023-04-06 00:49:16); raft status: {"id":"3","term":7,"vote":"3","commit":4740189,"lead":"3","raftState":"StateLeader","applied":4740189,"progress":{"1":{"match":4740189,"next":4740396,"state":"StateReplicate"},"2":{"match":0,"next":4651570,"state":"StateSnapshot"},"3":{"match":4755330,"next":4755331,"state":"StateReplicate"}},"leadtransferee":"0"}: have been waiting 63.00s for slow proposal InitPut [/Table/109/2/‹-4847935104247747320›/‹"\x00\x00\x00\x00\x00\x00\x00\x00\xbc\xb8\xac\xb5I*\xad\b"›/‹0›,/Min), [txn: 7e70532e]
```

To Reproduce

This was observed on CC with 100 nodes during a rolling restart. I haven't tried to reproduce it in a smaller roachprod cluster yet; the steps I'd take are:

  1. Set up a 5-node CRDB cluster running v22.2.7.
  2. Run the indexes workload on the cluster.
  3. Send a terminate signal to the CRDB processes in a rolling fashion and restart them.
  4. Observe whether any nodes idle at the `I230406 00:53:39.421139 1760734651 1@cli/start.go:863 ⋮ [n42] 11317 0 running tasks` stage for longer than expected before terminating (a rough measurement sketch follows this list).
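
To make steps 3 and 4 concrete, here's a rough, hypothetical Go sketch for timing a single node's shutdown. It is not part of any existing tooling; the PID, the polling approach, and the 5-second threshold are illustrative assumptions.

```go
// Hypothetical helper for steps 3-4: send SIGTERM to one cockroach process
// and measure how long it takes to exit. Assumes a Unix system and that the
// target process is not a child of this program (so signal 0 reports liveness).
package main

import (
	"fmt"
	"os"
	"syscall"
	"time"
)

func main() {
	pid := 12345 // hypothetical PID of the draining cockroach process
	proc, err := os.FindProcess(pid)
	if err != nil {
		panic(err)
	}
	start := time.Now()
	if err := proc.Signal(syscall.SIGTERM); err != nil {
		panic(err)
	}
	// Poll for process exit: signal 0 fails once the process is gone.
	for proc.Signal(syscall.Signal(0)) == nil {
		time.Sleep(time.Second)
	}
	elapsed := time.Since(start)
	fmt.Printf("process %d exited after %s\n", pid, elapsed)
	if elapsed > 5*time.Second {
		fmt.Println("shutdown took longer than the expected ~5s after drain")
	}
}
```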

Expected behavior

Once all the necessary draining has completed, the node shuts down promptly, in under 5 seconds.

Additional data / screenshots

Workload stall

(screenshot of the workload stall)

Logs from the node the snippet above is taken from:

Root Log ``` I230406 00:48:39.420766 1 1@cli/start.go:752 ⋮ [n42] 11175 received signal 'terminated' I230406 00:48:39.420853 1 1@cli/start.go:851 ⋮ [n42] 11176 initiating graceful shutdown of server I230406 00:48:39.420919 1760734650 1@server/drain.go:337 ⋮ [n42,server drain process] 11177 waiting for health probes to notice that the node is not ready for new sql connections I230406 00:48:42.421007 1760735639 1@server/drain.go:399 ⋮ [n42,server drain process] 11178 number of open connections: 58 E230406 00:48:43.325311 98 1@util/log/buffered_sink.go:268 ⋮ [-] 11179 logging error from *log.fluentSink: write tcp ‹10.244.2.241:36888› -> ‹10.244.2.87:5171›: write: connection reset by peer I230406 00:48:44.420935 1760734651 1@cli/start.go:863 ⋮ [n42] 11180 3096 running tasks I230406 00:48:45.421138 1760735639 1@server/drain.go:399 ⋮ [n42,server drain process] 11181 number of open connections: 58 E230406 00:48:48.333292 98 1@util/log/buffered_sink.go:268 ⋮ [-] 11182 logging error from *log.fluentSink: write tcp ‹10.244.2.241:36904› -> ‹10.244.2.87:5171›: write: connection reset by peer I230406 00:48:48.421000 1760735639 1@server/drain.go:399 ⋮ [n42,server drain process] 11183 number of open connections: 58 I230406 00:48:49.422383 1760734651 1@cli/start.go:863 ⋮ [n42] 11184 3348 running tasks I230406 00:48:51.421015 1760735639 1@server/drain.go:399 ⋮ [n42,server drain process] 11185 number of open connections: 58 E230406 00:48:53.339549 98 1@util/log/buffered_sink.go:268 ⋮ [-] 11186 logging error from *log.fluentSink: write tcp ‹10.244.2.241:59736› -> ‹10.244.2.87:5171›: write: connection reset by peer I230406 00:48:54.420989 1760734651 1@cli/start.go:863 ⋮ [n42] 11187 3055 running tasks I230406 00:48:54.421009 1760735639 1@server/drain.go:399 ⋮ [n42,server drain process] 11188 number of open connections: 58 I230406 00:48:57.431020 1760735639 1@server/drain.go:399 ⋮ [n42,server drain process] 11189 number of open connections: 58 E230406 00:48:58.354290 98 1@util/log/buffered_sink.go:268 ⋮ [-] 11190 logging error from *log.fluentSink: write tcp ‹10.244.2.241:59744› -> ‹10.244.2.87:5171›: write: connection reset by peer I230406 00:48:59.420951 1760734651 1@cli/start.go:863 ⋮ [n42] 11191 3051 running tasks I230406 00:49:00.421127 1760735639 1@server/drain.go:399 ⋮ [n42,server drain process] 11192 number of open connections: 58 E230406 00:49:03.361896 98 1@util/log/buffered_sink.go:268 ⋮ [-] 11193 logging error from *log.fluentSink: write tcp ‹10.244.2.241:45328› -> ‹10.244.2.87:5171›: write: connection reset by peer I230406 00:49:03.421087 1760735639 1@server/drain.go:399 ⋮ [n42,server drain process] 11194 number of open connections: 58 I230406 00:49:04.421018 1760734651 1@cli/start.go:863 ⋮ [n42] 11195 2967 running tasks I230406 00:49:04.421111 1760734650 1@sql/pgwire/server.go:619 ⋮ [n42,server drain process] 11196 starting draining SQL connections I230406 00:49:04.428438 1710577792 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:3284,user=‹roach›] 11197 closing existing connection while server is draining I230406 00:49:04.430036 1710570888 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:64911,user=‹roach›] 11198 closing existing connection while server is draining I230406 00:49:04.430586 1707560117 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:26995,user=‹roach›] 11199 closing existing connection while server is draining I230406 00:49:04.430775 1710549624 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:28714,user=‹roach›] 11200 closing existing connection while server is draining 
I230406 00:49:04.431397 1710579341 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:58736,user=‹roach›] 11201 closing existing connection while server is draining I230406 00:49:04.432124 1739573757 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:2496,user=‹roach›] 11202 closing existing connection while server is draining I230406 00:49:04.432880 1710566961 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:60450,user=‹roach›] 11203 closing existing connection while server is draining I230406 00:49:04.433961 1707516144 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:30658,user=‹roach›] 11204 closing existing connection while server is draining I230406 00:49:04.434043 1716930538 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:5748,user=‹roach›] 11205 closing existing connection while server is draining I230406 00:49:04.434071 1710583925 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:25973,user=‹roach›] 11206 closing existing connection while server is draining I230406 00:49:04.434362 1670496510 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:37063,user=‹roach›] 11207 closing existing connection while server is draining I230406 00:49:04.434672 1676347118 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:30670,user=‹roach›] 11208 closing existing connection while server is draining I230406 00:49:04.435034 1751720470 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:16702,user=‹roach›] 11209 closing existing connection while server is draining I230406 00:49:04.435080 1710562261 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:7429,user=‹roach›] 11210 closing existing connection while server is draining I230406 00:49:04.435224 1710581402 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:50993,user=‹roach›] 11211 closing existing connection while server is draining I230406 00:49:04.436083 1726395384 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:4581,user=‹roach›] 11212 closing existing connection while server is draining I230406 00:49:04.437922 1707301115 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.1.227:34854,user=‹roach›] 11213 closing existing connection while server is draining I230406 00:49:04.438671 1733517499 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:21725,user=‹roach›] 11214 closing existing connection while server is draining I230406 00:49:04.440650 1695010557 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:26502,user=‹roach›] 11215 closing existing connection while server is draining I230406 00:49:04.440783 1710597488 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:47754,user=‹roach›] 11216 closing existing connection while server is draining I230406 00:49:04.441033 1713101340 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:47863,user=‹roach›] 11217 closing existing connection while server is draining I230406 00:49:04.442776 1714569160 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:18524,user=‹roach›] 11218 closing existing connection while server is draining I230406 00:49:04.443337 1707558869 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:15117,user=‹roach›] 11219 closing existing connection while server is draining I230406 00:49:04.443907 1710589257 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:40601,user=‹roach›] 11220 closing existing connection while server is draining I230406 00:49:04.445481 1713130202 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:4609,user=‹roach›] 11221 closing existing connection while server is draining I230406 00:49:04.445548 1707576525 1@sql/pgwire/conn.go:601 ⋮ 
[n42,client=10.11.3.64:50742,user=‹roach›] 11222 closing existing connection while server is draining I230406 00:49:04.445696 1707565650 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:18193,user=‹roach›] 11223 closing existing connection while server is draining I230406 00:49:04.446141 1710576786 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:7182,user=‹roach›] 11224 closing existing connection while server is draining I230406 00:49:04.446395 1672695563 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:47479,user=‹roach›] 11225 closing existing connection while server is draining I230406 00:49:04.447280 1713139594 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:8263,user=‹roach›] 11226 closing existing connection while server is draining I230406 00:49:04.448983 1710597068 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:52108,user=‹roach›] 11227 closing existing connection while server is draining I230406 00:49:04.450067 1716931149 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:1662,user=‹roach›] 11228 closing existing connection while server is draining I230406 00:49:04.450163 1707534701 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:35615,user=‹roach›] 11229 closing existing connection while server is draining I230406 00:49:04.451398 1707578378 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:46946,user=‹roach›] 11230 closing existing connection while server is draining I230406 00:49:04.456098 1707546308 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:52862,user=‹roach›] 11231 closing existing connection while server is draining I230406 00:49:04.456329 1746342400 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:19686,user=‹roach›] 11232 closing existing connection while server is draining I230406 00:49:04.459215 1710588029 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:35594,user=‹roach›] 11233 closing existing connection while server is draining I230406 00:49:04.459227 1713136266 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:64259,user=‹roach›] 11234 closing existing connection while server is draining I230406 00:49:04.461078 1710584147 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:12046,user=‹roach›] 11235 closing existing connection while server is draining I230406 00:49:04.461320 1696836674 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:19370,user=‹roach›] 11236 closing existing connection while server is draining I230406 00:49:04.473434 1710557348 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:48805,user=‹roach›] 11237 closing existing connection while server is draining I230406 00:49:04.474666 1713129155 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:40937,user=‹roach›] 11238 closing existing connection while server is draining I230406 00:49:04.474983 1710562764 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:39939,user=‹roach›] 11239 closing existing connection while server is draining I230406 00:49:04.479232 1707529376 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:29506,user=‹roach›] 11240 closing existing connection while server is draining I230406 00:49:04.479944 1679966705 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:29059,user=‹roach›] 11241 closing existing connection while server is draining I230406 00:49:04.482031 1707535039 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:38949,user=‹roach›] 11242 closing existing connection while server is draining I230406 00:49:04.491400 1748104008 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.1.227:55969,user=‹roach›] 11243 closing existing connection while server is draining 
I230406 00:49:04.492896 1713152681 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:22001,user=‹roach›] 11244 closing existing connection while server is draining I230406 00:49:04.492973 1713131459 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:14659,user=‹roach›] 11245 closing existing connection while server is draining I230406 00:49:04.498319 1713123581 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:22457,user=‹roach›] 11246 closing existing connection while server is draining I230406 00:49:04.508639 1707522938 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:50110,user=‹roach›] 11247 closing existing connection while server is draining I230406 00:49:05.247713 1707512513 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:44372,user=‹roach›] 11248 closing existing connection while server is draining I230406 00:49:05.260191 1710586325 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:12296,user=‹roach›] 11249 closing existing connection while server is draining I230406 00:49:05.459229 1700245711 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:41527,user=‹roach›] 11250 closing existing connection while server is draining I230406 00:49:05.459316 1700247135 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:13463,user=‹roach›] 11251 closing existing connection while server is draining I230406 00:49:05.467212 1688973888 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.3.64:3332,user=‹roach›] 11252 closing existing connection while server is draining I230406 00:49:05.473208 1689043432 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:33213,user=‹roach›] 11253 closing existing connection while server is draining I230406 00:49:06.421161 1760735639 1@server/drain.go:399 ⋮ [n42,server drain process] 11254 number of open connections: 1 I230406 00:49:06.964196 1710561960 1@sql/pgwire/conn.go:601 ⋮ [n42,client=10.11.2.233:13999,user=‹roach›] 11255 closing existing connection while server is draining I230406 00:49:09.420939 1760734651 1@cli/start.go:863 ⋮ [n42] 11256 2887 running tasks I230406 00:49:09.420995 1760735639 1@server/drain.go:399 ⋮ [n42,server drain process] 11257 number of open connections: 0 I230406 00:49:12.421015 1760735639 1@server/drain.go:399 ⋮ [n42,server drain process] 11258 number of open connections: 0 I230406 00:49:14.420932 1760734651 1@cli/start.go:863 ⋮ [n42] 11259 2953 running tasks I230406 00:49:15.421000 1760735639 1@server/drain.go:399 ⋮ [n42,server drain process] 11260 number of open connections: 0 I230406 00:49:18.421007 1760735639 1@server/drain.go:399 ⋮ [n42,server drain process] 11261 number of open connections: 0 I230406 00:49:18.495893 1760734650 1@server/drain.go:285 ⋮ [n42,server drain process] 11262 drain remaining: 695 I230406 00:49:18.495966 1760734650 1@server/drain.go:287 ⋮ [n42,server drain process] 11263 drain details: descriptor leases: 40, liveness record: 1, range lease iterations: 595, SQL clients: 58, distSQL execution flows: 1 I230406 00:49:19.420944 1760734651 1@cli/start.go:863 ⋮ [n42] 11264 1327 running tasks I230406 00:49:19.773230 1760734650 1@server/drain.go:285 ⋮ [n42,server drain process] 11265 drain remaining: 0 I230406 00:49:24.420935 1760734651 1@cli/start.go:863 ⋮ [n42] 11266 0 running tasks I230406 00:49:29.420945 1760734651 1@cli/start.go:863 ⋮ [n42] 11267 0 running tasks I230406 00:49:34.420945 1760734651 1@cli/start.go:863 ⋮ [n42] 11268 0 running tasks I230406 00:49:39.420961 1760734651 1@cli/start.go:863 ⋮ [n42] 11269 0 running tasks I230406 00:49:44.421845 1760734651 1@cli/start.go:863 ⋮ [n42] 11270 0 running tasks I230406 
00:49:49.421192 1760734651 1@cli/start.go:863 ⋮ [n42] 11271 0 running tasks I230406 00:49:54.421065 1760734651 1@cli/start.go:863 ⋮ [n42] 11272 0 running tasks I230406 00:49:59.421983 1760734651 1@cli/start.go:863 ⋮ [n42] 11273 0 running tasks I230406 00:50:04.421945 1760734651 1@cli/start.go:863 ⋮ [n42] 11274 0 running tasks I230406 00:50:09.420945 1760734651 1@cli/start.go:863 ⋮ [n42] 11275 0 running tasks I230406 00:50:14.421415 1760734651 1@cli/start.go:863 ⋮ [n42] 11276 0 running tasks I230406 00:50:19.421108 1760734651 1@cli/start.go:863 ⋮ [n42] 11277 0 running tasks I230406 00:50:24.420937 1760734651 1@cli/start.go:863 ⋮ [n42] 11278 0 running tasks I230406 00:50:29.420978 1760734651 1@cli/start.go:863 ⋮ [n42] 11279 0 running tasks I230406 00:50:34.421120 1760734651 1@cli/start.go:863 ⋮ [n42] 11280 0 running tasks I230406 00:50:39.421012 1760734651 1@cli/start.go:863 ⋮ [n42] 11281 0 running tasks I230406 00:50:44.420943 1760734651 1@cli/start.go:863 ⋮ [n42] 11282 0 running tasks I230406 00:50:49.421079 1760734651 1@cli/start.go:863 ⋮ [n42] 11283 0 running tasks I230406 00:50:54.421022 1760734651 1@cli/start.go:863 ⋮ [n42] 11284 0 running tasks I230406 00:50:59.420992 1760734651 1@cli/start.go:863 ⋮ [n42] 11285 0 running tasks I230406 00:51:04.421217 1760734651 1@cli/start.go:863 ⋮ [n42] 11286 0 running tasks I230406 00:51:09.420958 1760734651 1@cli/start.go:863 ⋮ [n42] 11287 0 running tasks I230406 00:51:14.421016 1760734651 1@cli/start.go:863 ⋮ [n42] 11288 0 running tasks I230406 00:51:19.421063 1760734651 1@cli/start.go:863 ⋮ [n42] 11289 0 running tasks I230406 00:51:24.421004 1760734651 1@cli/start.go:863 ⋮ [n42] 11290 0 running tasks I230406 00:51:29.421026 1760734651 1@cli/start.go:863 ⋮ [n42] 11291 0 running tasks I230406 00:51:34.421433 1760734651 1@cli/start.go:863 ⋮ [n42] 11292 0 running tasks I230406 00:51:39.421037 1760734651 1@cli/start.go:863 ⋮ [n42] 11293 0 running tasks I230406 00:51:44.421188 1760734651 1@cli/start.go:863 ⋮ [n42] 11294 0 running tasks I230406 00:51:49.421105 1760734651 1@cli/start.go:863 ⋮ [n42] 11295 0 running tasks I230406 00:51:54.421332 1760734651 1@cli/start.go:863 ⋮ [n42] 11296 0 running tasks I230406 00:51:59.421015 1760734651 1@cli/start.go:863 ⋮ [n42] 11297 0 running tasks I230406 00:52:04.421150 1760734651 1@cli/start.go:863 ⋮ [n42] 11298 0 running tasks I230406 00:52:09.421025 1760734651 1@cli/start.go:863 ⋮ [n42] 11299 0 running tasks I230406 00:52:14.421051 1760734651 1@cli/start.go:863 ⋮ [n42] 11300 0 running tasks I230406 00:52:19.420979 1760734651 1@cli/start.go:863 ⋮ [n42] 11301 0 running tasks I230406 00:52:24.421088 1760734651 1@cli/start.go:863 ⋮ [n42] 11302 0 running tasks I230406 00:52:29.421153 1760734651 1@cli/start.go:863 ⋮ [n42] 11303 0 running tasks I230406 00:52:34.421122 1760734651 1@cli/start.go:863 ⋮ [n42] 11304 0 running tasks I230406 00:52:39.421181 1760734651 1@cli/start.go:863 ⋮ [n42] 11305 0 running tasks I230406 00:52:44.420948 1760734651 1@cli/start.go:863 ⋮ [n42] 11306 0 running tasks I230406 00:52:49.421155 1760734651 1@cli/start.go:863 ⋮ [n42] 11307 0 running tasks I230406 00:52:54.421117 1760734651 1@cli/start.go:863 ⋮ [n42] 11308 0 running tasks I230406 00:52:59.421068 1760734651 1@cli/start.go:863 ⋮ [n42] 11309 0 running tasks I230406 00:53:04.421114 1760734651 1@cli/start.go:863 ⋮ [n42] 11310 0 running tasks I230406 00:53:09.421020 1760734651 1@cli/start.go:863 ⋮ [n42] 11311 0 running tasks I230406 00:53:14.421109 1760734651 1@cli/start.go:863 ⋮ [n42] 11312 0 running tasks I230406 00:53:19.420952 1760734651 
1@cli/start.go:863 ⋮ [n42] 11313 0 running tasks I230406 00:53:24.421049 1760734651 1@cli/start.go:863 ⋮ [n42] 11314 0 running tasks I230406 00:53:29.421029 1760734651 1@cli/start.go:863 ⋮ [n42] 11315 0 running tasks I230406 00:53:34.420972 1760734651 1@cli/start.go:863 ⋮ [n42] 11316 0 running tasks I230406 00:53:39.421139 1760734651 1@cli/start.go:863 ⋮ [n42] 11317 0 running tasks ```

Environment:

Jira issue: CRDB-26899

rafiss commented 1 year ago

@yuzefovich do you know if this is different from your fix in https://github.com/cockroachdb/cockroach/pull/100761?

yuzefovich commented 1 year ago

#100761 requires that distSQL execution flows never go to zero, but it seems like they did in this case, so it's different.

knz commented 1 year ago

If the message "0 running tasks" was displayed, it means that we finished the graceful drain successfully and the stopper saw no tasks remaining, yet it wasn't able to complete the Stop call.

This indicates that one of the closer functions got stuck and couldn't complete.
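
For readers unfamiliar with the pattern being described, here is a minimal, self-contained Go sketch (not CockroachDB's actual `util/stop` implementation; all names are illustrative) of how a single blocked closer keeps the stop call from returning even after the running-task count has reached zero.

```go
// Sketch of a stopper-like shutdown sequence: stop() waits for tasks, then
// runs registered closers. One closer that never returns hangs stop() even
// though a monitor would already be reporting "0 running tasks".
package main

import (
	"fmt"
	"sync"
	"time"
)

type stopper struct {
	tasks   sync.WaitGroup
	closers []func()
}

func (s *stopper) runTask(f func()) {
	s.tasks.Add(1)
	go func() {
		defer s.tasks.Done()
		f()
	}()
}

func (s *stopper) addCloser(f func()) { s.closers = append(s.closers, f) }

func (s *stopper) stop() {
	s.tasks.Wait() // here a monitor would log "0 running tasks"
	for _, c := range s.closers {
		c() // a stuck closer blocks here forever
	}
}

func main() {
	s := &stopper{}
	s.runTask(func() { time.Sleep(time.Second) })
	s.addCloser(func() { select {} }) // hypothetical stuck closer
	done := make(chan struct{})
	go func() { s.stop(); close(done) }()
	select {
	case <-done:
		fmt.Println("stopped")
	case <-time.After(3 * time.Second):
		fmt.Println("stop() still blocked after tasks drained to zero")
	}
}
```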

@kvoli do we still have artifacts from that cluster? Is the problem readily reproducible? Could we get a goroutine dump when the problem occurs?

kvoli commented 1 year ago

> @kvoli do we still have artifacts from that cluster? Is the problem readily reproducible? Could we get a goroutine dump when the problem occurs?

I couldn't repro with the steps above on a 9-node cluster running the same version. The artifacts are available, but we didn't get stack traces from the restarts I looked at. I've sent you a link to the debug zip over slack.
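
For future occurrences, a minimal sketch of one way to grab the goroutine dump asked for above, assuming the stuck node's DB Console HTTP endpoint is still serving and that the standard Go pprof handlers are reachable under `/debug/pprof` (as CockroachDB exposes); the insecure URL, port, and output path below are assumptions for illustration.

```go
// Fetch a full goroutine dump from a node stuck during shutdown and save it
// to a local file. debug=2 returns unaggregated goroutine stacks as text.
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("http://localhost:8080/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutine-dump.txt")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```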