Open csgourav opened 2 months ago
On drt-chaos we have been running node-kill/drain=true for a few months now. Over the past 2-3 weeks we have observed the node drain operation failing intermittently with the error below:
drain failed: some sessions did not respond to cancellation within 1s
Observations after this drain failure:
In this state, a partially drained node will not accept SQL connections. However, the dbconsole and dsp.PartitionSpans are unaware of the partially drained node because they rely on gossip. As a result, the dbconsole reports a healthy node and PartitionSpans continues planning work on the partially drained node, which cannot accept SQL connections, resulting in failures in other subsystems such as LDR.
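The mismatch described above can be illustrated with a toy model. This is only a sketch and not CockroachDB code: `NodeView`, `gossip_only_targets`, and `unreachable_targets` are hypothetical names, and the node IDs are made up. It shows how a planner that looks only at the gossip-derived health flag keeps selecting nodes that no longer accept SQL connections.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NodeView:
    """Hypothetical view of one node (not a real CockroachDB type)."""
    node_id: int
    healthy_in_gossip: bool  # what a gossip-based consumer sees
    accepts_sql: bool        # whether the node actually accepts SQL connections


def gossip_only_targets(nodes: List[NodeView]) -> List[int]:
    """Nodes a gossip-only planner would schedule work on."""
    return [n.node_id for n in nodes if n.healthy_in_gossip]


def unreachable_targets(nodes: List[NodeView]) -> List[int]:
    """Nodes that look healthy in gossip but cannot accept SQL connections."""
    return [n.node_id for n in nodes if n.healthy_in_gossip and not n.accepts_sql]


if __name__ == "__main__":
    # Assumed scenario: nodes 1 and 3 are partially drained.
    cluster = [
        NodeView(1, healthy_in_gossip=True, accepts_sql=False),
        NodeView(2, healthy_in_gossip=True, accepts_sql=True),
        NodeView(3, healthy_in_gossip=True, accepts_sql=False),
    ]
    print("planned on:", gossip_only_targets(cluster))
    print("will fail on:", unreachable_targets(cluster))
```

Any work planned on a node in the second list would fail when it tries to open a SQL connection, which matches the LDR failures reported here.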
Recently we also saw a draining issue on drt-chaos where the node is stuck in the graceful shutdown state indefinitely, logging the messages below:
I240919 06:44:20.006940 72997290 1@cli/start.go:1056 ⋮ [T1,Vsystem,n8] 71165 1 running tasks
I240919 06:44:25.007418 72997290 1@cli/start.go:1056 ⋮ [T1,Vsystem,n8] 71166 1 running tasks
I240919 06:44:30.007014 72997290 1@cli/start.go:1056 ⋮ [T1,Vsystem,n8] 71167 1 running tasks
This is from waitForShutdown() in cli/start.go; it seems the node is stuck in shutdown.
A similar issue on drt-ldr paused LDR jobs, which stopped replication and caused a GC pause. This is triaged in detail in this Slack thread. @stevendanna has debugged this and raised the following question, which I think should be looked into:
Why we think this is a P1 issue: it is preventing us from running node-kill with drain operations on drt-clusters.
In the short term, could we try adding a single retry to the drain step of the node-kill operation as an experiment?
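A minimal sketch of what such a single retry could look like, assuming the drain step can be wrapped in a callable that raises on failure. `DrainError`, `drain_with_retry`, and the `drain` callable are hypothetical names for illustration and are not part of the actual node-kill operation code; the backoff value is likewise an assumption.

```python
import time
from typing import Callable


class DrainError(Exception):
    """Hypothetical error raised when one drain attempt fails."""


def drain_with_retry(
    drain: Callable[[], None],
    retries: int = 1,
    backoff_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call drain(), retrying up to `retries` extra times on DrainError.

    Returns the number of attempts made. Re-raises the last DrainError
    if every attempt fails, so the caller can still run cleanup.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            drain()
            return attempts
        except DrainError:
            if attempts > retries:
                raise
            # Give stuck sessions some time before the next attempt.
            sleep(backoff_seconds)
```

With `retries=1` a transient "some sessions did not respond to cancellation within 1s" failure gets one more attempt, while a persistent failure still surfaces to the caller, which should then run the cleanup step so the node does not stay partially drained.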
This issue started on drt-ldr on September 11 at 16:51, when a node-kill/sigkill/drain=true operation was started, which does the following two things:
The operation failed in the middle of the drain and could not run its cleanup step, which left the server not accepting SQL client connections. Datadog logs for the operation failure:
Sep 11 16:51:55.331 drt-ldr1-0003 drt-cockroachdb drain failed: some sessions did not respond to cancellation within 1s
Sep 11 17:27:13.272 drt-ldr1-0001 drt-cockroachdb drain failed: some sessions did not respond to cancellation within 1s
Two nodes were affected, drt-ldr1-0001 and drt-ldr1-0003, and were not accepting SQL clients. Nodes drt-ldr1-0002, drt-ldr1-0004, and drt-ldr1-0005 are working and accepting client connections. More details in Slack thread [link]
Jira issue: CRDB-42266