cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

server: Drain operation failed and drained nodes not accepting sql connections, but is treated as active for other operations #130853

Open csgourav opened 2 months ago

csgourav commented 2 months ago

This issue started on drt-ldr. On September 11 at 16:51, a node-kill/sigkill/drain=true operation was started, which does the following two things (see the sketch after the steps):

1. Drain the node:
./cockroach node drain
2. Kill the cockroach process:
kill -9 <crdb process id>
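
For reference, a minimal Go sketch of what those two steps amount to; drainAndKill, the omitted connection flags, and the hard-coded pid are illustrative assumptions, not the actual operation code:

package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// drainAndKill is a hypothetical helper mirroring the operation's two steps.
func drainAndKill(pid int) error {
	// Step 1: gracefully drain the node. A real invocation would also pass
	// connection flags (host, certs) for the target node.
	if out, err := exec.Command("./cockroach", "node", "drain").CombinedOutput(); err != nil {
		return fmt.Errorf("drain failed: %v: %s", err, out)
	}
	// Step 2: hard-kill the cockroach process, equivalent to `kill -9 <pid>`.
	return syscall.Kill(pid, syscall.SIGKILL)
}

func main() {
	// The pid would be discovered on the target host; 12345 is a placeholder.
	if err := drainAndKill(12345); err != nil {
		fmt.Println(err)
	}
}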

The operation failed in the middle of the drain step and could not run the operation cleanup step, which resulted in the server not accepting SQL client connections. Datadog logs for the operation failure:

Sep 11 16:51:55.331 drt-ldr1-0003 drt-cockroachdb drain failed: some sessions did not respond to cancellation within 1s
Sep 11 17:27:13.272 drt-ldr1-0001 drt-cockroachdb drain failed: some sessions did not respond to cancellation within 1s

Two nodes were affected, drt-ldr1-0001 and drt-ldr1-0003, which were not accepting SQL clients. Nodes drt-ldr1-0002, drt-ldr1-0004, and drt-ldr1-0005 are working and accepting client connections.
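
To confirm which nodes still accept SQL clients, each node's SQL address can be pinged directly. A minimal sketch, assuming placeholder addresses, the default port 26257, and an insecure connection string:

package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq" // registers the "postgres" driver
)

func main() {
	// Placeholder addresses; the real nodes sit behind their own hosts/ports.
	nodes := []string{"drt-ldr1-0001:26257", "drt-ldr1-0002:26257", "drt-ldr1-0003:26257"}
	for _, addr := range nodes {
		dsn := fmt.Sprintf("postgresql://root@%s/defaultdb?sslmode=disable", addr)
		db, err := sql.Open("postgres", dsn)
		if err != nil {
			fmt.Printf("%s: open failed: %v\n", addr, err)
			continue
		}
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		if err := db.PingContext(ctx); err != nil {
			fmt.Printf("%s: NOT accepting SQL connections: %v\n", addr, err)
		} else {
			fmt.Printf("%s: accepting SQL connections\n", addr)
		}
		cancel()
		db.Close()
	}
}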

More details in slack thread [link]

Jira issue: CRDB-42266

blathers-crl[bot] commented 2 months ago

Hi @csgourav, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

shailendra-patel commented 1 month ago

On drt-chaos we have been running node-kill/drain=true for a few months now. In the past 2-3 weeks we have been observing the node drain operation failing intermittently with the error below:

drain failed: some sessions did not respond to cancellation within 1s

Observations after this drain failure:

  1. The node stops accepting SQL client connections.
  2. In this state, a partially drained node will not accept SQL connections. However, the DB Console and dsp.PartitionSpans are unaware of the partially drained node because they rely on gossip. This means the DB Console will report a healthy node and PartitionSpans will continue planning work on the partially drained node, which cannot accept SQL connections, resulting in failures of other subsystems such as LDR.

    Recently we also saw an issue with draining on drt-chaos where the node was stuck in the graceful shutdown state forever, with the log messages below:

I240919 06:44:20.006940 72997290 1@cli/start.go:1056 ⋮ [T1,Vsystem,n8] 71165  1 running tasks
I240919 06:44:25.007418 72997290 1@cli/start.go:1056 ⋮ [T1,Vsystem,n8] 71166  1 running tasks
I240919 06:44:30.007014 72997290 1@cli/start.go:1056 ⋮ [T1,Vsystem,n8] 71167  1 running tasks

This is from waitForShutdown() in cli/start.go -- it seems like the node is stuck in shutdown.
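
Those lines repeat every five seconds, which matches a wait loop that reports outstanding tasks until they finish. A minimal sketch of that pattern, assuming a simple task counter (this is not the actual cli/start.go implementation):

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// waitForShutdown sketches the wait loop: report the number of running
// tasks every 5 seconds until they all finish. If a task never completes,
// the loop logs forever, matching the repeated "1 running tasks" lines.
func waitForShutdown(running *atomic.Int64, done <-chan struct{}) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			fmt.Printf("%d running tasks\n", running.Load())
		}
	}
}

func main() {
	var running atomic.Int64
	running.Add(1) // a task that never finishes
	done := make(chan struct{})
	waitForShutdown(&running, done) // blocks, logging every 5 seconds
}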

A similar issue on drt-ldr paused LDR jobs, which resulted in stopped replication and a GC pause. This is triaged in detail in this slack thread. @stevendanna has debugged this and raised the following questions, which I think should be looked into:

  1. The drain failed because of a 1 second timeout. Is this one second timeout reasonable? If so, which process was taking longer than this reasonable 1 second and does it need to be fixed? If not, should it be increased or made configurable?
  2. The drain failed in the middle. Is the state the server was left in reasonable? To me, it felt a bit unreasonable, and it seems like either:
      1. The new SQL-instances-based planning sets the draining flag much earlier. In some sense that would have prevented this problem, but perhaps it introduces other problems with respect to in-flight requests that are running DistSQL. If we end up unifying the SQL-based and gossip-based planning, this may represent a behaviour change we need to think about.
      2. This was caught by LDR because LDR (and PCR) somewhat violate an assumption of the DistSQL physical planner: namely, they use the physical plan for non-DistSQL work, i.e. making new SQL connections. I wonder if we want some new flag/option to make it clear that the plan requires nodes to be available via SQL, if we end up making other changes because of 2 & 3 (see the sketch after this list).
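
To make the flag/option idea concrete, here is a purely hypothetical sketch; planRequest, nodeInfo, and filterCandidates do not exist in the DistSQL planner and only illustrate that plans which open new SQL connections need a stricter node filter than gossip liveness:

package main

import "fmt"

type nodeInfo struct {
	id           int
	liveByGossip bool // what gossip-based planning sees today
	acceptsSQL   bool // what LDR/PCR actually need
}

type planRequest struct {
	// requireSQLConn would signal that the resulting plan opens new SQL
	// connections on the chosen nodes (as LDR and PCR do), rather than
	// only running DistSQL flows.
	requireSQLConn bool
}

func filterCandidates(req planRequest, nodes []nodeInfo) []nodeInfo {
	var out []nodeInfo
	for _, n := range nodes {
		if !n.liveByGossip {
			continue
		}
		if req.requireSQLConn && !n.acceptsSQL {
			// A partially drained node is live per gossip but rejects SQL
			// clients; skip it for plans that need SQL connections.
			continue
		}
		out = append(out, n)
	}
	return out
}

func main() {
	nodes := []nodeInfo{
		{id: 1, liveByGossip: true, acceptsSQL: false}, // partially drained
		{id: 2, liveByGossip: true, acceptsSQL: true},
	}
	fmt.Println(filterCandidates(planRequest{requireSQLConn: true}, nodes))
}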

Why we think this is a P1 issue:

  1. It is stopping us from running node-kill with drain operations on drt-clusters.
  2. This problem is likely to occur in production as well with drain.

stevendanna commented 1 month ago

It is stopping us from running node-kill with drain operations on drt-clusters.

In the short term, we could try adding a single retry into the drain step of the node-kill operation as an experiment?
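
A minimal sketch of that experiment, assuming the drain step is exposed as a plain `drain func() error` (the actual node-kill operation code is not reproduced here):

package main

import (
	"errors"
	"fmt"
	"time"
)

// drainWithRetry runs drain and, on failure, waits briefly and tries once
// more. A second failure is still reported so the operation can clean up.
func drainWithRetry(drain func() error, pause time.Duration) error {
	firstErr := drain()
	if firstErr == nil {
		return nil
	}
	fmt.Printf("first drain attempt failed, retrying once: %v\n", firstErr)
	time.Sleep(pause)
	if err := drain(); err != nil {
		return errors.Join(firstErr, err)
	}
	return nil
}

func main() {
	// Simulated drain that fails only on the first attempt, mimicking the
	// intermittent session-cancellation timeout seen on drt clusters.
	attempts := 0
	drain := func() error {
		attempts++
		if attempts == 1 {
			return errors.New("drain failed: some sessions did not respond to cancellation within 1s")
		}
		return nil
	}
	if err := drainWithRetry(drain, time.Second); err != nil {
		fmt.Println("drain still failing:", err)
	}
}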