cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

server: Drain operation failed and drained nodes not accepting sql connections, but is treated as active for other operations #130853

Open csgourav opened 2 months ago

csgourav commented 2 months ago

This issue started on drt-ldr. On September 11 at 16:51, a node-kill/sigkill/drain=true operation was started, which does the following two things (see the sketch after the steps):

1. Drain the node:
./cockroach node drain
2. Kill the cockroach process:
kill -9 <crdb process id>
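
For reference, a minimal Go sketch of what those two steps amount to; drainAndKill, the omitted connection flags, and the hard-coded pid are illustrative assumptions, not the actual operation code:

package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// drainAndKill is a hypothetical helper mirroring the operation's two steps.
func drainAndKill(pid int) error {
	// Step 1: gracefully drain the node. A real invocation would also pass
	// connection flags (host, certs) for the target node.
	if out, err := exec.Command("./cockroach", "node", "drain").CombinedOutput(); err != nil {
		return fmt.Errorf("drain failed: %v: %s", err, out)
	}
	// Step 2: hard-kill the cockroach process, equivalent to `kill -9 <pid>`.
	return syscall.Kill(pid, syscall.SIGKILL)
}

func main() {
	// The pid would be discovered on the target host; 12345 is a placeholder.
	if err := drainAndKill(12345); err != nil {
		fmt.Println(err)
	}
}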

The operation failed in the middle of the drain step and could not run the operation cleanup step, which resulted in the server not accepting SQL client connections. Datadog logs for the operation failure:

Sep 11 16:51:55.331 drt-ldr1-0003 drt-cockroachdb drain failed: some sessions did not respond to cancellation within 1s
Sep 11 17:27:13.272 drt-ldr1-0001 drt-cockroachdb drain failed: some sessions did not respond to cancellation within 1s

Two nodes were affected, drt-ldr1-0001 and drt-ldr1-0003, which were not accepting SQL clients. Nodes drt-ldr1-0002, drt-ldr1-0004, and drt-ldr1-0005 are working and accepting client connections.
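
To confirm which nodes still accept SQL clients, each node's SQL address can be pinged directly. A minimal sketch, assuming placeholder addresses, the default port 26257, and an insecure connection string:

package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq" // registers the "postgres" driver
)

func main() {
	// Placeholder addresses; the real nodes sit behind their own hosts/ports.
	nodes := []string{"drt-ldr1-0001:26257", "drt-ldr1-0002:26257", "drt-ldr1-0003:26257"}
	for _, addr := range nodes {
		dsn := fmt.Sprintf("postgresql://root@%s/defaultdb?sslmode=disable", addr)
		db, err := sql.Open("postgres", dsn)
		if err != nil {
			fmt.Printf("%s: open failed: %v\n", addr, err)
			continue
		}
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		if err := db.PingContext(ctx); err != nil {
			fmt.Printf("%s: NOT accepting SQL connections: %v\n", addr, err)
		} else {
			fmt.Printf("%s: accepting SQL connections\n", addr)
		}
		cancel()
		db.Close()
	}
}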

More details in slack thread [link]

Jira issue: CRDB-42266

blathers-crl[bot] commented 2 months ago

Hi @csgourav, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

shailendra-patel commented 1 month ago

On drt-chaos we have been running node-kill/drain=true for a few months now. In the past 2-3 weeks we have been observing the node drain operation failing intermittently with the error below:

drain failed: some sessions did not respond to cancellation within 1s

Observations after this drain failure:

  1. The node stops accepting SQL client connections.
  2. In this state, a partially drained node will not accept SQL connections. However, the DB Console and dsp.PartitionSpans are unaware of the partially drained node because they rely on gossip. This means the DB Console will report a healthy node and PartitionSpans will continue planning work on the partially drained node, which cannot accept SQL connections, resulting in failures of other subsystems such as LDR.

    Recently we also saw an issue with draining on drt-chaos where the node was stuck in the graceful shutdown state forever, with the log messages below:

I240919 06:44:20.006940 72997290 1@cli/start.go:1056 ⋮ [T1,Vsystem,n8] 71165  1 running tasks
I240919 06:44:25.007418 72997290 1@cli/start.go:1056 ⋮ [T1,Vsystem,n8] 71166  1 running tasks
I240919 06:44:30.007014 72997290 1@cli/start.go:1056 ⋮ [T1,Vsystem,n8] 71167  1 running tasks

This is from waitForShutdown() in cli/start.go -- it seems like the node is stuck in shutdown.
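
Those lines repeat every five seconds, which matches a wait loop that reports outstanding tasks until they finish. A minimal sketch of that pattern, assuming a simple task counter (this is not the actual cli/start.go implementation):

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// waitForShutdown sketches the wait loop: report the number of running
// tasks every 5 seconds until they all finish. If a task never completes,
// the loop logs forever, matching the repeated "1 running tasks" lines.
func waitForShutdown(running *atomic.Int64, done <-chan struct{}) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			fmt.Printf("%d running tasks\n", running.Load())
		}
	}
}

func main() {
	var running atomic.Int64
	running.Add(1) // a task that never finishes
	done := make(chan struct{})
	waitForShutdown(&running, done) // blocks, logging every 5 seconds
}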

A similar issue on drt-ldr paused LDR jobs, which resulted in stopped replication and a GC pause. This is triaged in detail in this slack thread. @stevendanna has debugged this and raised the following questions, which I think should be looked into:

  1. The drain failed because of a 1 second timeout. Is this one second timeout reasonable? If so, which process was taking longer than this reasonable 1 second and does it need to be fixed? If not, should it be increased or made configurable?
  2. The drain failed in the middle. Is the state the server was left in reasonable? To me, it felt a bit unreasonable, and it seems like either:
      1. The new SQL-instances-based planning sets the draining flag much earlier. In some sense that would have prevented this problem, but perhaps it introduces other problems with respect to in-flight requests that are running DistSQL. If we end up unifying the SQL-based and gossip-based planning, this may represent a behaviour change we need to think about.
      2. This was caught by LDR because LDR (and PCR) somewhat violate an assumption of the DistSQL physical planner: namely, they use the physical plan for non-DistSQL work, i.e. making new SQL connections. I wonder if we want some new flag/option to make it clear that the plan requires nodes to be available via SQL, if we end up making other changes because of 2 & 3 (see the sketch after this list).
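
To make the flag/option idea concrete, here is a purely hypothetical sketch; planRequest, nodeInfo, and filterCandidates do not exist in the DistSQL planner and only illustrate that plans which open new SQL connections need a stricter node filter than gossip liveness:

package main

import "fmt"

type nodeInfo struct {
	id           int
	liveByGossip bool // what gossip-based planning sees today
	acceptsSQL   bool // what LDR/PCR actually need
}

type planRequest struct {
	// requireSQLConn would signal that the resulting plan opens new SQL
	// connections on the chosen nodes (as LDR and PCR do), rather than
	// only running DistSQL flows.
	requireSQLConn bool
}

func filterCandidates(req planRequest, nodes []nodeInfo) []nodeInfo {
	var out []nodeInfo
	for _, n := range nodes {
		if !n.liveByGossip {
			continue
		}
		if req.requireSQLConn && !n.acceptsSQL {
			// A partially drained node is live per gossip but rejects SQL
			// clients; skip it for plans that need SQL connections.
			continue
		}
		out = append(out, n)
	}
	return out
}

func main() {
	nodes := []nodeInfo{
		{id: 1, liveByGossip: true, acceptsSQL: false}, // partially drained
		{id: 2, liveByGossip: true, acceptsSQL: true},
	}
	fmt.Println(filterCandidates(planRequest{requireSQLConn: true}, nodes))
}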

Why we think this is a P1 issue:

  1. It is stopping us from running node-kill with drain operations on drt-clusters.
  2. This problem is likely to occur in production as well with drain.

stevendanna commented 1 month ago

It is stopping us from running node-kill with drain operations on drt-clusters.

In the short term, we could try adding a single retry into the drain step of the node-kill operation as an experiment?
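
A minimal sketch of that experiment, assuming the drain step is exposed as a plain `drain func() error` (the actual node-kill operation code is not reproduced here):

package main

import (
	"errors"
	"fmt"
	"time"
)

// drainWithRetry runs drain and, on failure, waits briefly and tries once
// more. A second failure is still reported so the operation can clean up.
func drainWithRetry(drain func() error, pause time.Duration) error {
	firstErr := drain()
	if firstErr == nil {
		return nil
	}
	fmt.Printf("first drain attempt failed, retrying once: %v\n", firstErr)
	time.Sleep(pause)
	if err := drain(); err != nil {
		return errors.Join(firstErr, err)
	}
	return nil
}

func main() {
	// Simulated drain that fails only on the first attempt, mimicking the
	// intermittent session-cancellation timeout seen on drt clusters.
	attempts := 0
	drain := func() error {
		attempts++
		if attempts == 1 {
			return errors.New("drain failed: some sessions did not respond to cancellation within 1s")
		}
		return nil
	}
	if err := drainWithRetry(drain, time.Second); err != nil {
		fmt.Println("drain still failing:", err)
	}
}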