cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.07k stars 3.8k forks source link

pgwire: session for cancel key not found #91386

Closed erikgrinaker closed 1 year ago

erikgrinaker commented 1 year ago

Was looking around Telemetry running 22.2.0-beta.5, and saw a handful of these in the logs:

4@sql/pgwire/server.go:937 ⋮ [n25,client=‹10.11.3.67:41642›] 1 unexpected while handling pgwire cancellation request: error from CancelQueryByKeyResponse: ‹session for cancel key 133495544000590100 not found›

Not sure if this indicates a bug, but at the very least, doesn't seem necessary to log it.

Jira issue: CRDB-21241

rafiss commented 1 year ago

The reason this is logged is because the pgwire cancel request is sent over an unauthenticated connection. If we see a lot of messages with the same client IP that are about sending invalid cancellation requests, then it could mean a malicious entity is trying to spam the endpoint and cancel a query.

erikgrinaker commented 1 year ago

I see. Feel free to close this one out if it's working as intended (although the log message could do with a hint).

rafiss commented 1 year ago

Hm, I generally don't think a log message should be in the business of saying "why we are logging this" - it should just say "what happened." But maybe this could be an exception. Note that the comment around the message does say "why we are logging this" since I think that's what comments are intended for. https://github.com/cockroachdb/cockroach/blob/64e13082cede5baaf91c8e0f9d3e2ac37f53fb31/pkg/sql/pgwire/server.go#L895-L903

erikgrinaker commented 1 year ago

Operators don't read comments. :) It isn't clear to me how an operator would interpret that log message as a potential DoS attack. To me, it reads more like an internal assertion failure. We have some prior art for adding hints to log messages, e.g. "node might be overloaded" here (which was added because operators were always confused about what this log message meant):

handle raft ready: 1.6s [applied=4, batches=4, state_assertions=0]; node might be overloaded

Just a thought, feel free to close this out if it's working as intended.

rafiss commented 1 year ago

I'll close this for now, but open to reopening it if we hear about more confusion.

andreimatei commented 1 year ago

I think this message is way too trigger-happy. It's triggered simply by having a sql shell open while restarting the server; the first query sent after the restart is met with the scary text below.

I'm opening for reconsideration. As far as I understand, there's nothing "unexpected" about it, contrary to what the message suggests (let alone the stack trace).

W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147  unexpected while handling pgwire cancellation request: error from CancelQueryByKeyResponse: ‹session for cancel key 7391303052640420 not found›
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +(1) attached stack trace
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +  -- stack trace:
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +  | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*Server).HandleCancel.func1
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +  |      github.com/cockroachdb/cockroach/pkg/sql/pgwire/server.go:830
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +  | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*Server).HandleCancel
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +  |      github.com/cockroachdb/cockroach/pkg/sql/pgwire/server.go:833
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +  | github.com/cockroachdb/cockroach/pkg/server.(*systemServerWrapper).handleCancel
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +  |      github.com/cockroachdb/cockroach/pkg/server/server_controller.go:747
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +  | github.com/cockroachdb/cockroach/pkg/server.(*serverController).sqlMux.func1
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +  |      github.com/cockroachdb/cockroach/pkg/server/server_controller.go:553
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +  |      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +  | runtime.goexit
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +  |      GOROOT/src/runtime/asm_amd64.s:1594
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +Wraps: (2) error from CancelQueryByKeyResponse: ‹session for cancel key 7391303052640420 not found›
erikgrinaker commented 1 year ago

These were a source of confusion in https://github.com/cockroachlabs/support/issues/2063.

rafiss commented 1 year ago

Logging the stack trace is definitely not intentional. It seems like that behavior is only on master and is due to the usage of the %+v format verb in 3ff9bc9d1aebe9285b844a0a14a3d440d62149d2.

Sending a cancel request when the shell connects to a restarted node is also not intentional.

However, I strongly feel that an abundance of unsuccessful cancellation requests is an indicator of malicious activity. @andreimatei do you have suggestions on how the errors should be handled? (Recall that the pgwire cancel protocol occurs over an unauthenticated connection, and does not return any success/error message to the client.) What if we only log if a certain threshold of errors is reached during a time window?

andreimatei commented 1 year ago

What if we only log if a certain threshold of errors is reached during a time window?

That sounds like a good idea to me.