Closed by erikgrinaker 1 year ago
This is logged because the pgwire cancel request is sent over an unauthenticated connection. If we see many messages with the same client IP about invalid cancellation requests, it could mean a malicious entity is trying to spam the endpoint and cancel a query.
I see. Feel free to close this one out if it's working as intended (although the log message could do with a hint).
Hm, I generally don't think a log message should be in the business of saying "why we are logging this" - it should just say "what happened." But maybe this could be an exception. Note that the comment around the message does say "why we are logging this" since I think that's what comments are intended for. https://github.com/cockroachdb/cockroach/blob/64e13082cede5baaf91c8e0f9d3e2ac37f53fb31/pkg/sql/pgwire/server.go#L895-L903
Operators don't read comments. :) It isn't clear to me how an operator would interpret that log message as a potential DoS attack. To me, it reads more like an internal assertion failure. We have some prior art for adding hints to log messages, e.g. "node might be overloaded" here (which was added because operators were always confused about what this log message meant):
handle raft ready: 1.6s [applied=4, batches=4, state_assertions=0]; node might be overloaded
Just a thought, feel free to close this out if it's working as intended.
I'll close this for now, but open to reopening it if we hear about more confusion.
I think this message is way too trigger-happy. It's triggered simply by having a SQL shell open while restarting the server; the first query sent after the restart is met with the scary text below.
I'm opening for reconsideration. As far as I understand, there's nothing "unexpected" about it, contrary to what the message suggests (let alone the stack trace).
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 unexpected while handling pgwire cancellation request: error from CancelQueryByKeyResponse: ‹session for cancel key 7391303052640420 not found›
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +(1) attached stack trace
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 + -- stack trace:
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 + | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*Server).HandleCancel.func1
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 + | github.com/cockroachdb/cockroach/pkg/sql/pgwire/server.go:830
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 + | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*Server).HandleCancel
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 + | github.com/cockroachdb/cockroach/pkg/sql/pgwire/server.go:833
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 + | github.com/cockroachdb/cockroach/pkg/server.(*systemServerWrapper).handleCancel
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 + | github.com/cockroachdb/cockroach/pkg/server/server_controller.go:747
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 + | github.com/cockroachdb/cockroach/pkg/server.(*serverController).sqlMux.func1
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 + | github.com/cockroachdb/cockroach/pkg/server/server_controller.go:553
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 + | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 + | github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 + | runtime.goexit
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 + | GOROOT/src/runtime/asm_amd64.s:1594
W230123 18:00:38.093148 1279 4@server/server_controller.go:573 ⋮ [T1,n1,client=‹127.0.0.1:51116›] 147 +Wraps: (2) error from CancelQueryByKeyResponse: ‹session for cancel key 7391303052640420 not found›
These were a source of confusion in https://github.com/cockroachlabs/support/issues/2063.
Logging the stack trace is definitely not intentional. It seems like that behavior is only on master and is due to the usage of the %+v format verb in 3ff9bc9d1aebe9285b844a0a14a3d440d62149d2.
Sending a cancel request when the shell connects to a restarted node is also not intentional.
However, I strongly feel that an abundance of unsuccessful cancellation requests is an indicator of malicious activity. @andreimatei do you have suggestions on how the errors should be handled? (Recall that the pgwire cancel protocol occurs over an unauthenticated connection, and does not return any success/error message to the client.) What if we only log if a certain threshold of errors is reached during a time window?
What if we only log if a certain threshold of errors is reached during a time window?
That sounds like a good idea to me.
Was looking around Telemetry running 22.2.0-beta.5, and saw a handful of these in the logs:
Not sure if this indicates a bug, but at the very least, it doesn't seem necessary to log it.
Jira issue: CRDB-21241