@joeledstrom thanks for finding this out.
Can you help us understand this better: how did you set `application_name` in your client app(s)?
Ah okay, that explains it. I was actually using pgweb to access the cluster when this happened; I didn't know it was related. But it appears it has an empty `application_name`.
Can you confirm the problem disappears if the application_name is set in pgweb?
Also, can you share which command line you used to start pgweb? I'd like to run it with the same arguments.
I used `SET application_name='pgweb'` from within pgweb, but it doesn't really appear to fix it, because I don't think it applies to all of pgweb's connections.
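(For what it's worth, here is a minimal sketch of why a per-session `SET` wouldn't cover everything, assuming pgweb keeps more than one backend connection open: the variable only applies to the session that ran it.)

```sql
-- Rough sketch (my assumption: pgweb keeps several backend connections open):
-- application_name is a per-session variable, so a SET issued from the pgweb
-- query window only changes it for that one session.
SET application_name = 'pgweb';

-- The current session now reports the new value...
SHOW application_name;

-- ...but any other sessions opened by pgweb would still show up with an empty
-- application_name (assuming cluster_sessions exposes an application_name
-- column, which it appears to).
SELECT application_name, count(*)
FROM crdb_internal.cluster_sessions
GROUP BY application_name;
```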
I just use the default sosedoff/pgweb docker image as-is on Kubernetes, with this env variable:

```yaml
- name: DATABASE_URL
  value: postgres://root@cockroach:26257/database?sslmode=disable
```
Thank you. What action inside pgweb caused the cluster to crash?
(I am trying pgweb right now and I want to understand how to reproduce the issue.)
I'm not at my computer right now (went for lunch), but I browsed a few tables in `crdb_internal`, like the cluster and node sessions.
(Also, there has been a dead node in the cluster for a long time, if that makes any difference.)
Thanks.
I found the error.
Repro steps: in pgweb, open `crdb_internal` and click around, cycling between `node_build_info`, `node_metrics`, `node_queries`, `node_runtime_info` and `node_sessions`. The "clicking around" part is important: if one looks at the tables exactly in the order presented and just once, the error does not reproduce (or at least not reliably).
The failing query is alternately one of the following:

```sql
SELECT count(1) FROM crdb_internal.node_queries
SELECT * FROM crdb_internal.node_queries LIMIT 100
```
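For anyone trying to reproduce this from a plain SQL shell instead of pgweb, a rough approximation of the same access pattern (my assumption of roughly what the pgweb UI issues while browsing) would be to cycle through the same virtual tables out of order:

```sql
-- Hypothetical approximation of the pgweb "clicking around" pattern:
-- hit the node_* virtual tables repeatedly and out of order.
SELECT count(1) FROM crdb_internal.node_build_info;
SELECT count(1) FROM crdb_internal.node_metrics;
SELECT count(1) FROM crdb_internal.node_queries;
SELECT count(1) FROM crdb_internal.node_runtime_info;
SELECT count(1) FROM crdb_internal.node_sessions;

-- Then revisit a couple of them, as the UI does when clicking back and forth.
SELECT * FROM crdb_internal.node_queries LIMIT 100;
SELECT * FROM crdb_internal.node_sessions LIMIT 100;
```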
This may (or may not) be related to #31766
cc @andreimatei @jordanlewis we need to look at this together. This will make an entire cluster randomly go down on `SHOW CLUSTER SESSIONS` / `QUERIES`.
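For context, and if I understand the internals correctly (this is an assumption on my part), the SHOW statements below surface the same cluster-wide session/query state as the `crdb_internal` tables, so the crash is reachable from ordinary SQL too:

```sql
-- These should exercise the same code path as querying
-- crdb_internal.cluster_sessions / crdb_internal.cluster_queries directly.
SHOW CLUSTER SESSIONS;
SHOW CLUSTER QUERIES;
```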
@knz didn't you fix this? If not, we should merge #33138.
I have re-checked and this still repros even with #32755, but your PR #33138 does seem to alleviate it. So I agree we should merge it.
All nodes panicked at the same time in a 3-node cluster running on GKE, on persistent volumes.
One node panicked with:
And the other two with:
Could this have been fixed already in rc2?