@joeledstrom thanks for finding this out.
Can you help us understand this better: how did you set `application_name` in your client app(s)?
Ah okay, that explains it. I was actually using pgweb to access the cluster when this happened; I didn't know it was related. But it appears it has an empty `application_name`.
Can you confirm the problem disappears if the application_name is set in pgweb?
Also, can you share which command line you used to start pgweb? I'd like to run it with the same arguments.
I used `SET application_name='pgweb'` from within pgweb, but it doesn't really appear to fix it, because I don't think it applies to all of pgweb's connections.
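(For what it's worth, here is a minimal sketch of why a per-session `SET` wouldn't cover everything, assuming pgweb keeps more than one backend connection open: the variable only applies to the session that ran it.)

```sql
-- Rough sketch (my assumption: pgweb keeps several backend connections open):
-- application_name is a per-session variable, so a SET issued from the pgweb
-- query window only changes it for that one session.
SET application_name = 'pgweb';

-- The current session now reports the new value...
SHOW application_name;

-- ...but any other sessions opened by pgweb would still show up with an empty
-- application_name (assuming cluster_sessions exposes an application_name
-- column, which it appears to).
SELECT application_name, count(*)
FROM crdb_internal.cluster_sessions
GROUP BY application_name;
```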
I just use the default sosedoff/pgweb docker image as-is on Kubernetes, with this env variable:

```yaml
- name: DATABASE_URL
  value: postgres://root@cockroach:26257/database?sslmode=disable
```
Thank you. What action inside pgweb caused the cluster to crash?
(I am trying pgweb right now and I want to understand how to reproduce the issue.)
I'm not at my computer right now (went for lunch), but I browsed a few tables in `crdb_internal`, like the cluster and node sessions.
(Also, there has been a dead node in the cluster for a long time, if that makes any difference.)
Thanks.
I found the error.
Repro steps: in pgweb, open `crdb_internal` and click around, cycling between `node_build_info`, `node_metrics`, `node_queries`, `node_runtime_info` and `node_sessions`. The "clicking around" part is important: if one looks at the tables exactly in the order presented and just once, the error does not reproduce (or at least not reliably).
The failing query is alternately one of the following:

```sql
SELECT count(1) FROM crdb_internal.node_queries
SELECT * FROM crdb_internal.node_queries LIMIT 100
```
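For anyone trying to reproduce this from a plain SQL shell instead of pgweb, a rough approximation of the same access pattern (my assumption of roughly what the pgweb UI issues while browsing) would be to cycle through the same virtual tables out of order:

```sql
-- Hypothetical approximation of the pgweb "clicking around" pattern:
-- hit the node_* virtual tables repeatedly and out of order.
SELECT count(1) FROM crdb_internal.node_build_info;
SELECT count(1) FROM crdb_internal.node_metrics;
SELECT count(1) FROM crdb_internal.node_queries;
SELECT count(1) FROM crdb_internal.node_runtime_info;
SELECT count(1) FROM crdb_internal.node_sessions;

-- Then revisit a couple of them, as the UI does when clicking back and forth.
SELECT * FROM crdb_internal.node_queries LIMIT 100;
SELECT * FROM crdb_internal.node_sessions LIMIT 100;
```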
This may (or may not) be related to #31766
cc @andreimatei @jordanlewis we need to look at this together. This will make an entire cluster randomly go down on `SHOW CLUSTER SESSIONS` / `QUERIES`.
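For context, and if I understand the internals correctly (this is an assumption on my part), the SHOW statements below surface the same cluster-wide session/query state as the `crdb_internal` tables, so the crash is reachable from ordinary SQL too:

```sql
-- These should exercise the same code path as querying
-- crdb_internal.cluster_sessions / crdb_internal.cluster_queries directly.
SHOW CLUSTER SESSIONS;
SHOW CLUSTER QUERIES;
```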
@knz didn't you fix this? If not, we should merge #33138.
I have re-checked and this still repros even with #32755, but your PR #33138 does seem to alleviate it. So I agree we should merge it.
All nodes panicked at the same time in a 3-node cluster running on GKE, on persistent volumes.
One node panicked with:
And the other two with:
Could this have been fixed already in rc2?