cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.03k stars 3.79k forks source link

sql,ui: Statement/Transaction stats system table not availalbe during mixed version state #73825

Closed lucacri closed 2 years ago

lucacri commented 2 years ago

Describe the problem

I just upgraded from 21.1.11 to 21.2.3, and everything is running correctly.

I tried to open the Statements and Transactions page in the Web Console, and I received the error

This page had an unexpected error while loading statements.

So I checked the XHR request and it's giving constant errors like:

Acombined-stmts-by-interval: read-stmt-stats: descriptor not foundAcombined-stmts-by-interval: read-stmt-stats: descriptor not found

I then tried to use the query suggested in the docs ( https://www.cockroachlabs.com/docs/stable/ui-statements-page.html ) in the "Statements statistics"

SELECT
    aggregated_ts,
    fingerprint_id,
    app_name,
    metadata -> 'query' AS statement_text,
    metadata -> 'stmtTyp' AS statement_type,
    metadata -> 'db' AS database_name,
    metadata -> 'distsql' AS is_distsql,
    metadata -> 'fullScan' AS has_full_scan,
    metadata -> 'vec' AS used_vec,
    statistics -> 'execution_statistics' -> 'contentionTime' -> 'mean' AS contention_time_mean,
    statistics -> 'statistics' -> 'cnt' AS execution_count,
    statistics -> 'statistics' -> 'firstAttemptCnt' AS number_first_attempts,
    statistics -> 'statistics' -> 'numRows' -> 'mean' AS number_rows_returned_mean,
    statistics -> 'statistics' -> 'rowsRead' -> 'mean' AS number_rows_read_mean,
    statistics -> 'statistics' -> 'runLat' -> 'mean' AS runtime_latecy_mean,
    sampled_plan
    FROM crdb_internal.statement_statistics;

Which returns an unexpected error:

[2021-12-14 16:28:55] [XX000] ERROR: internal error: read-stmt-stats: descriptor not found
[2021-12-14 16:28:55] Detail: stack trace:
[2021-12-14 16:28:55] /go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/errors.go:84: init()
[2021-12-14 16:28:55] /usr/local/go/src/runtime/proc.go:6309: doInit()
[2021-12-14 16:28:55] /usr/local/go/src/runtime/proc.go:6286: doInit()
[2021-12-14 16:28:55] /usr/local/go/src/runtime/proc.go:6286: doInit()
[2021-12-14 16:28:55] /usr/local/go/src/runtime/proc.go:6286: doInit()
[2021-12-14 16:28:55] /usr/local/go/src/runtime/proc.go:6286: doInit()
[2021-12-14 16:28:55] /usr/local/go/src/runtime/proc.go:6286: doInit()
[2021-12-14 16:28:55] /usr/local/go/src/runtime/proc.go:6286: doInit()
[2021-12-14 16:28:55] /usr/local/go/src/runtime/proc.go:6286: doInit()
[2021-12-14 16:28:55] /usr/local/go/src/runtime/proc.go:6286: doInit()
[2021-12-14 16:28:55] /usr/local/go/src/runtime/proc.go:6286: doInit()
[2021-12-14 16:28:55] /usr/local/go/src/runtime/proc.go:208: main()
[2021-12-14 16:28:55] /usr/local/go/src/runtime/asm_amd64.s:1371: goexit()
[2021-12-14 16:28:55] Hint: You have encountered an unexpected error.
[2021-12-14 16:28:55] Please check the public issue tracker to check whether this problem is
[2021-12-14 16:28:55] already tracked. If you cannot find it there, please report the error
[2021-12-14 16:28:55] with details by creating a new issue.
[2021-12-14 16:28:55] If you would rather not post publicly, please contact us directly
[2021-12-14 16:28:55] using the support form.
[2021-12-14 16:28:55] We appreciate your feedback.

To Reproduce

I think it's happening because of some weird state my cluster might be in, but in any case, I can reproduce it by opening the Statements page and/or running the previous query

Expected behavior Statements page should load

Additional data / screenshots I can provide any data you need, I'm also on slack with the same username

Environment:

Jira issue: CRDB-11765

blathers-crl[bot] commented 2 years ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Azhng commented 2 years ago

Hmm interesting, seems like there's a possibility that the underlying system table didn't get created properly? 🤔 . Do you mind post the result of

show create table system.statement_statistics
RaduBerinde commented 2 years ago

Also, show cluster setting version would be useful (is it possible we are in mixed-version state?)

lucacri commented 2 years ago

That was fast :)

@azhng you might be onto something:

show create table system.statement_statistics;

ERROR: internal error: descriptor not found
SQLSTATE: XX000
DETAIL: stack trace:
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/errors.go:84: init()
/usr/local/go/src/runtime/proc.go:6309: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:208: main()
/usr/local/go/src/runtime/asm_amd64.s:1371: goexit()

HINT: You have encountered an unexpected error.

Please check the public issue tracker to check whether this problem is
already tracked. If you cannot find it there, please report the error
with details by creating a new issue.

If you would rather not post publicly, please contact us directly
using the support form.

We appreciate your feedback.

and @RaduBerinde you are also right, it's showing version 21.1

RaduBerinde commented 2 years ago

Did you use the cluster.preserve_downgrade_option option to prevent auto-finalization of the upgrade? If no, are all the nodes in the cluster running the new version binary?

In any case, we have a bug here - we should be able to operate without error in mixed version state (even if functionality is reduced).

lucacri commented 2 years ago

Yes I did, and I issued the

RESET CLUSTER SETTING cluster.preserve_downgrade_option;

several times. The first time it took just few milliseconds.

All the nodes are upgraded to 21.2.3

--- edit with sql

root@my-cockroachdb-preemptible-public:26257/defaultdb> RESET CLUSTER SETTING cluster.preserve_downgrade_option;
SET CLUSTER SETTING

Time: 11ms total (execution 10ms / network 1ms)

root@my-cockroachdb-preemptible-public:26257/defaultdb> show cluster setting version;
  version
-----------
  21.1
(1 row)

Time: 2ms total (execution 2ms / network 0ms)
Azhng commented 2 years ago

Hmm I see. I think for the short term solution, this problem should go away if the cluster is properly upgraded to 21.2. But this definitely exposed the issue where SQL Obs feature is under-tested in mixed version environment. Unlike in the Cockroach Cloud environment where we have multiple frontend version and load the correct frontend based on the backend version, we assumed that the the DB Console frontend is always talking to the compatible backend. However, this assumption seems like definitely not always true.

For this particular issue, I think the fix should be pretty straightforward, we keep the old code-path in 21.1, and we version gate the API endpoint. If we are in 21.1, we load the in-memory stats (the 21.1 codepath), otherwise, we use the new 21.2 codepath.

However, mixed version state is probably going to break other SQL Obs features. I bet SELECT crdb_internal.reset_sql_stats() is gonna have issues too as well 🤷

cc @maryliag thoughts?

ajwerner commented 2 years ago

@srosenberg this issue may interest you.

lucacri commented 2 years ago

Any way for me to force the upgrade to 21.2 again? This is a production DB so I'm kind of worried being in a middle state

lucacri commented 2 years ago

I let a couple of days pass and it's still reporting as version 21.1 after show cluster setting version;

smcvey commented 2 years ago

What's the output of cockroach node status --decommission? Do you have nodes stuck in a decommissioning state? If you do, you'll have to re-decommission them with cockroach node decommission <node id>.

lucacri commented 2 years ago

I weirdly have my 17 servers and also a list of "NULL" ones that are supposed to be decommissioned:

                                                                                        id |                                                address                                                |                                              sql_address                                              |  build  |         started_at         |         updated_at         |                                   locality                                    | is_available | is_live | gossiped_replicas | is_decommissioning |   membership    | is_draining
-----+-------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------+---------+----------------------------+----------------------------+-------------------------------------------------------------------------------+--------------+---------+-------------------+--------------------+-----------------+--------------
   1 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.806425 | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | false
   2 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.806439 | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | false
   3 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807284 | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | false
   4 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807292 | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | true
   5 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807314 | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | true
   6 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807321 | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | false
   7 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807323 | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | false
   8 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807332 | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | false
   9 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807335 | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | false
  10 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807338 | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | false
  11 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.80734  | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | false
  12 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807345 | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | false
  13 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807348 | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | false
  14 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.80735  | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | true
  15 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807353 | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | false
  16 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807355 | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | false
  17 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807357 | NULL                                                                          | false        | false   | NULL              | true               | decommissioned  | false
  18 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.80736  | NULL                                                                          | false        | false   | NULL              | true               | decommissioning | true
  19 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807363 | NULL                                                                          | false        | false   | NULL              | true               | decommissioning | false
  20 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807365 | NULL                                                                          | false        | false   | NULL              | true               | decommissioning | true
  21 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807369 | NULL                                                                          | false        | false   | NULL              | true               | decommissioning | true
  22 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807371 | NULL                                                                          | false        | false   | NULL              | true               | decommissioning | true
  23 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807374 | NULL                                                                          | false        | false   | NULL              | true               | decommissioning | true
  24 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807376 | NULL                                                                          | false        | false   | NULL              | true               | decommissioning | true
  25 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807379 | NULL                                                                          | false        | false   | NULL              | true               | decommissioning | true
  26 | NULL                                                                                                  | NULL                                                                                                  | NULL    | NULL                       | 2021-12-16 21:45:58.807382 | NULL                                                                          | false        | false   | NULL              | true               | decommissioning | true

(removed the real servers for security)

lucacri commented 2 years ago

I tried to decommission one of the "NULL" manually but it just does nothing important:

[root@cockroachdb-client-secure cockroach]# ./cockroach node decommission 1 --disable-cluster-name-verification --certs-dir=/cockroach-certs --host=my-cockroachdb-preemptible-public
warning: node 1 is already decommissioning or decommissioned

  id | is_live | replicas | is_decommissioning |   membership   | is_draining
-----+---------+----------+--------------------+----------------+--------------
   1 |  false  |        0 |        true        | decommissioned |    false
(1 row)

No more data reported on target nodes. Please verify cluster health before removing the nodes.
smcvey commented 2 years ago

1 was already decommissioned. You want to decommission the ones that say decommissioning. It looks like 18-26.

Running the decommission on those nodes will produce the same output as when you ran it on 1, but it should mark them as decommissioned once finished. Once they're all done, the cluster should finish upgrading.

lucacri commented 2 years ago

Good catch! I decommissioned them all as suggested.

I tried to re-run the ending of the upgrade, but it's in a weird version now:

root@my-cockroachdb-preemptible-public:26257/defaultdb> RESET CLUSTER SETTING cluster.preserve_downgrade_option;
SET CLUSTER SETTING

Time: 12ms total (execution 11ms / network 1ms)

root@my-cockroachdb-preemptible-public:26257/defaultdb> show cluster setting version;
   version
-------------
  21.1-1124
(1 row)

Time: 230ms total (execution 229ms / network 1ms)

before it was saying only 21.1 , now it's 21.1-1124

lucacri commented 2 years ago

I see, it just changed to 21.1-1228. I'm assuming it's doing the migration, and those "versions" are the migration steps

smcvey commented 2 years ago

If you have a look at the show jobs;, you should see events related to the migration. There will be a collection of tasks being performed and it won't be instant, but it does sound like things are finally moving for you.

lucacri commented 2 years ago

We have a winner!

root@my-cockroachdb-preemptible-public:26257/defaultdb> show cluster setting version;
  version
-----------
  21.2
(1 row)

Thank you for your help! My problem is fixed, but I assume you might want to keep this open in order to fix it for other people encountering the same problem?

RaduBerinde commented 2 years ago

Thanks @smcvey! Yes, we should keep this open to track fixing the internal query if we're not at the right version.

maryliag commented 2 years ago

issue for improving the message for this case: https://github.com/cockroachdb/cockroach/issues/77535

Azhng commented 2 years ago

Closed by #79795