Hello, I am Blathers. I am here to help you get the issue triaged.
Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.
I have CC'd a few people who may be able to assist you.
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Hmm, interesting. It seems like there's a possibility that the underlying system table didn't get created properly? 🤔 Do you mind posting the result of
show create table system.statement_statistics
Also, the output of show cluster setting version
would be useful (is it possible we are in a mixed-version state?)
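Both checks can be run from a SQL shell connected to any node, e.g.:

SHOW CREATE TABLE system.statement_statistics;
SHOW CLUSTER SETTING version;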
That was fast :)
@azhng you might be onto something:
show create table system.statement_statistics;
ERROR: internal error: descriptor not found
SQLSTATE: XX000
DETAIL: stack trace:
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/errors.go:84: init()
/usr/local/go/src/runtime/proc.go:6309: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:208: main()
/usr/local/go/src/runtime/asm_amd64.s:1371: goexit()
HINT: You have encountered an unexpected error.
Please check the public issue tracker to check whether this problem is
already tracked. If you cannot find it there, please report the error
with details by creating a new issue.
If you would rather not post publicly, please contact us directly
using the support form.
We appreciate your feedback.
and @RaduBerinde you are also right, it's showing version 21.1
Did you use the cluster.preserve_downgrade_option
cluster setting to prevent auto-finalization of the upgrade? If not, are all the nodes in the cluster running the new version binary?
In any case, we have a bug here - we should be able to operate without error in a mixed-version state (even if functionality is reduced).
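For reference, you can check whether the option is still set, and then clear it, with:

SHOW CLUSTER SETTING cluster.preserve_downgrade_option;
RESET CLUSTER SETTING cluster.preserve_downgrade_option;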
Yes I did, and I issued the
RESET CLUSTER SETTING cluster.preserve_downgrade_option;
several times. The first time, it took just a few milliseconds.
All the nodes are upgraded to 21.2.3
--- edit with SQL output:
root@my-cockroachdb-preemptible-public:26257/defaultdb> RESET CLUSTER SETTING cluster.preserve_downgrade_option;
SET CLUSTER SETTING
Time: 11ms total (execution 10ms / network 1ms)
root@my-cockroachdb-preemptible-public:26257/defaultdb> show cluster setting version;
version
-----------
21.1
(1 row)
Time: 2ms total (execution 2ms / network 0ms)
Hmm, I see. As a short-term solution, this problem should go away once the cluster is properly upgraded to 21.2. But this definitely exposed the issue that the SQL Observability features are under-tested in mixed-version environments. Unlike in the Cockroach Cloud environment, where we have multiple frontend versions and load the correct frontend based on the backend version, we assumed that the DB Console frontend is always talking to a compatible backend. That assumption is definitely not always true.
For this particular issue, I think the fix should be pretty straightforward: we keep the old 21.1 code path and version-gate the API endpoint. If we are still at 21.1, we load the in-memory stats (the 21.1 code path); otherwise, we use the new 21.2 code path.
However, the mixed-version state is probably going to break other SQL Observability features. I bet SELECT crdb_internal.reset_sql_stats()
is going to have issues as well 🤷
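To make the two code paths concrete, these are roughly the SQL-level equivalents of the data sources such a version gate would choose between (a sketch only; the actual endpoint lives in the status server):

-- 21.1-style: per-node in-memory statistics
SELECT * FROM crdb_internal.node_statement_statistics LIMIT 1;
-- 21.2-style: statistics backed by system.statement_statistics
SELECT * FROM crdb_internal.statement_statistics LIMIT 1;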
cc @maryliag thoughts?
@srosenberg this issue may interest you.
Is there any way for me to force the upgrade to 21.2 again? This is a production DB, so I'm kind of worried about being stuck in an intermediate state.
I let a couple of days pass and it's still reporting version 21.1 after show cluster setting version;
What's the output of cockroach node status --decommission? Do you have nodes stuck in a decommissioning state? If you do, you'll have to re-decommission them with cockroach node decommission <node id>.
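If a SQL shell is more convenient than the CLI, roughly the same information is visible through the crdb_internal.gossip_liveness virtual table (a sketch, assuming that table is available in your version):

SELECT node_id, draining, decommissioning, membership
FROM crdb_internal.gossip_liveness
ORDER BY node_id;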
Weirdly, I have my 17 servers and also a list of "NULL" ones that are supposed to be decommissioned:
id | address | sql_address | build | started_at | updated_at | locality | is_available | is_live | gossiped_replicas | is_decommissioning | membership | is_draining
-----+-------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------+---------+----------------------------+----------------------------+-------------------------------------------------------------------------------+--------------+---------+-------------------+--------------------+-----------------+--------------
1 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.806425 | NULL | false | false | NULL | true | decommissioned | false
2 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.806439 | NULL | false | false | NULL | true | decommissioned | false
3 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807284 | NULL | false | false | NULL | true | decommissioned | false
4 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807292 | NULL | false | false | NULL | true | decommissioned | true
5 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807314 | NULL | false | false | NULL | true | decommissioned | true
6 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807321 | NULL | false | false | NULL | true | decommissioned | false
7 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807323 | NULL | false | false | NULL | true | decommissioned | false
8 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807332 | NULL | false | false | NULL | true | decommissioned | false
9 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807335 | NULL | false | false | NULL | true | decommissioned | false
10 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807338 | NULL | false | false | NULL | true | decommissioned | false
11 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.80734 | NULL | false | false | NULL | true | decommissioned | false
12 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807345 | NULL | false | false | NULL | true | decommissioned | false
13 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807348 | NULL | false | false | NULL | true | decommissioned | false
14 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.80735 | NULL | false | false | NULL | true | decommissioned | true
15 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807353 | NULL | false | false | NULL | true | decommissioned | false
16 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807355 | NULL | false | false | NULL | true | decommissioned | false
17 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807357 | NULL | false | false | NULL | true | decommissioned | false
18 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.80736 | NULL | false | false | NULL | true | decommissioning | true
19 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807363 | NULL | false | false | NULL | true | decommissioning | false
20 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807365 | NULL | false | false | NULL | true | decommissioning | true
21 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807369 | NULL | false | false | NULL | true | decommissioning | true
22 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807371 | NULL | false | false | NULL | true | decommissioning | true
23 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807374 | NULL | false | false | NULL | true | decommissioning | true
24 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807376 | NULL | false | false | NULL | true | decommissioning | true
25 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807379 | NULL | false | false | NULL | true | decommissioning | true
26 | NULL | NULL | NULL | NULL | 2021-12-16 21:45:58.807382 | NULL | false | false | NULL | true | decommissioning | true
(removed the real servers for security)
I tried to decommission one of the "NULL" ones manually, but it doesn't seem to do anything meaningful:
[root@cockroachdb-client-secure cockroach]# ./cockroach node decommission 1 --disable-cluster-name-verification --certs-dir=/cockroach-certs --host=my-cockroachdb-preemptible-public
warning: node 1 is already decommissioning or decommissioned
id | is_live | replicas | is_decommissioning | membership | is_draining
-----+---------+----------+--------------------+----------------+--------------
1 | false | 0 | true | decommissioned | false
(1 row)
No more data reported on target nodes. Please verify cluster health before removing the nodes.
1 was already decommissioned. You want to decommission the ones that say decommissioning - it looks like nodes 18-26.
Running the decommission on those nodes will produce the same output as when you ran it on 1, but it should mark them as decommissioned once finished. Once they're all done, the cluster should finish upgrading.
Good catch! I decommissioned them all as suggested.
I tried to re-run the finalization of the upgrade, but it's at a weird version now:
root@my-cockroachdb-preemptible-public:26257/defaultdb> RESET CLUSTER SETTING cluster.preserve_downgrade_option;
SET CLUSTER SETTING
Time: 12ms total (execution 11ms / network 1ms)
root@my-cockroachdb-preemptible-public:26257/defaultdb> show cluster setting version;
version
-------------
21.1-1124
(1 row)
Time: 230ms total (execution 229ms / network 1ms)
Before, it was showing only 21.1; now it's 21.1-1124.
I see - it just changed to 21.1-1228. I'm assuming it's running the migrations, and those "versions" are the migration steps.
If you have a look at show jobs;, you should see events related to the migration. There will be a collection of tasks being performed and it won't be instant, but it does sound like things are finally moving for you.
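To narrow the output down to just the upgrade migrations, something like this should work (assuming the 21.2 job type name MIGRATION):

SELECT job_id, description, status, fraction_completed
FROM [SHOW JOBS]
WHERE job_type = 'MIGRATION';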
We have a winner!
root@my-cockroachdb-preemptible-public:26257/defaultdb> show cluster setting version;
version
-----------
21.2
(1 row)
Thank you for your help! My problem is fixed, but I assume you might want to keep this open in order to fix it for other people encountering the same problem?
Thanks @smcvey! Yes, we should keep this open to track fixing the internal query if we're not at the right version.
Issue for improving the message for this case: https://github.com/cockroachdb/cockroach/issues/77535
Closed by #79795
Describe the problem
I just upgraded from 21.1.11 to 21.2.3, and everything is running correctly.
I tried to open the Statements and Transactions page in the Web Console, and I received the error
So I checked the XHR request and it's giving constant errors like:
I then tried to use the query suggested in the docs (https://www.cockroachlabs.com/docs/stable/ui-statements-page.html) under "Statements statistics", which returns an unexpected error:
To Reproduce
I think it's happening because of some weird state my cluster might be in, but in any case, I can reproduce it by opening the Statements page and/or running the previous query.
Expected behavior
The Statements page should load.
Additional data / screenshots
I can provide any data you need; I'm also on Slack with the same username.
Environment:
Jira issue: CRDB-11765