cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

CLI cockroach node status does not show correct number of ranges for each node #99702

Open daniel-crlabs opened 1 year ago

daniel-crlabs commented 1 year ago

Describe the problem

When we run cockroach node status --all, the CLI shows the total number of ranges for the entire cluster, not the range count for each node as one might expect.

To Reproduce

The CLI output shows the total ranges, which matches the UI when the entire cluster is selected. However, when you select a specific host in the UI, it displays the number of ranges for that host only, whereas the CLI still shows the range count for all nodes.

  1. The UI shows the number of all ranges when CLUSTER is selected from the dropdown, i.e. 51 (this is correct, nothing wrong here).
[Screenshot 1]
  2. When we select a specific host, the UI updates and shows only the number of ranges for that host, i.e. 16 (this is correct, nothing wrong here).
[Screenshot 2]
  3. Unexpected behavior: when we run cockroach node status --all, the CLI shows the total number of ranges for the entire cluster (ranges = 51) on every node, not the range count for that node (ranges = 16), as one might expect. Each node reports the cluster-wide figure (resembling screenshot # 1 above); shouldn't this be the number of ranges on that particular node (resembling screenshot # 2 above)?
[root@cockroachdb-0 cockroach]# cockroach node status --all --certs-dir /cockroach/cockroach-certs --format records
-[ RECORD 1 ]
id                     | 1
address                | cockroachdb-0.cockroachdb.cockroach-sts-secure.svc.cluster.local:26257
sql_address            | cockroachdb-0.cockroachdb.cockroach-sts-secure.svc.cluster.local:26257
build                  | v22.2.5
started_at             | 2023-03-21 20:01:29.303301 +0000 UTC
updated_at             | 2023-03-22 14:25:30.662009 +0000 UTC
locality               | region=us-east,zone=us-east-1
is_available           | true
is_live                | true
replicas_leaders       | 16
replicas_leaseholders  | 16
ranges                 | 51
ranges_unavailable     | 0
ranges_underreplicated | 0
live_bytes             | 135692100
key_bytes              | 564240
value_bytes            | 136558228
range_key_bytes        | 0
range_value_bytes      | 0
intent_bytes           | 0
system_bytes           | 29023
gossiped_replicas      | 51
is_decommissioning     | false
membership             | active
is_draining            | false
-[ RECORD 2 ]
id                     | 2
address                | cockroachdb-1.cockroachdb.cockroach-sts-secure.svc.cluster.local:26257
sql_address            | cockroachdb-1.cockroachdb.cockroach-sts-secure.svc.cluster.local:26257
build                  | v22.2.5
started_at             | 2023-03-22 13:02:01.309179 +0000 UTC
updated_at             | 2023-03-22 14:25:33.303563 +0000 UTC
locality               | region=us-east,zone=us-east-1
is_available           | true
is_live                | true
replicas_leaders       | 20
replicas_leaseholders  | 20
ranges                 | 51
ranges_unavailable     | 0
ranges_underreplicated | 0
live_bytes             | 135634805
key_bytes              | 564216
value_bytes            | 136500889
range_key_bytes        | 0
range_value_bytes      | 0
intent_bytes           | 0
system_bytes           | 29023
gossiped_replicas      | 51
is_decommissioning     | false
membership             | active
is_draining            | false
-[ RECORD 3 ]
id                     | 3
address                | cockroachdb-2.cockroachdb.cockroach-sts-secure.svc.cluster.local:26257
sql_address            | cockroachdb-2.cockroachdb.cockroach-sts-secure.svc.cluster.local:26257
build                  | v22.2.5
started_at             | 2023-03-22 13:02:02.69602 +0000 UTC
updated_at             | 2023-03-22 14:25:31.744918 +0000 UTC
locality               | region=us-east,zone=us-east-1
is_available           | true
is_live                | true
replicas_leaders       | 15
replicas_leaseholders  | 15
ranges                 | 51
ranges_unavailable     | 0
ranges_underreplicated | 0
live_bytes             | 135577510
key_bytes              | 564144
value_bytes            | 136443462
range_key_bytes        | 0
range_value_bytes      | 0
intent_bytes           | 0
system_bytes           | 29023
gossiped_replicas      | 51
is_decommissioning     | false
membership             | active
is_draining            | false
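
One way to sanity-check these numbers is to sum replicas_leaseholders across the nodes and compare the total with the ranges value each node reports. The following is a minimal sketch, assuming the records output has been captured as text; parse_records is a hypothetical helper written for this illustration, not part of the cockroach CLI, and the sample is abridged from the output above.

```python
def parse_records(text):
    """Split `--format records` style output into one dict per record.

    Hypothetical helper for illustration; it only understands the
    "-[ RECORD n ]" separators and "key | value" lines shown above.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("-[ RECORD"):
            records.append({})
        elif "|" in line and records:
            key, _, value = line.partition("|")
            records[-1][key.strip()] = value.strip()
    return records


# Abridged from the output above: only the fields relevant here.
SAMPLE = """\
-[ RECORD 1 ]
id                     | 1
replicas_leaseholders  | 16
ranges                 | 51
-[ RECORD 2 ]
id                     | 2
replicas_leaseholders  | 20
ranges                 | 51
-[ RECORD 3 ]
id                     | 3
replicas_leaseholders  | 15
ranges                 | 51
"""

nodes = parse_records(SAMPLE)
total_leaseholders = sum(int(n["replicas_leaseholders"]) for n in nodes)

for n in nodes:
    print(f"node {n['id']}: ranges={n['ranges']} "
          f"leaseholders={n['replicas_leaseholders']}")
print("sum of leaseholders across nodes:", total_leaseholders)
```

In this sample the per-node leaseholder counts (16 + 20 + 15) add up to 51, the same value every node reports for ranges.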

Expected behavior

The CLI output of cockroach node status --all should display the correct number of ranges for each given node.

Jira issue: CRDB-26029

gz#16399

aliher1911 commented 1 year ago

I looked into what is actually shown on the charts and in the CLI.

In this particular case, we have 3 nodes and each holds replicas for all ranges in the system, while each node's leaseholders make up about 1/3 of the ranges. If we ran an experiment with 6 nodes, the CLI would report only a subset of the ranges in ranges, and the value would differ from node to node. The same goes for ranges in the UI, where each node would show about 1/6 of the ranges.
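
The effect described here can be illustrated with a toy model. This is only a sketch under stated assumptions: a replication factor of 3, replicas placed round-robin on consecutive nodes, and the first replica treated as the leaseholder. The place function is hypothetical and does not reflect how CockroachDB actually allocates replicas or leases.

```python
from collections import Counter


def place(num_ranges, num_nodes, replication_factor=3):
    """Toy placement: count replicas and leaseholders per node.

    Assumption for illustration only: replicas go on consecutive nodes
    (round-robin) and the first replica holds the lease.
    """
    replicas = Counter()
    leaseholders = Counter()
    for r in range(num_ranges):
        nodes = [(r + i) % num_nodes for i in range(replication_factor)]
        for n in nodes:
            replicas[n] += 1
        leaseholders[nodes[0]] += 1
    return replicas, leaseholders


for num_nodes in (3, 6):
    replicas, leaseholders = place(num_ranges=51, num_nodes=num_nodes)
    print(f"{num_nodes} nodes")
    for n in range(num_nodes):
        print(f"  node {n + 1}: replicas={replicas[n]} "
              f"leaseholders={leaseholders[n]}")
```

With 3 nodes, every node in this model holds a replica of all 51 ranges and 17 leases. With 6 nodes, each node holds roughly half of the replicas (24 to 27) and roughly a sixth of the leases (8 or 9), so the per-node replica count no longer equals the cluster-wide range count.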

I'm not sure what the solution would be here besides documentation, since changing the naming could confuse existing customers who are used to the current naming.

daniel-crlabs commented 1 year ago

Thank you for looking into this.

ranges in WebUI is the number of leaseholders (or, for a range with no leaseholder, replicas that have this node as the first replica in their list), so it is the number of ranges that this node serves.

This is definitely confusing, especially since the UI has a specific graph for each of these (ranges, replicas and leaseholders per node) as you can see below:

[Three screenshots of the UI graphs, taken 2023-03-27]

The point of this issue, however, relates more specifically to the CLI:

ranges in CLI is the number of replicas on a particular node (this corresponds to the Replicas per Node chart on the same replication dashboard)

This is exactly the point of this bug report: this is not what the CLI is showing for ranges. The CLI is NOT showing the number of replicas on a particular node; it is showing the number of replicas for all nodes combined. In the example below, 52 is the total number of ranges for the cluster, so if this were correct, it should show 16 (the number of ranges on a particular node).

Are you saying the CLI ranges = WebUI replicas per node ? If so, it seems the CLI needs to be fixed, so instead of saying ranges, it should say replicas per node.

[root@cockroachdb-0 cockroach]# cockroach node status --all --certs-dir /cockroach/cockroach-certs --format records | egrep "id|replicas_leaders|replicas_leaseholders|ranges"
id                     | 1
replicas_leaders       | 16
replicas_leaseholders  | 16
ranges                 | 52
ranges_unavailable     | 0
ranges_underreplicated | 0

id                     | 2
replicas_leaders       | 16
replicas_leaseholders  | 16
ranges                 | 52
ranges_unavailable     | 0
ranges_underreplicated | 0

id                     | 3
replicas_leaders       | 20
replicas_leaseholders  | 20
ranges                 | 52
ranges_unavailable     | 0
ranges_underreplicated | 0

aliher1911 commented 1 year ago

Are you saying the CLI ranges = WebUI replicas per node ? If so, it seems the CLI needs to be fixed, so instead of saying ranges, it should say replicas per node.

I think that would be reasonable. Maybe just replicas would do, as we have replicas_leaseholders, which is a subset of the counter in question.

daniel-crlabs commented 1 year ago

That sounds good, just trying to make it more consistent, whatever we decide to call it :-)