cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

sql: SHOW RANGES WITH DETAILS shows incorrect MVCC stats #108766

Closed erikgrinaker closed 1 year ago

erikgrinaker commented 1 year ago
\x
CREATE TABLE test (n INT);
INSERT INTO test SELECT generate_series(1, 1000000);
SHOW RANGES FROM TABLE test WITH DETAILS;

-[ RECORD 1 ]
start_key             | …/<TableMin>
end_key               | <after:/Max>
range_id              | 64
range_size_mb         | 52.081746000000000000
lease_holder          | 2
lease_holder_locality | cloud=local,region=local,zone=local
replicas              | {1,2,3}
replica_localities    | {"cloud=local,region=local,zone=local","cloud=local,region=local,zone=local","cloud=local,region=local,zone=local"}
voting_replicas       | {1,3,2}
non_voting_replicas   | {}
learner_replicas      | {}
split_enforced_until  | NULL
range_size            | 52081746
span_stats            | {"approximate_disk_bytes": 43660186, "intent_bytes": 0, "intent_count": 0, "key_bytes": 75000000, "key_count": 3000000, "live_bytes": 156245238, "live_count": 3000000, "sys_bytes": 879, "sys_count": 18, "val_bytes": 81245238, "val_count": 3000000}

key_count: 3000000 is not correct; there are 1 million keys in the range, not 3 million. It seems like it's adding up results from all replicas or something.
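
For reference (just spelling out the arithmetic from the output above), the numbers are consistent with a plain 3× sum over the replicas {1,2,3}:

key_count:  3 × 1,000,000  = 3,000,000
key_bytes:  3 × 25,000,000 = 75,000,000
live_bytes: 3 × 52,081,746 = 156,245,238  (exactly 3 × range_size)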

@zachlite Is this related to your recent span stats work?

Jira issue: CRDB-30628. Epic: CRDB-30635.

zachlite commented 1 year ago

@erikgrinaker, can you confirm what your expectations are for the other values? Which values would you expect to be accumulated across nodes, and which not? Currently, everything is accumulated.

erikgrinaker commented 1 year ago

I'd expect to see the range's logical pre-replication stats, basically the range's MVCC stats that you get from crdb_internal.range_stats(). We care about how many logical keys exist in the range, and their logical size.
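
As a sanity check, something along these lines should work (a minimal sketch against the v23.1 SHOW RANGES syntax; the raw_start_key column from WITH KEYS and the JSON field name are assumptions on my part):

SELECT range_id,
       crdb_internal.range_stats(raw_start_key)->'key_count' AS logical_key_count
FROM [SHOW RANGES FROM TABLE test WITH KEYS];

That should report roughly 1,000,000 keys for the single range above, independent of the replication factor.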

erikgrinaker commented 1 year ago

Also, only the leaseholder is guaranteed to have the right values. Followers can be arbitrarily stale, and can't be relied on.

zachlite commented 1 year ago

That's fair enough. Is there ever a use-case for what currently exists? I'm wondering how the current implementation got this far without anyone realizing until now. cc @knz

If I need to backtrack, that's ok. The fan-out machinery is good, but it needs to be reconfigured to only visit the leaseholders of the ranges that compose a requested span.

erikgrinaker commented 1 year ago

Is there ever a use-case for what currently exists?

It seems kind of misleading to me. I don't think anyone really cares about post-replication MVCC stats. For the disk usage stats, one might want the actual physical sum across replicas (if that's what the user cares about), but we have to be very clear about pre- vs. post-replication stats and logical vs. physical stats.
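
To illustrate the distinction with what's already in the output (a sketch; I'm assuming span_stats is a JSONB column in the WITH DETAILS form):

SELECT range_id,
       range_size,                                          -- logical, pre-replication MVCC size
       span_stats->'approximate_disk_bytes' AS disk_bytes   -- physical, currently summed across replicas
FROM [SHOW RANGES FROM TABLE test WITH DETAILS];

Whichever values we end up returning, it should be explicit which side of that line each one is on.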

I touched briefly on this in a different issue, where I suggested fetching this information via a new KV RPC call, which would always route to the leaseholder and deal with retries and outages and so on.

knz commented 1 year ago

The issue was identified previously here: #108779. The issue you have identified here, Erik, is only a symptom of it. However, it is a "serious" symptom: it wasn't clear when I enumerated the concerns in #108779 that we cared about SHOW RANGES returning stale MVCC stats.

erikgrinaker commented 1 year ago

I don't think the summing issue is covered there, but the stale stats definitely are. Or maybe this covers it:

it seems strange we hit all nodes with at least one replica. Should we not restrict the fanout to the leaseholders only?

knz commented 1 year ago

This was intended to cover it:

only data from local replicas is collected, which may be stale

erikgrinaker commented 1 year ago

The main problem here is the summing of stats across replicas. The stale stats are also an issue, but staleness is sometimes a problem, while summing is always a problem.

zachlite commented 1 year ago

The main problem here is the summing of stats across replicas. The stale stats are also an issue, but staleness is sometimes a problem, while summing is always a problem.

This feels like a reasonable threshold for release blocker / not release blocker.