Open jordanlewis opened 3 years ago
We'll also want to resolve #71351 somehow if we are to rely on this, or choose to sample the information and not collect it all the time.
Since https://github.com/cockroachdb/cockroach/pull/64503 was merged, there have been a number of other pebble iterator stats added to tracing (see https://github.com/cockroachdb/cockroach/pull/77512 and https://github.com/cockroachdb/cockroach/pull/94345). Some of these statistics might be a little easier to interpret than step/seek information.
So I have a few questions regarding this issue.
`EXPLAIN ANALYZE`/virtual tables, or just the meaningful ones?
We could just surface all sampled iterator stats available via tracing (keeping https://github.com/cockroachdb/cockroach/issues/71351 in mind) in their "raw" form, and then add additional fields that combine the stats with each other or with other stats so they are a little easier to interpret.

An additional thought: The current title of this issue suggests that we treat this issue as part of https://github.com/cockroachdb/cockroach/issues/77580, which calls for all pebble iterator stats to be added to `EXPLAIN ANALYZE` and `execution_statistics` (I assume in a format similar to that in which they are exposed from pebble). So perhaps we need to rename this issue "sql: improve MVCC iterator stats observability" or something like that?
Based on my discussion with @ericharmeling, I've updated the issue to also reflect the investigation into surfacing (new) iterator stats that are easily understandable by end users.
> We could just surface all sampled iterator stats available via tracing (keeping https://github.com/cockroachdb/cockroach/issues/71351 in mind) in their "raw" form, and then add additional fields that combine the stats with each other or with other stats so they are a little easier to interpret.
This sounds like the most straightforward approach.
`PointCount`/`PointsCoveredByRangeTombstones`) for exposure to the `statement_statistics` and `transaction_statistics` tables.

~The concern with this approach is that it ignores https://github.com/cockroachdb/cockroach/issues/71351. So perhaps in resolving https://github.com/cockroachdb/cockroach/issues/77580, we look into improving the exec stats recording by following one of the suggestions in that issue. This one is the easiest for me to understand at the moment:~
~do we want to sample all concurrent queries as "first"? can we improve so that we collect the sample of only the truly first? we'll need to make sure if we do collect the sample on the very first query, if that run is unsuccessful, we'll try sampling it later.~
~Update: I ran some benchmark tests, per https://github.com/cockroachdb/cockroach/issues/71351. See https://github.com/cockroachdb/cockroach/pull/96016#issuecomment-1405606891 for more details.~
It seems we could display the ratio of point count to seek count as a metric, so that statements with high values could be shown in a dashboard. Below are the stats for two different versions of the same statement after GC collection.

    SELECT
      CAST(statistics->'execution_statistics'->'mvccIteratorStats'->'pointCount'->'mean' AS INTEGER) AS pointCount,
      CAST(statistics->'execution_statistics'->'mvccIteratorStats'->'seekCount'->'mean' AS INTEGER) AS seekCount
    FROM crdb_internal.statement_statistics
    WHERE fingerprint_id = '\x9b1ad34c7012c1fb';

      pointcount | seekcount
    -------------+------------
               5 |         2
         4980000 |         2

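To make the ratio idea concrete, here is a minimal Python sketch that computes it client-side from a `statistics` JSON payload. It assumes the JSON layout implied by the query above (`execution_statistics` -> `mvccIteratorStats` -> `pointCount`/`seekCount` -> `mean`). The helper names and the threshold of 1000 are hypothetical and untuned; this is an illustration, not a proposal for the actual implementation.

```python
import json
from typing import Optional


def point_to_seek_ratio(statistics_json: str) -> Optional[float]:
    """Return mean pointCount / mean seekCount for one statistics row.

    Assumes the JSON layout used by the query above:
    execution_statistics -> mvccIteratorStats -> {pointCount, seekCount} -> mean.
    Returns None when the stats are missing or seekCount is zero.
    """
    stats = json.loads(statistics_json)
    iter_stats = stats.get("execution_statistics", {}).get("mvccIteratorStats", {})
    points = iter_stats.get("pointCount", {}).get("mean")
    seeks = iter_stats.get("seekCount", {}).get("mean")
    if points is None or not seeks:
        return None
    return points / seeks


def is_suspicious(statistics_json: str, threshold: float = 1000.0) -> bool:
    """Flag a row whose ratio exceeds a hypothetical, untuned threshold."""
    ratio = point_to_seek_ratio(statistics_json)
    return ratio is not None and ratio > threshold


def sample_row(point_mean: float, seek_mean: float) -> str:
    """Build a statistics payload shaped like the rows queried above."""
    return json.dumps({
        "execution_statistics": {
            "mvccIteratorStats": {
                "pointCount": {"mean": point_mean},
                "seekCount": {"mean": seek_mean},
            }
        }
    })


# The two rows from the output above.
for points, seeks in [(5, 2), (4980000, 2)]:
    row = sample_row(points, seeks)
    print(points, seeks, point_to_seek_ratio(row), is_suspicious(row))
```

With the two rows shown above, only the second one (4980000 points for 2 seeks) would cross a threshold of this kind. A real cutoff would need to be chosen from workload data.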
Performance issues can occur when there are high quantities of MVCC garbage data. MVCC garbage data can lead to downstream issues like resource saturation that we’ve seen in several customer workloads (e.g., the outbox pattern).
We should surface MVCC garbage information specifically in SQL Observability touchpoints such as our internal telemetry, aggregated statement and transaction statistics tables, and the console (SQL Activity pages and Insights).
This issue tracks surfacing MVCC garbage information in our aggregated statement and transaction statistics tables and console pages. We should strive for a simple and explainable metric and UX that points users to MVCC garbage accumulation. Ideally the UX should be consistent with MVCC values surfaced in the Databases page where we describe "Live" and "Non-Live" data. Specifically, we should introduce liveBytesRead and nonLiveBytesRead per execution.
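As a rough illustration of how the proposed per-execution values could roll up into a single explainable number, the Python sketch below computes the fraction of bytes read that were non-live across a set of sampled executions. Only `liveBytesRead` and `nonLiveBytesRead` come from this issue. The `ExecutionSample` type, the function name, and the choice of a byte-weighted fraction are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ExecutionSample:
    """One sampled execution (hypothetical container type)."""
    live_bytes_read: int      # corresponds to the proposed liveBytesRead
    non_live_bytes_read: int  # corresponds to the proposed nonLiveBytesRead


def non_live_fraction(samples: Iterable[ExecutionSample]) -> Optional[float]:
    """Return non-live bytes / total bytes read across all samples.

    Returns None when no bytes were read, so callers can tell
    "no data" apart from "no garbage".
    """
    live = 0
    non_live = 0
    for sample in samples:
        live += sample.live_bytes_read
        non_live += sample.non_live_bytes_read
    total = live + non_live
    return non_live / total if total else None


samples = [ExecutionSample(100, 300), ExecutionSample(100, 500)]
print(non_live_fraction(samples))
```

A fraction like this maps onto the "Live" and "Non-Live" wording already used on the Databases page, which is one reason to prefer it over raw step or seek counts.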
Related issues:
#64503 added MVCC step/seek information to `EXPLAIN(ANALYZE)`. Since we're collecting this information now, we should be able to at least sample it and include it in our statistics.

cc @maryliag @kevin-v-ngo @dongniwang
Jira issue: CRDB-10925
Jira issue: CRDB-13485
Epic: CRDB-20499