cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

metrics: need a way to see all metrics that are affected by server.child_metrics.enabled #124343

Open rafiss opened 4 months ago

rafiss commented 4 months ago

Is your feature request related to a problem? Please describe.

The server.child_metrics.enabled cluster setting enables exporting child metrics with additional labels in Prometheus. There is currently no way to see which metrics would be affected if the setting is enabled.

Describe the solution you'd like

Document which metrics are affected. Ideally, this would be documented automatically in docs/generated/metrics/metrics.html.

Describe alternatives you've considered

Look at usages of AggGauge, AggCounter, AggHistogram, and the related agg-metric types in the code to get a sense of which metrics are impacted.

Additional context

This question came up in: https://cockroachlabs.slack.com/archives/C012GFANG5R/p1715910844186129?thread_ts=1715900517.125659&cid=C012GFANG5R

Jira issue: CRDB-38839

abarganier commented 4 months ago

FWIW, I manually came up with a list of all the agg-metrics by looking for usages of the agg metrics library on master (see below). I believe the generated docs make use of the metric metadata, which AFAIK does not record whether a metric is an agg-metric. We might have to do something like use type assertions against the individual metrics in the code gen to get hold of this info.

Current list of agg-metrics (compiled manually, so I may have missed a few):

- changefeed.error_retries
- changefeed.emitted_messages
- changefeed.emitted_batch_sizes
- changefeed.filtered_messages
- changefeed.message_size_hist
- changefeed.emitted_bytes
- changefeed.flushed_bytes
- changefeed.flushes
- changefeed.size_based_flushes
- changefeed.parallel_io_queue_nanos
- changefeed.parallel_io_pending_rows
- changefeed.parallel_io_result_queue_nanos
- changefeed.parallel_io_in_flight_keys
- changefeed.sink_io_inflight
- changefeed.sink_batch_hist_nanos
- changefeed.flush_hist_nanos
- changefeed.commit_latency
- changefeed.admit_latency
- changefeed.backfill_count
- changefeed.backfill_pending_ranges
- changefeed.running
- changefeed.batch_reduction_count
- changefeed.internal_retry_message_count
- changefeed.schema_registry.retry_count
- changefeed.schema_registry.registrations
- changefeed.aggregator_progress
- changefeed.checkpoint_progress
- changefeed.lagging_ranges
- changefeed.cloudstorage_buffered_bytes
- changefeed.kafka_throttling_hist_nanos
- tenant.consumption.request_units
- tenant.consumption.kv_request_units
- tenant.consumption.read_batches
- tenant.consumption.read_requests
- tenant.consumption.read_bytes
- tenant.consumption.write_batches
- tenant.consumption.write_requests
- tenant.consumption.write_bytes
- tenant.consumption.sql_pods_cpu_seconds
- tenant.consumption.pgwire_egress_bytes
- tenant.consumption.external_io_egress_bytes
- tenant.consumption.external_io_ingress_bytes
- tenant.consumption.cross_region_network_ru
- livebytes
- keybytes
- valbytes
- rangekeybytes
- rangevalbytes
- totalbytes
- intentbytes
- lockbytes
- livecount
- keycount
- valcount
- rangekeycount
- rangevalcount
- intentcount
- lockcount
- intentage
- gcbytesage
- sysbytes
- syscount
- abortspanbytes
- kv.tenant_rate_limit.num_tenants
- kv.tenant_rate_limit.current_blocked
- kv.tenant_rate_limit.read_batches_admitted
- kv.tenant_rate_limit.write_batches_admitted
- kv.tenant_rate_limit.read_requests_admitted
- kv.tenant_rate_limit.write_requests_admitted
- kv.tenant_rate_limit.read_bytes_admitted
- kv.tenant_rate_limit.write_bytes_admitted
- security.certificate.expiration.ca
- security.certificate.expiration.client-ca
- security.certificate.expiration.ca-client-tenant
- security.certificate.expiration.ui-ca
- security.certificate.expiration.client
- security.certificate.expiration.client-tenant
- security.certificate.expiration.node
- security.certificate.expiration.node-client
- security.certificate.expiration.ui
- jobs.row_level_ttl.span_total_duration
- jobs.row_level_ttl.select_duration
- jobs.row_level_ttl.delete_duration
- jobs.row_level_ttl.rows_selected
- jobs.row_level_ttl.rows_deleted
- jobs.row_level_ttl.num_active_spans
- jobs.row_level_ttl.total_rows
- jobs.row_level_ttl.total_expired_rows
- rpc.connection.healthy
- rpc.connection.unhealthy
- rpc.connection.inactive
- rpc.connection.healthy_nanos
- rpc.connection.unhealthy_nanos
- rpc.connection.heartbeats
- rpc.connection.failures
- rpc.connection.avg_round_trip_latency