admission: per tenant WorkQueue latency metrics

cockroachdb / cockroach


CockroachDB has some support for per-tenant metrics. In multi-tenant environments like CockroachDB Standard/Basic, tenants (or cluster operators) should be able to see queueing delay in admission control (AC) WorkQueues on a per-tenant basis, as well as the aggregate across all tenants. Currently, WorkQueue metrics are segmented only by priority.

This may need additional observability infrastructure, since a long-running KV server can cycle through thousands of tenants, and we should not keep exporting expensive histograms for tenants that are no longer active on that server. The typical solution to this problem in multi-tenant systems is to export delta metrics instead of cumulative metrics: when the delta for a timeseries is zero, nothing is exported. The number of exported timeseries then becomes proportional to the number of tenants active on a server.

The same approach can then be applied to replication AC queuing latency metrics.

Jira issue: CRDB-44325

We should first find out whether our infrastructure can handle the cardinality that these types of metrics would involve.

@dhartunian do you have any thoughts on this? Specifically about this part:

> This will possibly need additional observability infrastructure since a long running kv server can cycle through thousands of tenants, and we should not keep exporting expensive histograms for tenants that are not active on a kv server. The typical solution to this problem in multi-tenant systems is to export delta metrics instead of cumulative metrics, where when the delta is zero for a timeseries, nothing is exported. So the number of timeseries becomes proportional to the number of active tenants in a server.

cc @dshjoshi


admission: per tenant WorkQueue latency metrics #134987