Open sumeerbhola opened 1 week ago
We should first check whether our observability infrastructure can handle the cardinality that these per-tenant metrics would introduce.
@dhartunian do you have any thoughts on this? Specifically about this part:
This will possibly need additional observability infrastructure, since a long-running kv server can cycle through thousands of tenants, and we should not keep exporting expensive histograms for tenants that are no longer active on that server. The typical solution to this problem in multi-tenant systems is to export delta metrics instead of cumulative metrics: when the delta for a timeseries is zero, nothing is exported. The number of exported timeseries then becomes proportional to the number of tenants active on a server.
cc @dshjoshi
CockroachDB has some support for per-tenant metrics. In multi-tenant environments like CockroachDB standard/basic, tenants (or cluster operators) should be able to see queueing delay in admission control (AC) WorkQueues on a per-tenant basis, as well as the aggregate across all tenants. Currently, WorkQueue metrics are segmented only by priority.
This will possibly need additional observability infrastructure, since a long-running kv server can cycle through thousands of tenants, and we should not keep exporting expensive histograms for tenants that are no longer active on that server. The typical solution to this problem in multi-tenant systems is to export delta metrics instead of cumulative metrics: when the delta for a timeseries is zero, nothing is exported. The number of exported timeseries then becomes proportional to the number of tenants active on a server.
The same approach can then be applied to replication AC queueing latency metrics.
Jira issue: CRDB-44325