cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

metrics: make alerting rules opt-out, not opt-in [deprecated] #80727

Open tbg opened 2 years ago

tbg commented 2 years ago

Today, when you add a metric, adding an alerting rule is opt-in (if you even remember they exist).

It would be good for them to be opt-out instead. We don't add new metrics frequently, so this (admittedly annoying) nudge is worth it, since it makes it a lot more likely that alerting rules are added along with new metrics. As a second-order effect, it also reminds folks that alerting rules exist at the moment when they should remember (or learn) about them.

We could do this by adding to this test:

https://github.com/cockroachdb/cockroach/blob/2f8938ffb741afca9fed265c3e7bad34d53c9193/pkg/server/status_test.go#L994

which forces each new metric to also be reflected in the chart catalog.
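A minimal sketch of what such an opt-out check could look like, assuming the test can obtain the list of registered metric names and the set of metrics referenced by alerting rules. The helper `uncoveredMetrics` and its parameters are hypothetical and do not exist in the codebase; the real test in `status_test.go` may be structured quite differently.

```go
package main

import "sort"

// uncoveredMetrics is a hypothetical helper, not part of CockroachDB.
// It returns, in sorted order, every metric name that has neither an
// alerting rule (hasRule) nor an explicit opt-out entry (optedOut).
func uncoveredMetrics(metrics []string, hasRule, optedOut map[string]bool) []string {
	var missing []string
	for _, name := range metrics {
		if hasRule[name] || optedOut[name] {
			continue // covered by a rule, or deliberately opted out
		}
		missing = append(missing, name)
	}
	sort.Strings(missing)
	return missing
}
```

A test would then fail whenever the returned slice is non-empty, so the author of a new metric either adds an alerting rule or adds the metric to the opt-out list in the same change.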

Adding this to CRDB-25656 for clean-up. CRDB-25656 adopts a different approach that would see us remove the alerting rules codification altogether. If we follow that line of thinking, we should purge the alerting rules code from the codebase to avoid confusion.

Jira issue: CRDB-15532 Epic: CRDB-25656

tbg commented 1 year ago

At the time of writing we have 9 alerting rules in code:

createAndRegisterNodeCertExpiryRule in github.com/cockroachdb/cockroach/pkg/server/serverrules/metric_rules.go
createAndRegisterNodeRestartRule in github.com/cockroachdb/cockroach/pkg/server/serverrules/metric_rules.go
createAndRegisterUnavailableRangesRule in github.com/cockroachdb/cockroach/pkg/kv/kvserver/metric_rules.go
createAndRegisterNodeCACertExpiryRule in github.com/cockroachdb/cockroach/pkg/server/serverrules/metric_rules.go
createAndRegisterUnderReplicatedRangesRule in github.com/cockroachdb/cockroach/pkg/kv/kvserver/metric_rules.go
createAndRegisterRequestsStuckInRaftRule in github.com/cockroachdb/cockroach/pkg/kv/kvserver/metric_rules.go
createAndRegisterTrippedReplicaCircuitBreakersRule in github.com/cockroachdb/cockroach/pkg/kv/kvserver/metric_rules.go
createAndRegisterHighOpenFDCountRule in github.com/cockroachdb/cockroach/pkg/kv/kvserver/metric_rules.go
createAndRegisterNodeCapacityLowRule in github.com/cockroachdb/cockroach/pkg/kv/kvserver/metric_rules.go

We have a similar number of aggregation rules (seven), which seem to exist mostly to support the alerting rules:

createAndRegisterClusterCapacityAvailableRule in github.com/cockroachdb/cockroach/pkg/kv/kvserver/metric_rules.go
createAndRegisterClusterCapacityAvailableRatioRule in github.com/cockroachdb/cockroach/pkg/kv/kvserver/metric_rules.go
createAndRegisterClusterCapacityRule in github.com/cockroachdb/cockroach/pkg/kv/kvserver/metric_rules.go
createAndRegisterNodeCapacityAvailableRule in github.com/cockroachdb/cockroach/pkg/kv/kvserver/metric_rules.go
createAndRegisterNodeCapacityRule in github.com/cockroachdb/cockroach/pkg/kv/kvserver/metric_rules.go
createAndRegisterCapacityAvailableRatioRule in github.com/cockroachdb/cockroach/pkg/kv/kvserver/metric_rules.go
createAndRegisterNodeCapacityAvailableRatioRule in github.com/cockroachdb/cockroach/pkg/kv/kvserver/metric_rules.go

None of this is used for anything.