Since #307 we have generic Go runtime metrics, such as memory, GC and threads.
Let's add application-level metrics for the operator itself that could be useful for Grafana boards and alerts. Suggestions:
Gauge of the number of currently managed CRD instances for SolrClouds, SolrBackups and SolrPrometheusExporters
Gauge for CRDs currently in a failure state
Reconcile stats
Successful vs failed reconcile events, broken down by kind of event
Size of the pending-operations reconcile queue (if there is such a thing)
Operation stats
For each operation type (install, upgrade, delete, backup, etc.), counts and status
The goal would be a simple Grafana board where you can filter on namespace etc. to see raw operator health and, at a glance, whether any operations are in a failure state. Further filtering by labels such as SolrCloud name would show the number of failed operations against each cluster and when they happened.