Azure / ARO-RP

Azure Red Hat OpenShift RP
https://azure.microsoft.com/products/openshift/
Apache License 2.0

[Proposal][wip]RP metrics #38

Closed mjudeikis closed 4 years ago

mjudeikis commented 4 years ago

The RP should emit statsd metrics (for compatibility with Geneva).

We have 2 ways to achieve this goal:

  1. Export all metrics using statsd.
  2. Export all metrics using Prometheus and convert them to statsd, the same way we did in V3.

Prometheus pros:
1. Wider adoption
2. Easier orchestration
3. RH stack compatibility
4. Easier future integration

Prom cons:
1. Percentiles computation may not give an overall picture as percentiles are calculated at each instance level, not at global level.
2. More memory is used as the metrics are locally stored in application memory.
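
To make the first con concrete, here is a small sketch with made-up latencies for two instances. The `percentile` helper is hypothetical (a simple nearest-rank implementation) and is not taken from any metrics library. It shows that averaging per-instance medians can differ a lot from the median over all requests.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100) of values.
// Hypothetical helper for illustration only.
func percentile(values []float64, p float64) float64 {
	if len(values) == 0 {
		return math.NaN()
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Made-up request latencies (ms): a busy fast instance and a quiet slow one.
	instanceA := []float64{10, 10, 10, 10, 10, 10, 10, 10}
	instanceB := []float64{500, 900}

	// What you get when each instance computes its own p50 and they are averaged.
	avgOfP50 := (percentile(instanceA, 50) + percentile(instanceB, 50)) / 2

	// What you get when the p50 is computed over all requests together.
	all := append(append([]float64(nil), instanceA...), instanceB...)
	globalP50 := percentile(all, 50)

	fmt.Printf("average of per-instance p50: %.0f ms\n", avgOfP50)
	fmt.Printf("p50 over all requests:       %.0f ms\n", globalP50)
}
```

With these numbers the averaged value is 255 ms while the median over all requests is 10 ms, because the per-instance figures ignore how many requests each instance served.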

Statsd pros:
1. Aggregation at the server level
2. Easy to implement (straightforward protocol)
3. Percentiles and histograms are calculated on the server side, so there is relatively less overhead in the client application.

Statsd cons:
1. Compatibility
2. Local testing will be harder. With Prometheus we can run a Prometheus instance, scrape it and test/validate, or just curl /metrics. With statsd we would need to send metrics to a "sink" for testing, e.g. `nc -u -w0 127.0.0.1 8125`.

StatsD client example: https://github.com/statsd/statsd/blob/master/examples/go/statsd.go https://godoc.org/github.com/etsy/statsd/examples/go

The proposal is to create a subpackage, pkg/metrics, with methods to record different metrics with configured dimensions/tags.

When the RP is initialised, some dimensions would need to be provided, such as region, name, and location.

Metrics would be recorded in statsd format, as in the examples below.

Example metrics:

Statsd:
aro_frontend_call.v20191231-preview.openShiftcluster.timers.t:84.2|ms
aro_frontend_call.{api-version}.{openshiftcluster,asyncoperation,openShiftclustercredentialssubscription}.timers.t:84.2|ms
Prometheus:
aro_frontend_call{api_name="openshiftcluster", version="v20191231-preview", method="post", quantile="0.5"} 0.06
# if the statsd exporter is used, it will automatically convert into a Prometheus summary
aro_frontend_call_sum{api_name="openshiftcluster", version="v20191231-preview", method="post"} 0.79
aro_frontend_call_count{api_name="openshiftcluster", version="v20191231-preview", method="post"} 20

Proposed metrics, where we use dimensions to add metadata such as rp_name, location, region, and environment:

aro_frontend_call.{version}.{api}.timers:v|ms - frontend API call execution times
aro_frontend_call.{version}.{api}.errors.{code}.counters:v|c - error counts (all 4xx, 5xx codes)
aro_frontend_call.{version}.{api}.success.{code}.counters:v|c - success counts (all 2xx)
aro_frontend_call.sessions.gauges:v|g - current open sessions

aro_backend.workers.gauges:10|g - current worker count
aro_backend.errors.counters:10|c - total errors in backend

aro_cosmodb_call.{method}.{database_name}.timers:v|ms
aro_cosmodb_call.{method}.{database_name}.counters:v|c
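
As an illustration of the naming scheme above, the following sketch picks the counter name for a finished frontend call from its status code. `frontendCounterName` is a hypothetical helper. The proposal only mentions 2xx, 4xx and 5xx, so the sketch reports other codes as "not counted" rather than guessing where they belong.

```go
package main

import "fmt"

// frontendCounterName builds the counter name for a finished frontend call.
// 2xx maps to "success" and 4xx/5xx map to "errors"; any other code is not
// covered by the proposal, so ok is false. Hypothetical helper.
func frontendCounterName(version, api string, statusCode int) (name string, ok bool) {
	var outcome string
	switch {
	case statusCode >= 200 && statusCode < 300:
		outcome = "success"
	case statusCode >= 400 && statusCode < 600:
		outcome = "errors"
	default:
		return "", false
	}
	return fmt.Sprintf("aro_frontend_call.%s.%s.%s.%d.counters", version, api, outcome, statusCode), true
}

func main() {
	for _, code := range []int{200, 404, 500} {
		if name, ok := frontendCounterName("v20191231-preview", "openshiftcluster", code); ok {
			// Increment the counter by one in statsd line format.
			fmt.Printf("%s:1|c\n", name)
		}
	}
}
```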

@jim-minter @m1kola @asalkeld WDYT?

mjudeikis commented 4 years ago

In addition, we would need to ship Go runtime metrics, such as:

import "runtime"

// memStats returns a snapshot of Go runtime memory statistics, keyed by metric name.
// perSecondCounter is a helper that is not shown here; it is expected to turn
// a cumulative count into a per-second rate.
func memStats() map[string]float64 {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    metrics := map[string]float64{
        "memory.objects.HeapObjects": float64(m.HeapObjects),
        "memory.summary.Alloc":       float64(m.Alloc),
        "memory.counters.Mallocs":    perSecondCounter("mallocs", int64(m.Mallocs)),
        "memory.counters.Frees":      perSecondCounter("frees", int64(m.Frees)),
        "memory.summary.System":      float64(m.HeapSys), // heap memory obtained from the OS
        "memory.heap.Idle":           float64(m.HeapIdle),
        "memory.heap.InUse":          float64(m.HeapInuse),
    }

    return metrics
}

jim-minter commented 4 years ago

Tracked in Google doc.