Refactor metrics to be more accurate and usefull

banks commented 8 years ago

I'm looking into the metrics that Centrifugo exposes via stats api call as we move to put it into production.

I have a couple of issues that are not show-stoppers but I think can be better:

Our typical monitoring is to use diamond collector on localhost to gather metrics once every 5 seconds from process. Current stats API is awkward because:
- Output includes all nodes in cluster and to figure out which is current one means poking around trying to match name to _, That's not just fiddly, it feels like a hack to rely on knowing how centrifugo chooses to name nodes, and if --name is given at runtime in someone's setup it will break. Relying on --name is marginally better but requires that your centrifugo config and diamond config agree on the name to look for, and that you coordinate unique names across the cluster somehow. All of this is possible with good config management but it seems way to hard for something so simple.
- The counter values reset periodically (every node_metrics_interval). Which is OK if you are consuming them directly, but for counters it makes it impossible for external process like diamond to track actual number of requests in between polls. I.e. you can't get better than configured granularity.
- In addition some of the actual stats (the times) are not really very useful. Not a show stopper (I don't need to monitor those right now), but:
- exponential decay sampled histogram is already a little difficult to reason about - it shows "recently" weighted results but you don't know _how_recent. But then Centrifugo only shows you the value once every node_metrics_interval of a moving decaying sample... So what you actually see is the "recent weighted" mean/max value as of up to 1 minute ago.
- Actually using these for latencies is plain wrong, I went into detail but it's better to just link to: https://groups.google.com/forum/#!msg/mechanical-sympathy/I4JfZQ1GYi8/ocuzIyC3N9EJ which is talking directly about the Dropwizard implementation that go-metrics is a port of.

As well as that, there is also a minor race condition:

updateMetricsOnce locks the metrics mutex and copies values from counters and histograms
then it unlocks
then it clears the counters
any request that came in in between reading counter values and clearing will be lost (we correctly don't lock mutex to increment since go-metrics already handles concurrency)
Proposal

Here is one option that I think solves all my issues, it's quite a big change and there are certainly variations that would still work for our case if there are downsides I don't see with this solution.

Stop using go-metrics. It's useful and has some interesting properties, but we are already losing almost all the value by adding the interval based update to publically visible values. In addition, sampled histograms are wrong way to measure latency, and without those, the rest of what we use can just be replaced with integer counters.
Make there be a local_node_only param to stats OR create new api endpoint localstats
Get rid of node_metrics_interval and periodic updating - just return current counter values on each call to stats (or each ping message sent to other servers)
- If others really rely on the 1 minutes summary counts we could include both, but simple counters are much more useful in general
- Just use plain old atomic uint64 for all counters and don't worry about locking
Get rid of the timers. If we really want them we should use something like https://www.godoc.org/github.com/vulcand/vulcan/metrics#RollingHistogram which uses a robust measurement technique (HDRHistrogram) and supports rolling window of these to have an accurate distribution of the last N minutes etc. If we did that, I'd recommend having something like 15 buckets of 1 min, with each histogram having range 100µs to 10 mins (600mill µs) with 3 significant figures. That gives you very high definition for last 15 mins in a bit over 1MB per timer (on my mac). That seems very reasonable if we have ~20 timers on a modern machine. We could make it configurable if we want to support people on super low memory VPS or something. Then expose in the results from stats at least:
- mean
- 50%ile
- 99%ile
- 99.99%ile
- max
- Looks like it's not threadsafe though so would need to put a mutex around it or send results on a lightly buffered channel and have it update in single goroutine

FZambia commented 8 years ago

Hello! Thanks for detailed proposal!

I agree with almost everything, the only thing: at work we send metrics directly to Graphite and just want to see actual values aggregated over one minute interval, so for us it's important to keep current behaviour with counters. So it seems we need to support both ways

And I especially agree with you thoughts about current timer behaviour - absolutely useless:)

banks commented 8 years ago

send metrics directly to Graphite

Interesting, how does that work - I haven't seen options to configure output to graphite in the code?

Keeping optional time-based counters seems reasonable - I did originally suggest that but my ticket got too long so I simplified ;)

banks commented 8 years ago

Hm I see go-metrics supports graphite output but I don't see any code that configures that, plus if you did it at go-metrics level it wouldn't use the existing metrics interval stuff anyway...

FZambia commented 8 years ago

Our main site written in Django and we use Celery to do async or periodic jobs, so we just ask stats from Centrifugo every minute and send returned metrics into Graphite.

@app.task
def export_centrifugo_metrics():
    client = Client(settings.CENTRIFUGO_ADDRESS, settings.CENTRIFUGO_SECRET, timeout=2)
    stats, error = client.stats()
    if error:
        raise error

    fields = (
        "cpu_usage", "num_goroutine", "num_msg_published",
        "time_api_max", "num_api_requests", "time_client_max",
        "time_client_mean", "num_unique_clients", "num_msg_queued",
        "memory_sys", "bytes_client_in", "num_cpu", "num_clients",
        "time_api_mean", "num_client_requests", "num_msg_sent",
        "bytes_client_out", "gomaxprocs", "num_channels"
    )

    nodes = stats["nodes"]
    for node_info in nodes:
        name = node_info["name"]
        for field in fields:
            value = node_info.get(field)
            graphite_key = "centrifugo.%s.%s" % (name, field)
            send_to_graphite([graphite_key, value, time.time()])

FZambia commented 8 years ago

lol, just thought that we have a problem with counters with this approach:)

UPD: no problem - forgot how metrics in Centrifugo work))

banks commented 8 years ago

Yeah that is equivalent to what Diamond does although diamond already supports remembering the last value of a counter and only submitting the differnce each 10 seconds (or however often it runs).

In your setup, sync between when celery job runs and when Centrifugo aggregates could cause races where same values get submitted fro 2 minutes and one real minute get lost.

This is why plain counters are more general and the external tool should manage figuring out rate of change base don when it actually executes.

banks commented 8 years ago

Anyway we can keep the 1 min summaries if you want - it does save you from keeping persistent state elsewhere in your setup.

banks commented 8 years ago

From gitter discussion:

First option would be to just keep single stats method with no params and satisfy the requirements here by:

keeping aggregated counts
also have raw counts
include a top level key local_node_name or UUID such that consumer can easily filter to just local node without more options or relying on coordinated config.

But this has one unpleasant drawback: raw counts from non-local nodes will be subtly different semantically from the local node since they will only update once every ping interval. It's also generally quite messy.

The alternative which I think is cleaner and I prefer is to add a new method stats_raw or similar which has following semantics:

Only for local node
Outputs raw counter values (plus the other things like num cpus/goroutines etc.)

Either way we should also:

remove timers (keep them with 0 value in stats output for now to not upset any code relying on them)
remove go-metrics and just store struct of uint64 for counters with atomic increments and loads
- metrics struct will need to keep separate set of counters for aggregation (i.e. remember the total value last interval as well as the value we return)
- each interval we simply atomic.LoadUint64(&counter) the raw counter and store the difference between that and the count we saved last update, then save the new raw value
- then when we generate stats response we return only the difference

FZambia commented 8 years ago

@banks created separate issue for timers, closing this one, thanks a lot!

centrifugal / centrifugo

Refactor metrics to be more accurate and usefull #68

Proposal