centrifugal / centrifugo

Scalable real-time messaging server in a language-agnostic way. Self-hosted alternative to Pubnub, Pusher, Ably. Set up once and forever.
https://centrifugal.dev
Apache License 2.0
8.38k stars 594 forks source link

Refactor metrics to be more accurate and usefull #68

Closed banks closed 8 years ago

banks commented 8 years ago

I'm looking into the metrics that Centrifugo exposes via stats api call as we move to put it into production.

I have a couple of issues that are not show-stoppers but I think can be better:

As well as that, there is also a minor race condition:

Here is one option that I think solves all my issues, it's quite a big change and there are certainly variations that would still work for our case if there are downsides I don't see with this solution.

FZambia commented 8 years ago

Hello! Thanks for detailed proposal!

I agree with almost everything, the only thing: at work we send metrics directly to Graphite and just want to see actual values aggregated over one minute interval, so for us it's important to keep current behaviour with counters. So it seems we need to support both ways

And I especially agree with you thoughts about current timer behaviour - absolutely useless:)

banks commented 8 years ago

send metrics directly to Graphite

Interesting, how does that work - I haven't seen options to configure output to graphite in the code?

Keeping optional time-based counters seems reasonable - I did originally suggest that but my ticket got too long so I simplified ;)

banks commented 8 years ago

Hm I see go-metrics supports graphite output but I don't see any code that configures that, plus if you did it at go-metrics level it wouldn't use the existing metrics interval stuff anyway...

FZambia commented 8 years ago

Our main site written in Django and we use Celery to do async or periodic jobs, so we just ask stats from Centrifugo every minute and send returned metrics into Graphite.

@app.task
def export_centrifugo_metrics():
    client = Client(settings.CENTRIFUGO_ADDRESS, settings.CENTRIFUGO_SECRET, timeout=2)
    stats, error = client.stats()
    if error:
        raise error

    fields = (
        "cpu_usage", "num_goroutine", "num_msg_published",
        "time_api_max", "num_api_requests", "time_client_max",
        "time_client_mean", "num_unique_clients", "num_msg_queued",
        "memory_sys", "bytes_client_in", "num_cpu", "num_clients",
        "time_api_mean", "num_client_requests", "num_msg_sent",
        "bytes_client_out", "gomaxprocs", "num_channels"
    )

    nodes = stats["nodes"]
    for node_info in nodes:
        name = node_info["name"]
        for field in fields:
            value = node_info.get(field)
            graphite_key = "centrifugo.%s.%s" % (name, field)
            send_to_graphite([graphite_key, value, time.time()])
FZambia commented 8 years ago

lol, just thought that we have a problem with counters with this approach:)

UPD: no problem - forgot how metrics in Centrifugo work))

banks commented 8 years ago

Yeah that is equivalent to what Diamond does although diamond already supports remembering the last value of a counter and only submitting the differnce each 10 seconds (or however often it runs).

In your setup, sync between when celery job runs and when Centrifugo aggregates could cause races where same values get submitted fro 2 minutes and one real minute get lost.

This is why plain counters are more general and the external tool should manage figuring out rate of change base don when it actually executes.

banks commented 8 years ago

Anyway we can keep the 1 min summaries if you want - it does save you from keeping persistent state elsewhere in your setup.

banks commented 8 years ago

From gitter discussion:

First option would be to just keep single stats method with no params and satisfy the requirements here by:

But this has one unpleasant drawback: raw counts from non-local nodes will be subtly different semantically from the local node since they will only update once every ping interval. It's also generally quite messy.

The alternative which I think is cleaner and I prefer is to add a new method stats_raw or similar which has following semantics:

Either way we should also:

FZambia commented 8 years ago

@banks created separate issue for timers, closing this one, thanks a lot!