Hello! Thanks for the detailed proposal!
I agree with almost everything. The only thing: at work we send metrics directly to Graphite and just want to see actual values aggregated over a one-minute interval, so for us it's important to keep the current behaviour with counters. So it seems we need to support both ways.
And I especially agree with your thoughts about the current timer behaviour - absolutely useless :)
> send metrics directly to Graphite
Interesting - how does that work? I haven't seen options to configure output to Graphite in the code.
Keeping optional time-based counters seems reasonable - I did originally suggest that but my ticket got too long so I simplified ;)
Hm, I see go-metrics supports Graphite output, but I don't see any code that configures that; plus, if you did it at the go-metrics level it wouldn't use the existing metrics interval stuff anyway...
Our main site is written in Django and we use Celery for async and periodic jobs, so we just ask for `stats` from Centrifugo every minute and send the returned metrics to Graphite:
```python
import time

from django.conf import settings

# `Client` is the Centrifugo API client from the `cent` package used in this
# setup (the exact import path may differ between cent versions). `app` is the
# project's Celery application, and `send_to_graphite` is a project helper
# that pushes a (key, value, timestamp) tuple to Graphite.
from cent import Client


@app.task
def export_centrifugo_metrics():
    client = Client(settings.CENTRIFUGO_ADDRESS, settings.CENTRIFUGO_SECRET, timeout=2)
    stats, error = client.stats()
    if error:
        raise error
    # Metric fields reported per node by Centrifugo's stats call.
    fields = (
        "cpu_usage", "num_goroutine", "num_msg_published",
        "time_api_max", "num_api_requests", "time_client_max",
        "time_client_mean", "num_unique_clients", "num_msg_queued",
        "memory_sys", "bytes_client_in", "num_cpu", "num_clients",
        "time_api_mean", "num_client_requests", "num_msg_sent",
        "bytes_client_out", "gomaxprocs", "num_channels"
    )
    nodes = stats["nodes"]
    for node_info in nodes:
        name = node_info["name"]
        for field in fields:
            value = node_info.get(field)
            graphite_key = "centrifugo.%s.%s" % (name, field)
            send_to_graphite([graphite_key, value, time.time()])
```
lol, just realised that we have a problem with counters with this approach :)
UPD: no problem - I just forgot how metrics in Centrifugo work))
Yeah, that is equivalent to what Diamond does, although Diamond already supports remembering the last value of a counter and only submitting the difference every 10 seconds (or however often it runs).
In your setup, the lack of synchronisation between when the Celery job runs and when Centrifugo aggregates could cause races where the same values get submitted for 2 minutes and one real minute gets lost.
This is why plain counters are more general: the external tool should manage figuring out the rate of change based on when it actually executes.
Anyway, we can keep the 1-minute summaries if you want - it does save you from keeping persistent state elsewhere in your setup.
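To make "the external tool figures out the rate of change" concrete, here is a minimal sketch in Go - purely illustrative, not Diamond's or Centrifugo's actual code, and the `deltaTracker` type is invented for this example. The poller remembers the last raw counter value it saw and submits only the growth since then, so its own schedule defines the granularity:

```go
package main

import "fmt"

// deltaTracker remembers the last raw value seen for each counter so a
// poller can submit only the growth since its previous run, whatever its
// actual polling interval turns out to be.
type deltaTracker struct {
	last map[string]uint64
}

func newDeltaTracker() *deltaTracker {
	return &deltaTracker{last: make(map[string]uint64)}
}

// delta returns how much a counter grew since the previous poll. The first
// observation (or an apparent reset after a restart) yields 0.
func (d *deltaTracker) delta(name string, raw uint64) uint64 {
	prev, seen := d.last[name]
	d.last[name] = raw
	if !seen || raw < prev {
		return 0
	}
	return raw - prev
}

func main() {
	t := newDeltaTracker()
	// Simulated raw counter values from three successive polls.
	for _, raw := range []uint64{100, 160, 220} {
		fmt.Println("num_msg_published delta:", t.delta("num_msg_published", raw))
	}
}
```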
From the Gitter discussion:
First option would be to just keep the single `stats` method with no params and satisfy the requirements here by adding a `local_node_name` or UUID such that the consumer can easily filter to just the local node without more options or relying on coordinated config.

But this has one unpleasant drawback: raw counts from non-local nodes will be subtly different semantically from the local node, since they will only update once every ping interval. It's also generally quite messy.
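Purely as an illustration of that first option (the response layout here is assumed, not Centrifugo's documented format - only the `local_node_name` idea comes from the discussion above): if the response advertises the local node's name, a consumer can filter to it without knowing the name in advance:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statsResponse assumes a hypothetical shape: alongside the usual list of
// per-node metric objects, the response advertises the name of the node that
// answered, so a consumer can filter without coordinated naming config.
type statsResponse struct {
	LocalNodeName string                   `json:"local_node_name"`
	Nodes         []map[string]interface{} `json:"nodes"`
}

// localNode picks the entry whose "name" matches the advertised local name.
func localNode(resp statsResponse) map[string]interface{} {
	for _, node := range resp.Nodes {
		if node["name"] == resp.LocalNodeName {
			return node
		}
	}
	return nil
}

func main() {
	payload := []byte(`{
		"local_node_name": "node-a",
		"nodes": [
			{"name": "node-a", "num_msg_published": 120},
			{"name": "node-b", "num_msg_published": 95}
		]
	}`)
	var resp statsResponse
	if err := json.Unmarshal(payload, &resp); err != nil {
		panic(err)
	}
	fmt.Println(localNode(resp)["num_msg_published"]) // 120
}
```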
The alternative, which I think is cleaner and which I prefer, is to add a new method `stats_raw` or similar which has the following semantics:
Either way we should also:
- drop `go-metrics` and just store a struct of uint64 for counters with atomic increments and loads
- on each update, `atomic.LoadUint64(&counter)` the raw counter and store the difference between that and the count we saved last update, then save the new raw value
- in the `stats` response, return only the difference (see the sketch below)
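A rough sketch of that bookkeeping - not Centrifugo's actual implementation; the `metricsRegistry` type, its fields, and `snapshot` are invented here. Hot paths bump a plain uint64 atomically, and the periodic update loads the raw value, stores the difference against the previously saved value, and keeps the new raw value for next time:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// metricsRegistry is an illustrative stand-in for the metrics state: a raw,
// monotonically increasing counter plus the values derived from it.
type metricsRegistry struct {
	numMsgPublished uint64 // raw counter, bumped with atomic.AddUint64

	lastSeen     uint64 // raw value saved at the previous update (single updater)
	intervalDiff uint64 // difference published for the last interval
}

// incPublished is what hot paths would call instead of a go-metrics counter.
func (m *metricsRegistry) incPublished() {
	atomic.AddUint64(&m.numMsgPublished, 1)
}

// snapshot runs once per update: load the raw counter, store the difference
// against the previously saved value, then save the new raw value.
func (m *metricsRegistry) snapshot() {
	raw := atomic.LoadUint64(&m.numMsgPublished)
	m.intervalDiff = raw - m.lastSeen
	m.lastSeen = raw
}

func main() {
	m := &metricsRegistry{}
	for i := 0; i < 42; i++ {
		m.incPublished()
	}
	m.snapshot()
	// A stats response could expose both the raw count and the interval diff.
	fmt.Println("raw:", atomic.LoadUint64(&m.numMsgPublished), "interval:", m.intervalDiff)
}
```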
@banks created a separate issue for timers, closing this one, thanks a lot!
I'm looking into the metrics that Centrifugo exposes via the `stats` API call as we move to put it into production.

I have a couple of issues that are not show-stoppers, but I think they could be handled better:
1. Picking the local node out of the `stats` response requires knowing the node `name`. If a `--name` is given at runtime in someone's setup it will break. Relying on `--name` is marginally better, but requires that your centrifugo config and diamond config agree on the name to look for, and that you coordinate unique names across the cluster somehow. All of this is possible with good config management but it seems way too hard for something so simple.
2. Metrics are aggregated over a fixed interval (`node_metrics_interval`), which is OK if you are consuming them directly, but for counters it makes it impossible for an external process like diamond to track the actual number of requests in between polls. I.e. you can't get better than the configured granularity.
3. Timers are also only updated once per `node_metrics_interval`, and they report the mean/max of a moving decaying sample... So what you actually see is the "recent weighted" mean/max value as of up to 1 minute ago.

As well as that, there is also a minor race condition: `updateMetricsOnce` locks the metrics mutex and copies values from counters and histograms...

Proposal
Here is one option that I think solves all my issues. It's quite a big change, and there are certainly variations that would still work for our case if there are downsides I don't see with this solution.
- Remove `go-metrics`. It's useful and has some interesting properties, but we are already losing almost all the value by adding the interval-based update to publicly visible values. In addition, sampled histograms are the wrong way to measure latency, and without those, the rest of what we use can just be replaced with integer counters.
- Add a `local_node_only` param to `stats`, OR create a new API endpoint `localstats`.
- Remove `node_metrics_interval` and periodic updating - just return current counter values on each call to `stats` (or each ping message sent to other servers); see the sketch below.
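Under those assumptions, the end state could look roughly like this - the `counters` struct, `rawStats`, and the chosen fields are illustrative, not the actual Centrifugo API. There is no interval and no aggregation: each `stats`/`localstats` call just loads and returns whatever the counters hold at that moment, and consumers derive rates from successive raw values themselves:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counters holds raw, monotonically increasing values with no periodic
// aggregation at all; hot paths bump them with atomic adds.
type counters struct {
	numMsgPublished   uint64
	numClientRequests uint64
}

// rawStats is what a local-only stats call could return: just the current
// counter values, loaded atomically at the moment of the call.
func (c *counters) rawStats() map[string]uint64 {
	return map[string]uint64{
		"num_msg_published":   atomic.LoadUint64(&c.numMsgPublished),
		"num_client_requests": atomic.LoadUint64(&c.numClientRequests),
	}
}

func main() {
	c := &counters{}
	atomic.AddUint64(&c.numMsgPublished, 3)
	atomic.AddUint64(&c.numClientRequests, 7)
	fmt.Println(c.rawStats()) // consumers diff successive snapshots themselves
}
```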