Suggestions for further metrics improvements

fenek commented 8 years ago

MongooseIM version: master (2b9658e)

While working on all_metrics_are_global switch, I came up with potential TODOs or doubts that IMO should be a subject of discussion/separate PRs.

[ ] Cleanup of mongoose_metrics
[ ] Unify metrics naming convention
[ ] Move all metrics initialisation and updates to the respective modules

Cleanup of mongoose_metrics

Even though I did some refactoring in the global metrics branch, I think there is still room for improvement: possible dead exports, function/macro names that could be clearer, etc.

Unify metrics naming convention

Currently we have a strange mix of camelCase and underscored metric names. I know that some names are derived from hook names, but this means they could be nested. E.g. mam_lookup_messages could become mod_mam.hooks.mam_lookup_messages, while predefined ones like modMamFlushed would become mod_mam.flushed_messages. We could avoid the . delimiter and just nest JSON objects in the metrics API (and obviously still keep list-style names under the hood). This would require proper handling in the API module, and I think the choice depends on which style is better handled by popular monitoring software.
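To make the nesting idea concrete, here is a minimal sketch of how list-style names could be folded into nested JSON objects. It is written in Python purely for illustration; `nest_metrics` and the sample values are hypothetical and not part of MongooseIM.

```python
import json

def nest_metrics(flat):
    """Fold list-style metric names into nested dicts.

    `flat` maps a tuple of name segments to a value, e.g.
    ("mod_mam", "flushed_messages") -> 7.
    """
    nested = {}
    for path, value in flat.items():
        node = nested
        # Walk (and create) one dict level per leading segment.
        for segment in path[:-1]:
            node = node.setdefault(segment, {})
        node[path[-1]] = value
    return nested

# Hypothetical values, using the names proposed above.
flat = {
    ("mod_mam", "hooks", "mam_lookup_messages"): 12,
    ("mod_mam", "flushed_messages"): 7,
}
nested = nest_metrics(flat)
print(json.dumps(nested, indent=2))
```

The sketch does not handle a name that is both a leaf and a prefix of another name; a real naming convention would have to rule that case out.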

BTW: fetching specific metrics via API that have names constructed from a list is broken, because the provided name is never split anywhere in mongoose_api_metrics. :(
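The missing step is small: the name from the request has to be split into segments before it can match a list-style name. A hedged Python sketch of that step follows; the `.` delimiter, `split_metric_name` and `lookup` are assumptions for illustration, not the actual mongoose_api_metrics code.

```python
def split_metric_name(raw, delimiter="."):
    """Split a name taken from the request path into its segments."""
    return [segment for segment in raw.split(delimiter) if segment]

def lookup(metrics, raw_name):
    """Look up a metric stored under a list-style (tuple) name."""
    return metrics.get(tuple(split_metric_name(raw_name)))

# Hypothetical registry keyed by list-style names.
metrics = {
    ("mod_mam", "flushed_messages"): 7,
    ("sessionCount",): 42,
}
print(lookup(metrics, "mod_mam.flushed_messages"))
```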

Move all metrics to the respective modules

A long time ago it began with mod_metrics and a limited, nice-to-maintain set of metrics. Right now it is a bit awkward to have some metrics handled in the respective modules (like the auth ones) and some by mongoose_metrics_hooks. I think it should be possible to make mongoose_metrics pure. I guess this could involve avoiding macro usage in ensure_metric calls, so that all available metrics can be extracted easily and efficiently (just in case) with a single grep.
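As a rough illustration of the "single grep" idea, the Python sketch below scans source text for ensure_metric calls whose name argument is a literal. The three-argument call shape, the sample module and the `?FLUSHED` macro are assumptions made up for the example; the point is only that a macro-based name is invisible to a simple pattern.

```python
import re

# Matches calls whose name argument is a literal atom or list, e.g.
#   ensure_metric(Host, [mod_mam, flushed_messages], spiral)
# The (Host, Name, Type) argument order is assumed for illustration.
CALL = re.compile(
    r"ensure_metric\(\s*\w+\s*,\s*(\[[^\]]*\]|[a-z]\w*)\s*,\s*(\w+)\s*\)"
)

def find_metrics(source):
    """Return (name, type) pairs for every literal ensure_metric call."""
    return [(m.group(1), m.group(2)) for m in CALL.finditer(source)]

# Hypothetical Erlang snippet; the last call hides its name behind a macro.
sample = """
init(Host) ->
    mongoose_metrics:ensure_metric(Host, [mod_mam, flushed_messages], spiral),
    mongoose_metrics:ensure_metric(Host, sessionCount, counter),
    mongoose_metrics:ensure_metric(Host, ?FLUSHED, spiral).
"""
found = find_metrics(sample)
print(found)
```

Only the two literal calls are found; the macro-based one is skipped, which is the situation the proposal wants to avoid.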

GalaxyGorilla commented 7 years ago

Hey @fenek, recently I played around with MIM metrics and one thing that caught my eye was that there are quite a lot of spirals where I would have expected simple counters (see metrics definitions). What happens here is that each spiral metric has a time window (IIRC 60s by default) in which all events, e.g. requests, are counted, so the final metric gives you e.g. the number of requests in the last 60s. By default this sliding window moves on in 1s steps, which is possible because the counting happens in 1s "buckets". There are some negative aspects to this:

1. Those spirals are hard to handle for people who don't have any background with exometer. Think about operations people: they have no way of knowing that a metric called xmppStanzaCount uses a 60s time window in the background. Visualising those spirals turns out to be very irritating.
2. Usually TSDBs are capable of aggregating the necessary information to generate e.g. a "per second" metric or similar from simple counters. I surely don't know every TSDB and related tool in the world, but I have never seen one where this is not possible.
3. In my experience exometer_slot_slide based metrics like spirals and histograms are quite expensive compared to counters. IIRC there is a background process running for each of them, and obviously there is some number crunching involved since those "buckets" need to be kept up to date. I have no hard numbers to present, but I strongly suggest comparing them to normal counters.
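For readers without an exometer background, the windowing described above can be modelled in a few lines. This is a simplified Python sketch of the idea (1s buckets, 60s window), not exometer's implementation; the class and method names are chosen for the example, with `one` standing for the windowed value and `count` for the running total.

```python
from collections import deque

class Spiral:
    """Toy model of a spiral: a total plus a sliding 60s window of 1s buckets."""

    def __init__(self, window=60):
        self.window = window
        self.buckets = deque()  # (second, number of events in that second)
        self.count = 0          # total since creation, like a plain counter

    def update(self, now, n=1):
        # Assumes `now` never goes backwards between calls.
        second = int(now)
        self.count += n
        if self.buckets and self.buckets[-1][0] == second:
            self.buckets[-1] = (second, self.buckets[-1][1] + n)
        else:
            self.buckets.append((second, n))

    def one(self, now):
        # Drop buckets that have slid out of the window, then sum the rest.
        cutoff = int(now) - self.window
        while self.buckets and self.buckets[0][0] <= cutoff:
            self.buckets.popleft()
        return sum(n for _, n in self.buckets)

s = Spiral()
s.update(0)
s.update(0.5)
s.update(30, 3)
print(s.one(30), s.one(61), s.count)
```

The windowed value drops as old buckets expire while the total keeps growing, which is why a graph of the windowed value alone can be confusing without knowing about the window.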

michalwski commented 7 years ago

Hi @GalaxyGorilla,

Thanks for your feedback!

1. Re spiral metrics: a) The last-minute value is very useful in production. It easily shows you whether the load related to a given metric is changing (increasing/decreasing). b) Spiral metrics also report the total count, so they can be used as ordinary counter metrics.
2. You are right, TSDBs are capable of that.
3. You are right, they are not free. In our production systems we didn't see them become a bottleneck, though. The memory consumption of a single node is a little bit higher, but that's a constant overhead and doesn't grow over time. These metrics don't have a noticeable impact on overall MongooseIM performance.
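The point that a rate can be derived from a plain total count can be illustrated with a short Python sketch of what a TSDB typically does with successive counter samples. The function name, the sample data and the reset handling are assumptions for the example, not the behaviour of any specific TSDB.

```python
def per_second_rates(samples):
    """Derive per-second rates from (timestamp_seconds, counter_value) samples.

    Samples must be ordered by time with strictly increasing timestamps.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0
        if delta < 0:
            # Counter went backwards, e.g. after a node restart:
            # treat the new value as the increase since the reset.
            delta = v1
        rates.append((t1, delta / (t1 - t0)))
    return rates

# Hypothetical scrape results taken every 10 seconds; the last one follows a reset.
samples = [(0, 100), (10, 160), (20, 260), (30, 20)]
rates = per_second_rates(samples)
print(rates)
```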

GalaxyGorilla commented 7 years ago

Heyho @michalwski,

Whether it is "very useful" depends on how you use it. The exposed JSON/XML metrics just offer the 60s window value as far as I can see, and I encountered some people struggling with that (including myself) when I visualised these values. If you use the plain API, know what is happening in the background and don't want different "time windows", then this is surely easier to handle.

I personally ended up with a custom (very very simple) implementation of the metrics API which now serves as a prometheus scrape endpoint.
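A scrape endpoint of that kind essentially renders counters in the Prometheus text exposition format. The Python sketch below shows what such a rendering step might look like; it is not the implementation mentioned above, and the helper names and sample values are hypothetical.

```python
import re

def prometheus_name(name):
    """Flatten a list-style or plain name into a Prometheus-safe metric name."""
    joined = "_".join(name) if isinstance(name, (list, tuple)) else name
    # Prometheus metric names may only contain letters, digits, '_' and ':'.
    return re.sub(r"[^a-zA-Z0-9_:]", "_", joined)

def render(counters):
    """Render counters as text exposition lines, sorted by metric name."""
    lines = []
    for name, value in sorted(counters.items(), key=lambda kv: prometheus_name(kv[0])):
        metric = prometheus_name(name)
        lines.append(f"# TYPE {metric} counter")
        lines.append(f"{metric} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical values.
counters = {
    ("mod_mam", "flushed_messages"): 7,
    "xmppStanzaCount": 1234,
}
body = render(counters)
print(body, end="")
```

Exposing plain counters this way leaves rate calculation to the monitoring side, which matches the preference for simple counters expressed earlier in the thread.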

michalwski commented 7 years ago

All right, I see you are talking about MongooseIM's HTTP API exposing metrics. This is old code that we don't use in production. On our production systems we usually use exometer_report_graphite to push data directly from MongooseIM to Graphite or InfluxDB (which has a Graphite endpoint).
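For context, Graphite's plaintext protocol accepts one `path value timestamp` line per data point. The Python sketch below shows how both spiral values could be formatted in that protocol; the path layout (prefix, metric, data point) and the sample values are assumptions and not necessarily what exometer_report_graphite emits.

```python
def graphite_lines(prefix, metrics, timestamp):
    """Format metrics as Graphite plaintext lines: '<path> <value> <timestamp>'."""
    lines = []
    for name, datapoints in sorted(metrics.items()):
        for datapoint, value in sorted(datapoints.items()):
            lines.append(f"{prefix}.{name}.{datapoint} {value} {int(timestamp)}")
    return "\n".join(lines) + "\n"

# Hypothetical spiral with its windowed value ("one") and total ("count").
payload = graphite_lines(
    "mongooseim.node1",
    {"xmppStanzaCount": {"one": 120, "count": 98765}},
    1500000000,
)
print(payload, end="")
```

With both values pushed, a dashboard can either graph the windowed value directly or derive its own rate from the total.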

GalaxyGorilla commented 7 years ago

Oh, sorry for the late answer and thanks for clearing that up :). I was totally not aware of the graphite stuff.
