flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

idea: instance wide stats collection service #3517

Open grondo opened 3 years ago

grondo commented 3 years ago

Currently we do have some high-level metrics available for a Flux instance: we can discern some information from the various job eventlogs and from flux dmesg, and there are per-module built-in message counters (though those are overridden if a module provides custom stats, as the kvs and job-info modules do).

Trying to diagnose a couple of recent performance issues made it clear that it would be nice to have a more formal way to gather metrics in a Flux instance. This issue describes one idea for such an interface.

The basic idea is to have an instance-wide metrics or stats aggregation service, inspired by existing projects like Etsy's StatsD or GitHub's brubeck (see also https://github.blog/2015-06-15-brubeck/).

As an experiment, we could run a brubeck instance alongside Flux and send metrics from all brokers and modules directly to its UDP server. Supposedly a brubeck instance can handle 4M metrics per second.
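For concreteness, here is a minimal sketch of what that experiment could look like: brubeck speaks the statsd line protocol over UDP, so a fire-and-forget client is just a formatted string and a `sendto()`. The names `statsd_line`, `StatsdClient`, and the `flux.*` metric names are illustrative, not an existing Flux API; 8125 is the conventional statsd port.

```python
import socket

def statsd_line(name, value, mtype):
    """Format one metric in the statsd line protocol, e.g. 'flux.kvs.ops:1|c'."""
    return f"{name}:{value}|{mtype}"

class StatsdClient:
    """Fire-and-forget UDP client for a statsd/brubeck server (sketch)."""
    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def count(self, name, n=1):
        # 'c' = counter: the server sums values per flush interval
        self.sock.sendto(statsd_line(name, n, "c").encode(), self.addr)

    def timing(self, name, ms):
        # 'ms' = timer: the server computes percentiles/means
        self.sock.sendto(statsd_line(name, ms, "ms").encode(), self.addr)
```

Since UDP is connectionless, a busy broker never blocks on the stats server; dropped packets just mean slightly lossy metrics, which is usually an acceptable trade-off for observability data.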

However, a more flux-like stats aggregation service might include an aggregator with every flux handle. Metrics could be aggregated locally and forwarded "upstream" when appropriate, perhaps on a timer or idle watcher for handles used with a reactor, and maybe some other clever way for handles that aren't being used with a reactor (if that is even necessary). If nothing else, even being able to dump aggregated statistics from a handle used in a utility may be insightful.
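To sketch the per-handle aggregation idea (all names here are hypothetical, not libflux API): updates accumulate in a local table, and at most one batched message per interval goes upstream, however many updates arrive in between.

```python
import time
from collections import defaultdict

class LocalAggregator:
    """Per-handle metric aggregator (sketch): accumulate locally, flush one
    batch upstream per interval so the upstream message rate stays bounded."""
    def __init__(self, send_upstream, interval=1.0, now=time.monotonic):
        self.send_upstream = send_upstream  # callable taking a dict of totals
        self.interval = interval
        self.now = now                      # injectable clock for testing
        self.counters = defaultdict(int)
        self.last_flush = now()

    def incr(self, name, n=1):
        self.counters[name] += n
        self.maybe_flush()

    def maybe_flush(self):
        # in a real handle this would hang off a timer or idle watcher
        if self.now() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        if self.counters:
            self.send_upstream(dict(self.counters))
            self.counters.clear()
        self.last_flush = self.now()
```

A module could call `incr()` many times per event-loop iteration while the broker still sees only one message per flush interval.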

The end result would be that metrics for the entire instance would be aggregated at the root, perhaps in a new module, which can offer services to dump all statistics or be configured to send stats periodically to graphite or other well-developed stats presentation software.

A new metrics API would be added to libflux to push metrics into a handle. We could start by supporting the standard statsd metric types.
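The standard statsd types differ in how they aggregate, which matters for the handle-local step. A minimal sketch of those semantics (the `Metrics` class and method names are illustrative, not a proposed libflux signature): counters sum, gauges keep the last value, timers collect samples so percentiles can be computed downstream.

```python
class Metrics:
    """Sketch of statsd-style metric semantics for local aggregation."""
    def __init__(self):
        self.counters = {}   # name -> running sum
        self.gauges = {}     # name -> last value wins
        self.timers = {}     # name -> list of samples (for percentiles)

    def count(self, name, n=1):
        self.counters[name] = self.counters.get(name, 0) + n

    def gauge(self, name, value):
        self.gauges[name] = value

    def timing(self, name, ms):
        self.timers.setdefault(name, []).append(ms)
```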

Probably starting with some simple counters and timers could go a long way to opening up observability into a flux instance.

As an example, take the recent case of a broker that was observed via top to be using >100% CPU, though the instance was not making much progress in handling job throughput.

If modules by default appended to a counter on every wakeup, then we might be able to check our stats service and easily see which module is consuming all the CPU time. We could reinstate the builtin module prep/check watchers to observe the amount of time every module spends sleeping in the reactor vs doing work, and perhaps in an instance that is being slow we could pinpoint the module that is taking longer to handle events over time.
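The prep/check measurement above can be sketched as follows. This is not the actual broker watcher code, just the accounting idea: prep runs right before the reactor blocks (so elapsed time since the last mark was work), and check runs right after it wakes (so elapsed time was sleep).

```python
import time

class ReactorTimer:
    """Sketch of prep/check accounting: split wall time into 'busy'
    (running handlers) vs 'idle' (sleeping in the event loop)."""
    def __init__(self, clock=time.monotonic):
        self.clock = clock       # injectable clock for testing
        self.busy = 0.0
        self.idle = 0.0
        self._mark = clock()

    def prep(self):
        # called just before blocking: time since last mark was work
        now = self.clock()
        self.busy += now - self._mark
        self._mark = now

    def check(self):
        # called just after waking: time since last mark was sleep
        now = self.clock()
        self.idle += now - self._mark
        self._mark = now
```

Exporting `busy` and `idle` as counters per module would make a CPU-hogging module stand out immediately in the aggregated stats.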

By aggregating the results first in the local handle, a module or handle user can insert multiple metrics per event loop iteration, but the overall number of messages sent upstream is fixed at some interval, so the broker is not overwhelmed by busy modules, as in the case of the keepalive messages.

garlick commented 3 years ago

Mentioning @garrettbslone here as we just discussed this a bit offline as a potential summer project.

vsoch commented 9 months ago

Did this turn into https://flux-framework.readthedocs.io/en/latest/tutorials/integrations/stats.html?h=stats#instrumenting-flux-with-statsd?

Also, the bursting approach you suggested (which we also used for the flux-metrics-api) works really nicely! It can run on an HPC system alongside a server that is connected to some broker socket, and it also works really well as a sidecar container in Kubernetes (also with access to the same shared socket); see the example here: https://github.com/flux-framework/flux-operator/tree/a83fb023a1fdd47c3eae9b519c35302da1506ec4/examples/experimental/metrics-api

Is there additional work needed for "internal to flux" to be exposed via the Python bindings that might trickle up into this design? Or something else?