flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Monitoring (and control) across hierarchical instances #3127

Open tpatki opened 4 years ago

tpatki commented 4 years ago

Opening an issue corresponding to our Slack discussion here.

For now, we are assuming one Flux broker per node (system instance) for the power aggregation module. We were also discussing having multiple ranks per node, as well as multiple instances (with parent-child relationships between them). Monitoring, and especially control (capping power or setting frequencies), is challenging in the latter scenario. We need to discuss how this differs between user instances (jobs) and system instances.

For power data, as well as profiling data in general, we have to aggregate across samples (per rank, say every second) and across ranks in an instance -- this can be done easily with an overlay network if we assume one "responsible" broker (system instance) per node. However, such aggregation can be challenging for user-level instances, hierarchical or otherwise. An example is having core-level Flux instances but no counters that exist at the core level -- how to accurately attribute a socket-level counter to multiple core-level Flux instances, or aggregate across them, is unclear. Control is even harder: capping power or setting DVFS frequencies. For example, if all instances try to set power caps without informing the other ranks on the node, then the last rank to set the power cap wins -- whereas ideally, the lowest power cap on the node should win.
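As a rough illustration of the "lowest cap wins" idea: each instance could record its requested cap under a shared KVS key, and a single per-node agent could apply the minimum. In the sketch below the key layout (`power.cap.request.*`) and `apply_node_power_cap()` are made up for illustration; they are not an existing flux-core or Variorum interface.

```c
/* Hypothetical sketch: each instance writes its requested node power cap
 * to power.cap.request.<name>; one per-node agent reads all requests and
 * applies the minimum.  Key layout and apply_node_power_cap() are invented
 * here for illustration only. */
#include <flux/core.h>
#include <stdio.h>
#include <stdlib.h>

/* Placeholder for the vendor-specific call (e.g., via Variorum). */
static void apply_node_power_cap (int watts)
{
    printf ("applying node power cap: %d W\n", watts);
}

int main (void)
{
    flux_t *h = flux_open (NULL, 0);
    if (!h)
        exit (1);

    /* Record this instance's request (here, 300 W). */
    flux_kvs_txn_t *txn = flux_kvs_txn_create ();
    flux_kvs_txn_pack (txn, 0, "power.cap.request.myinstance", "i", 300);
    flux_future_t *f = flux_kvs_commit (h, NULL, 0, txn);
    flux_future_get (f, NULL);
    flux_future_destroy (f);
    flux_kvs_txn_destroy (txn);

    /* The per-node agent would walk power.cap.request.*, take the minimum,
     * and apply it (directory iteration omitted for brevity). */
    int min_cap = 300;          /* result of the walk */
    apply_node_power_cap (min_cap);

    flux_close (h);
    return 0;
}
```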

We thus need a mechanism to communicate across different instances, and to label shared counters/data in some way. This can be done either by establishing a protocol through a common parent, or by some other mechanism for the scenario where two Flux instances from two different users, with no common parent, are running on the same node.

We should discuss and document what the different cases are, and what our solution for each case is:

I may have missed a few...

rountree commented 4 years ago

Here's a possible solution.

Assume a child Flux broker (correct term?) can read/write the same KVS as the parent Flux broker. Whichever instance gets around to it first can establish a publish-subscribe schema for each sharable metric it knows about (including the rate at which it is to be sampled, whether min/max/average/instantaneous values will be provided, etc.).
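Something like the following is what I have in mind for the producer side; the key layout (`monitor.schema.*`), the event topic name, and the fields are just illustrative, not an existing flux-core convention:

```c
/* Sketch: register a shareable metric under a hypothetical KVS key,
 * then publish samples on the event topic named in the schema. */
#include <flux/core.h>
#include <stdlib.h>

int main (void)
{
    flux_t *h = flux_open (NULL, 0);
    if (!h)
        exit (1);

    /* Advertise the metric: event topic, sample rate, and which stats
     * will be published. */
    flux_kvs_txn_t *txn = flux_kvs_txn_create ();
    flux_kvs_txn_pack (txn, 0, "monitor.schema.node_power",
                       "{s:s s:f s:[s,s,s]}",
                       "topic", "monitor.node_power",
                       "rate_hz", 1.0,
                       "stats", "min", "max", "avg");
    flux_future_t *f = flux_kvs_commit (h, NULL, 0, txn);
    flux_future_get (f, NULL);
    flux_future_destroy (f);
    flux_kvs_txn_destroy (txn);

    /* Later, the monitoring loop publishes samples on the agreed topic. */
    flux_future_t *fp = flux_event_publish_pack (h, "monitor.node_power", 0,
                                                 "{s:f}", "watts", 215.0);
    flux_future_get (fp, NULL);
    flux_future_destroy (fp);

    flux_close (h);
    return 0;
}
```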

A new-to-this-world broker can scan to see what is already being monitored and subscribe as needed. If what it needs isn't being monitored, or needs to be monitored at a different rate, etc., the new broker can set up its own publish/subscribe schema.
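And the consumer side might look roughly like this (same hypothetical schema key and topic as above): a newcomer looks up the schema, subscribes to its event topic, and handles samples in the reactor.

```c
/* Sketch: discover a registered metric from the (hypothetical) schema key,
 * subscribe to its event topic, and print each published sample. */
#include <flux/core.h>
#include <stdio.h>
#include <stdlib.h>

static void sample_cb (flux_t *h, flux_msg_handler_t *mh,
                       const flux_msg_t *msg, void *arg)
{
    double watts;
    if (flux_event_unpack (msg, NULL, "{s:f}", "watts", &watts) == 0)
        printf ("node power sample: %.1f W\n", watts);
}

int main (void)
{
    flux_t *h = flux_open (NULL, 0);
    const char *topic;
    if (!h)
        exit (1);

    /* Discover the event topic from the registered schema. */
    flux_future_t *f = flux_kvs_lookup (h, NULL, 0, "monitor.schema.node_power");
    if (flux_kvs_lookup_get_unpack (f, "{s:s}", "topic", &topic) < 0)
        exit (1);

    /* Subscribe and run the reactor; sample_cb fires on each sample. */
    flux_event_subscribe (h, topic);
    struct flux_msg_handler_spec tab[] = {
        { FLUX_MSGTYPE_EVENT, topic, sample_cb, 0 },
        FLUX_MSGHANDLER_TABLE_END,
    };
    flux_msg_handler_t **handlers;
    flux_msg_handler_addvec (h, tab, NULL, &handlers);
    flux_reactor_run (flux_get_reactor (h), 0);

    flux_msg_handler_delvec (handlers);
    flux_future_destroy (f);
    flux_close (h);
    return 0;
}
```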

As an added bonus, if the new-to-this-world broker doing the monitoring exits, a message gets sent out to each instance that had subscribed, giving them the chance to take over/restart monitoring and letting all of the subscribers find the new service.

Benefits:

  1. Entities interested in "rolling up" several measurements no longer have to care which level of the hierarchy is recording them, just that somebody is recording. Whatever "rolling up" means is up to the entity doing the monitoring.

  2. No distinction is made between system brokers and child (or grandchild) brokers.

  3. Applies to any telemetry, not just power.

  4. Avoids the problem of lots of instances pounding away at the same telemetry APIs.

Downsides:

  1. Some entities (GEOPM?) may want low-latency measurements and won't participate in this schema. (Supporting GEOPM is not part of El Cap.)

  2. There should be a well-known method for identifying what can be measured. Variorum should be able to handle this.

  3. Lots of potential traffic within a node if we're sampling every 10 ms.

Thoughts?

grondo commented 4 years ago

@tpatki @rountree - great discussion here! We had a bit of an offline discussion about this and came up with the following conclusions:

Therefore, my suggestion for now would be to assume that only the system instance loads the power module, and not worry about the multiple broker cases you've delineated above in the near term.

When we start to implement the monitoring/accounting system described in #2999, this issue can be one of our main use cases. We can then work on tying the power monitoring into that service as we develop it, and naturally get the benefits you've described above.

Does this make any sense, or have I missed a crucial near-term use case?

rountree commented 4 years ago

@grondo That sounds reasonable, but @dongahn has been concerned about an initial solution that requires distinguishing the system instance from other instances on the node. I tried to get around that in my suggestion by implying that all instances are really equal; it's just the first one to load the module that wins.

How would you address this issue in the near term?

dongahn commented 4 years ago

@rountree: my main concern was the complexity involved in aggregating the values through the Flux instance hierarchy, so my suggestion would be to work on a single instance. If you can demonstrate the capability within a single instance, this will be a very nice preliminary foundation on which to build.

dongahn commented 4 years ago

@grondo makes a really good point above. There are techniques we can use to load the module only in a specific instance (e.g., controlling this behavior through the rc1 script). So my suggestion is still to demonstrate this in a "single instance" in the same or a similar way that @grondo suggested. Once we know how to do this well in a single instance and have code for it, we can layer our future solutions on top of that to extend it. And yes, how to deal with shared counters, etc., can be a future topic.
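As a rough sketch of one way to gate this in code (assuming the broker's `instance-level` attribute is available and that level 0 denotes the system instance; in practice the same effect can come from simply loading the module only in the system instance's rc1 script):

```c
/* Sketch: only do power monitoring/capping work when running in the
 * system instance (instance-level 0).  Everything past the guard is a
 * placeholder. */
#include <flux/core.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main (void)
{
    flux_t *h = flux_open (NULL, 0);
    if (!h)
        exit (1);

    const char *level = flux_attr_get (h, "instance-level");
    if (!level || strcmp (level, "0") != 0) {
        fprintf (stderr, "not the system instance; doing nothing\n");
        flux_close (h);
        return 0;
    }

    /* ... initialize the power monitoring/capping service here ... */
    flux_close (h);
    return 0;
}
```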

If we can agree on enabling power monitoring and capping only on a single instance, what are the main remaining problems to solve in ~3 weeks?