Job start/end information for the flux-power-mgr module

tpatki commented 3 years ago

The current flux-power-mgr module aggregates power across the entire flux instance on rank 0. It builds a simple DAG for scalability, samples every second, aggregates power, and then reports instance-level node, GPU, CPU, memory power in the KVS. This module needs to be informed of job start/stop boundaries, along with job allocation and rank 0 for each job in the instance instead. Need some suggestions on how this association can be done.

https://github.com/rountree/flux-power-mgr/tree/cleanup_Oct2020

Tagging @rountree as well.

garlick commented 3 years ago

For making a power report part of the KVS job record at the end, one idea would be to have a job manager "jobtap" plugin for power register a cleanup activity that would occur between the job terminating and the job entering the INACTIVE state, when the job record in the KVS is supposed to be complete.

The plugin could make an RPC to the power module indicating the job ID, start, and end time. The power module could use the job ID to fetch the job's resource assignment R from the KVS, calculate power usage, and return it to the plugin. The plugin could then make the power record part of the job record (for example by posting an event to the job eventlog) and allow the job to transition to INACTIVE.

(Some work on jobtap plugin interface required, mentioned in #3755)

I kind of like the idea of placing power info in the job eventlog. If you decided you wanted to add periodic instantaneous power data or log changes to power caps, then you would have one place in which those events can be seen in the context of other time stamped job events (like job exceptions, or eventually grow/shrink).

garlick commented 3 years ago

Following up on today's ☕ discussion:

One possible way to proceed to a correct but unoptimized first cut might be:

Modify/rewrite flux-power-mgr to collect data without reducing it, so that rank 0 has power values for all nodes
Keep that data in a circular buffer
A jobtap plugin could make periodic queries to flux-power-mgr on behalf of the job to gather power data for the nodes it is running on
Jobtap could integrate samples over time and emit this data at a configurable resolution and degree of reduction in the job eventlog. A default could be to emit one "event" for the entire job when it ends.

So the proposal is that the data in the circular buffer is wide in the sense that each time step would include a value from every node, but potentially shallow in the time dimension since only enough data would need to be kept to ensure a time step is not clobbered before jobtap can retrieve it on its own sampling interval.

Come to think of it, possibly the circular buffer could be done away with or kept very shallow if we implemented a streaming RPC protocol between jobtap and flux-power-mgr:

jobtap sends request containing jobid and options for reduction
flux-power-mgr looks up R by job ID in the KVS to get nodes used by job
flux-power-mgr starts sending responses to jobtap containing reduced power sample data, and repeats for each time step
jobtap does whatever integration necessary to adapt flux-power-mgr sample interval to its configured sample period, and writes events to job eventlog as needed.
upon job exit, RPC is terminated by jobtap and flux-power-mgr can forget the job state (e.g. the original request and R)

I think it would be wise to keep this simple at first. For example, just have every rank send a separate RPC directly to rank 0 with its sample data initially. Later, some "reduction" (even if that is only to combine message payloads) could be implemented.

FWIW the heartbeat-synchronized callbacks I mentioned can be implemented with these: https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man3/flux_sync_create.html

You asked about a good circular buffer implementation. I don't have a great recommendation, but if I were doing it I would probably start with a zlist_t, and each time I add a time step to the end of the list with zlist_append(), I would check zlist_size() and pop one off the other end with zlist_pop() if the list is overfull. As far as what goes on the list, maybe a jansson json_t object containing all the data from one time step? Whatever keeps it as simple as possible.

Some docs on jobtap plugin interfaces: https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man3/flux_jobtap_get_flux.html

I think we could assist with some early prototyping if we can first agree on a good design. Consider the above just a straw man!

garlick commented 3 years ago

This might be a better jobtap reference: https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man7/flux-jobtap-plugins.html

tpatki commented 3 years ago

Thank you, @garlick, @grondo and @SteVwonder for an insightful discussion. I remember discussing a system-instance (monitor + cap if needed) versus a job-level monitoring (no capping, we don't want regular users doing capping) plugin with @dongahn in the past.

I would lean toward developing both plugins, because we would need something at the system level keeping track of all nodes, even if no jobs are running on them, or if a job crashes, for example. Another advantage of a system-level instance that only monitors node-level power is for predicting power swings etc, which need more aggregated data. What we currently have is good for system-level instance, and we can extend it to include node hostnames if needed, or figure out a way to push the collected samples to a database or backend.

And we should do the job-level power monitoring one with jobtap + temporary buffer as Jim recommended.

garlick commented 1 year ago

Following up on discussion today, a plan that was discussed was

convert existing power module from putting samples in the KVS to keeping a circular buffer
configuration of sample interval and buffer entry TTL should eventually be configurable but for now could be hard wired
add RPC handler that allows summary power information to be queried based on a time interval (job start and end times) and an idset of broker ranks (nodes the job ran on)
add a jobtap plugin that makes this query at the end of a job and posts the summary as an event in the job eventlog

grondo commented 1 year ago

After a discussion on slack today, it seems like more clarification might be required on some of the above items:

add RPC handler that allows summary power information to be queried based on a time interval (job start and end times) and an idset of broker ranks (nodes the job ran on)

Note: This RPC handler would be in the power module, not the jobtap plugin. The RPC client would be responsible for sending start, end and ranks as input parameters to the query.

add a jobtap plugin that makes this query at the end of a job and posts the summary as an event in the job eventlog

After some discussion, one of the requirements here is to be able to get the full sample log for a job (so that a power timeline can be made availabl). While it will be useful to post a single aggregate number to the job eventlog, the jobtap plugin could also write a full sample set in whatever format to the job KVS directory as part of this final action. Then that would be available until the job KVS directory is purged.

Because the jobtap plugin will need to make a few asynchronous RPCs after the job finish event, it should use an epilog action to ensure the job does not transition to INACTIVE before the final data is emitted into the eventlog and/or the KVS.

Note: the above all assumes a user will only query power information about a job after the job has finished. If we want to allow a query at runtime, there are many possible solutions

the jobtap plugin can make periodic queries and add journal-only annotations to the job. At any given time flux jobs could then display the aggregate power usage for a job up to the last query.
The power module RPC could be opened up to guest access and a front-end utility could be added to first query a job's nodelist, then send the appropriate query to the power module, thus getting raw results directly.

flux-framework / flux-core

Job start/end information for the flux-power-mgr module #3752