flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

flux-top: display CPU% utilization #3988

Open garlick opened 2 years ago

garlick commented 2 years ago

A use case for flux top brought up by the AHA Moles team was monitoring job ensembles for CPU utilization, as an aid to tuning machine learning jobs.

This could be collected at the shell plugin level, with the rank 0 shell periodically posting an aggregate number as a job memo, which could then be accessed by flux jobs and flux top. The sample interval could default to some long period like a minute, and be tunable by shell option.
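As a rough illustration of the arithmetic involved, here is a minimal sketch of how a per-shell CPU% could be derived from two samples of cumulative CPU time and then combined into one aggregate number. Everything in it is an assumption made for illustration: `CpuSample`, `utilization`, and `aggregate` are hypothetical names, the core-weighted average is just one reasonable way to aggregate, and no Flux shell plugin API is used.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CpuSample:
    """One reading of cumulative CPU time for the tasks of one shell (hypothetical)."""
    timestamp: float    # seconds, from a monotonic clock
    cpu_seconds: float  # total user+system CPU time consumed so far
    ncores: int         # cores allocated to this shell


def utilization(prev: CpuSample, cur: CpuSample) -> float:
    """Return CPU% over the sample interval; 100.0 means all allocated cores were busy."""
    elapsed = cur.timestamp - prev.timestamp
    if elapsed <= 0 or cur.ncores <= 0:
        return 0.0
    used = cur.cpu_seconds - prev.cpu_seconds
    return 100.0 * used / (elapsed * cur.ncores)


def aggregate(per_shell: List[Tuple[float, int]]) -> float:
    """Core-weighted average of (percent, ncores) pairs, one pair per shell."""
    total_cores = sum(ncores for _, ncores in per_shell)
    if total_cores == 0:
        return 0.0
    return sum(pct * ncores for pct, ncores in per_shell) / total_cores
```

With a one-minute interval, a shell that holds 4 cores and consumed 120 CPU-seconds between samples would report 50%. The rank 0 shell would then only need the (percent, ncores) pair from each shell to produce the number to post.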

One challenge for flux top as it is currently implemented is that it only queries job-list after job state change events are published. Maybe we could have flux top watch for certain kinds of activity in the job manager journal instead? Or maybe job-list could provide a specialized streaming RPC for job monitoring tools.

grondo commented 2 years ago

> Maybe we could have flux top watch for certain kinds of activity in the job manager journal instead?

Is the journal accessible by guests?

I thought you had an idea for a multi-response RPC for job-list which would only reply on updates. That might be a bit challenging to implement, though.

> One challenge for flux top as it is currently implemented is that it only queries job-list after job state change events are published.

Would it be so bad to just query job-list every N seconds for now until a better solution is implemented?

garlick commented 2 years ago

Good point about journal permission!

I edited my description to include the job-list idea concurrently with your comment. Sorry about that.

> Would it be so bad to just query job-list every N seconds for now until a better solution is implemented?

Yeah, that would probably be fine for a first cut.
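For reference, the "query every N seconds" first cut could look roughly like the sketch below. It is written under stated assumptions: `fetch` is a hypothetical stand-in for whatever call returns the current job-list data (no real Flux API is used here), and `poll_jobs`, `interval`, and `max_polls` are illustrative names only.

```python
import time
from typing import Callable, Dict, Iterator, Optional


def poll_jobs(fetch: Callable[[], Dict[int, dict]],
              interval: float = 2.0,
              max_polls: Optional[int] = None,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[Dict[int, dict]]:
    """Call fetch() every `interval` seconds and yield a snapshot only when it changed.

    fetch is a hypothetical callable returning {jobid: info}; a real tool
    would wrap its job-list query in it.
    """
    previous = None
    polls = 0
    while max_polls is None or polls < max_polls:
        current = fetch()
        if current != previous:
            # Something changed (state, memo, ...), so hand it to the display.
            yield current
            previous = current
        polls += 1
        # Skip the final sleep when a poll limit has been reached.
        if max_polls is None or polls < max_polls:
            sleep(interval)
```

Yielding only on change keeps redraws cheap even though the query itself still runs on every tick, which is the cost a streaming RPC would eventually remove.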