flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

job manager: track average and maximum job wait time #5909

Open garlick opened 4 months ago

garlick commented 4 months ago

Problem: average scheduler wait time could be a useful metric for evaluating the performance of different scheduling algorithms.

Tuning EASY-Backfilling Queues (Lelong et al., JSSPP 2017) uses average and maximum job wait time, in combination with job traces from the Parallel Workloads Archive, to evaluate various backfill scheduling optimizations.

Average and maximum wait time would be really easy to add to the job manager's flux module stats output, where the wait time for any given job is just the time spent in ALLOC state. Since the job manager replays all stored jobs' eventlogs on restart, the stats could easily be kept up to date, with purged jobs dropping off each time Flux restarts.
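
To make the bookkeeping concrete, here is a rough Python sketch of running average/maximum wait-time stats rebuilt by replaying per-job eventlogs. It is not flux-core code: `WaitStats`, `replay`, and the event names `wait-begin` / `wait-end` are hypothetical placeholders for whatever events bracket the waiting state, and each eventlog is assumed to be a list of `(timestamp, name)` pairs.

```python
from dataclasses import dataclass


@dataclass
class WaitStats:
    """Running count, sum, and maximum of job wait times (seconds)."""
    count: int = 0
    total: float = 0.0
    maximum: float = 0.0

    def add(self, wait: float) -> None:
        self.count += 1
        self.total += wait
        self.maximum = max(self.maximum, wait)

    @property
    def average(self) -> float:
        # Report 0.0 rather than dividing by zero when no job has finished waiting.
        return self.total / self.count if self.count else 0.0


def replay(eventlogs, begin="wait-begin", end="wait-end"):
    """Rebuild stats from scratch, as would happen on restart.

    `begin` and `end` are placeholder event names (an assumption), not
    names taken from Flux.
    """
    stats = WaitStats()
    for events in eventlogs:
        started = None
        for timestamp, name in events:
            if name == begin:
                started = timestamp
            elif name == end and started is not None:
                stats.add(timestamp - started)
                started = None
    return stats


logs = [
    [(10.0, "wait-begin"), (25.0, "wait-end")],  # waited 15 s
    [(12.0, "wait-begin"), (17.0, "wait-end")],  # waited 5 s
    [(30.0, "wait-begin")],                      # still waiting, not counted
]
stats = replay(logs)
print(stats.count, stats.average, stats.maximum)  # prints: 2 10.0 15.0
```

Because the stats are recomputed only from the eventlogs that are still stored, a purged job contributes nothing after the next replay, which matches the "dropping off on restart" behavior described above.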

garlick commented 4 months ago

Another job manager metric that would be easy to capture, and that could give us insight into the impact of things like partial release, is node-level resource utilization, e.g. the average fraction of time each node spends in idle / offline / running / system (where system includes time spent in CLEANUP state).
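
As a rough illustration of the utilization metric, the sketch below computes the fraction of time one node spends in each state from a time-ordered list of state transitions. The function name `state_fractions` and the `(timestamp, state)` input shape are assumptions made for this example only; how the job manager would actually record node state changes is not specified here.

```python
from collections import defaultdict


def state_fractions(transitions, end_time):
    """Return {state: fraction of elapsed time} for a single node.

    `transitions` is a time-sorted list of (timestamp, state) pairs; each
    state lasts until the next transition, and the last one until `end_time`.
    """
    if not transitions:
        return {}
    durations = defaultdict(float)
    following = transitions[1:] + [(end_time, None)]
    for (start, state), (stop, _) in zip(transitions, following):
        durations[state] += stop - start
    total = end_time - transitions[0][0]
    if total <= 0:
        return {}
    return {state: spent / total for state, spent in durations.items()}


# Hypothetical node history over a 100 second window.
node = [
    (0.0, "idle"),
    (10.0, "running"),
    (70.0, "system"),   # e.g. time attributed to jobs in CLEANUP
    (80.0, "idle"),
    (90.0, "offline"),
]
fractions = state_fractions(node, end_time=100.0)
for state in ("idle", "offline", "running", "system"):
    print(f"{state}: {fractions[state]:.2f}")
```

Averaging these per-node fractions across all nodes would give the instance-wide numbers; comparing the system fraction with and without partial release is one way the metric could be used.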