accel-sim / accel-sim-framework

This is the top-level repository for the Accel-Sim framework.
https://accel-sim.github.io

Consolidate calls to slurm commands #237

Open FJShen opened 1 year ago

FJShen commented 1 year ago

util/job_launching/job_status.py makes too many calls to the Slurm monitoring commands squeue, sacct, and sstat. For every job, it makes one call to squeue and one call to either sacct or sstat every 30 seconds. When Jenkins regression tests are running, this produces a large number of invocations, which goes against the Slurm developers' advice to call these commands sparingly (see https://slurm.schedmd.com/squeue.html#SECTION_PERFORMANCE, https://slurm.schedmd.com/sacct.html#SECTION_PERFORMANCE, and https://slurm.schedmd.com/sstat.html#SECTION_PERFORMANCE). This is suspected to be one cause of the sluggishness experienced on the cluster.
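One possible direction, sketched below, is to query all tracked jobs with a single squeue invocation per polling interval instead of one invocation per job. This is only an illustration and not the current job_status.py code: the function names (build_squeue_cmd, parse_squeue_output, query_job_states) and the injectable run parameter are hypothetical. It relies on squeue's --jobs, --noheader, and --format options, with %i for the job ID and %T for the job state.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def build_squeue_cmd(job_ids: Iterable[str]) -> List[str]:
    """Build one squeue command covering every job ID (hypothetical helper)."""
    return [
        "squeue",
        "--noheader",              # no header line, so every line is a job
        "--format=%i|%T",          # job ID and state, separated by '|'
        "--jobs=" + ",".join(job_ids),
    ]


def parse_squeue_output(text: str) -> Dict[str, str]:
    """Turn 'id|STATE' lines into a {job_id: state} dictionary."""
    states: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, _, state = line.partition("|")
        states[job_id] = state
    return states


def _run(cmd: List[str]) -> str:
    """Run a command and return its stdout (errors are not raised here)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


def query_job_states(
    job_ids: Iterable[str],
    run: Callable[[List[str]], str] = _run,
) -> Dict[str, str]:
    """Query the state of all given jobs with a single squeue call."""
    job_ids = list(job_ids)
    if not job_ids:
        return {}
    return parse_squeue_output(run(build_squeue_cmd(job_ids)))
```

Jobs that have already left the queue will be missing from the returned dictionary, so a follow-up lookup would still be needed for those. That lookup could be batched the same way, since sacct also accepts a comma-separated job list via --jobs. The run parameter exists only so the parsing logic can be exercised without a Slurm installation.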