GreenScheduler / cats

CATS: the Climate-Aware Task Scheduler :cat2: :tiger2: :leopard:
https://greenscheduler.github.io/cats/
MIT License

SLURM plugin #40

Open andreww opened 1 year ago

andreww commented 1 year ago

At some point it would be nice to use carbon intensity to help schedule tasks on HPC clusters. In principle the 'backend' of cats could help with this, and the obvious approach is to plug into SLURM somehow. For example, on an underused cluster, it may be best to run user jobs only during periods of low carbon intensity and let the queue build up when carbon intensity is high. We would presumably need to build a SLURM plugin (https://slurm.schedmd.com/plugins.html) and work with a team managing a cluster. This issue is to keep track of ideas around this.
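As a rough illustration of the "run when intensity is low, queue when it is high" idea, here is a minimal Python sketch. It is not CATS code and not a SLURM plugin: the function names, the threshold, and the utilisation cut-off are all hypothetical, and the forecast is assumed to be a list of (timestamp, gCO2/kWh) points at equal intervals.

```python
def best_start(forecast, steps):
    """Index of the window of `steps` consecutive forecast points
    with the lowest total (and so lowest mean) carbon intensity.

    `forecast` is a list of (timestamp, intensity) pairs at equal spacing.
    """
    if steps <= 0 or steps > len(forecast):
        raise ValueError("job length must fit inside the forecast")
    values = [intensity for _, intensity in forecast]
    window = sum(values[:steps])
    best_sum, best_idx = window, 0
    # Slide the window one point at a time, updating the running sum.
    for i in range(1, len(values) - steps + 1):
        window += values[i + steps - 1] - values[i - 1]
        if window < best_sum:
            best_sum, best_idx = window, i
    return best_idx


def should_hold(current_intensity, threshold, utilisation, max_utilisation=0.5):
    """Hold queued jobs only when intensity is high AND the cluster is
    underused (hypothetical policy; both cut-offs are assumptions)."""
    return current_intensity > threshold and utilisation < max_utilisation
```

For example, with intensities 300, 250, 100, 120, 280 and a job two points long, `best_start` picks index 2. A real plugin would also need to respect priorities, fairshare and maximum wait times, which this sketch ignores.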

sadielbartholomew commented 1 year ago

On this topic, a colleague of mine @dlrhodson has made a nice suggestion:

> I wonder if this [CATS] could be used to nudge HPC scheduling peaks to low CO2 intensity periods? I only just discovered Archer logs the energy consumed per job: https://docs.archer2.ac.uk/user-guide/energy/ ! ... I guess there's a bit of tension with machine utilization, but I bet that there is a peak in jobs that is somewhere in 9-5pm, just because that is when folk submit tasks - hence a weak coupling with mean daily working patterns - if a scheduler like CATS could nudge this peak to a lower emissions time, it could have a big effect?

For essential context, ARCHER2 uses SLURM as its scheduler, so the plugin proposed here would be a means to this end. Of note is (quoted from the link above):

> Energy usage for a particular job may be obtained using the sacct command

and also:

> On compute nodes, the raw energy counters and instantaneous power draw data are available at: /sys/cray/pm_counters

so the information needed for the --jobinfo option is readily available, provided we can interface CATS with wherever that data is stored.
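To make the sacct route concrete, here is a hedged Python sketch of pulling per-step energy for a job and turning it into an emissions estimate. It assumes the cluster has energy accounting enabled so that sacct's ConsumedEnergyRaw field (joules) is populated; the helper names are hypothetical, and how the result would be fed into CATS is left open.

```python
import subprocess

JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def sacct_energy_command(job_id):
    """Build a sacct call returning 'JobID|ConsumedEnergyRaw' lines."""
    return ["sacct", "-j", str(job_id),
            "--format=JobID,ConsumedEnergyRaw",
            "--noheader", "--parsable2"]


def parse_energy(output):
    """Map each job step to its energy in joules, skipping steps
    where sacct reports no (or a non-numeric) value."""
    energy = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        step, _, raw = line.partition("|")
        raw = raw.strip()
        if raw.isdigit():
            energy[step.strip()] = int(raw)
    return energy


def emissions_g(joules, intensity_g_per_kwh):
    """Estimated emissions in gCO2 for a given energy and grid intensity."""
    return joules / JOULES_PER_KWH * intensity_g_per_kwh


def job_energy(job_id):
    """Run sacct (requires a SLURM installation) and parse its output."""
    result = subprocess.run(sacct_energy_command(job_id),
                            capture_output=True, text=True, check=True)
    return parse_energy(result.stdout)
```

Calling `job_energy` only works on a machine with SLURM; `parse_energy` and `emissions_g` can be tried on any captured sacct output. Whether the job-level line or the individual steps should be used as the total is site-dependent and not settled here.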

colinsauze commented 1 year ago

We've been talking with the SSI about trying to find some target HPC systems to do exactly this. I'm not so sure about a 9-5 peak though: most HPC systems run near 100% load most of the time, and many jobs last long enough to keep them busy all night. We did some analysis on this in Supercomputing Wales and found the system was quietest from Sunday evening to the middle of Monday, as most of the jobs submitted on Friday had finished by Sunday evening and people took a few hours on Monday morning to start submitting new jobs.

Our ideal target system might be something a bit less busy, more likely a departmental cluster or a high-throughput system.
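For anyone wanting to check the load pattern on their own system, a sketch of this kind of analysis might look like the following. This is not the method used in the Supercomputing Wales analysis; it simply assumes job records are available as (start, end, cores) tuples, for instance exported from accounting data, and the function names are made up for illustration.

```python
from collections import defaultdict
from datetime import timedelta


def load_by_hour_of_week(jobs):
    """Accumulate core-hours per (weekday, hour) slot.

    `jobs` is an iterable of (start, end, cores) with datetime start/end;
    weekday follows datetime.weekday(), so Monday is 0 and Sunday is 6.
    """
    load = defaultdict(float)
    for start, end, cores in jobs:
        t = start
        while t < end:
            # Split the job at each hour boundary.
            next_hour = t.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
            chunk_end = min(next_hour, end)
            hours = (chunk_end - t).total_seconds() / 3600
            load[(t.weekday(), t.hour)] += cores * hours
            t = chunk_end
    return dict(load)


def quietest_slot(load):
    """Return the (weekday, hour) slot with the fewest core-hours;
    slots with no recorded jobs count as zero."""
    slots = [(day, hour) for day in range(7) for hour in range(24)]
    return min(slots, key=lambda slot: load.get(slot, 0.0))
```

Summing over many weeks of records gives a weekly load profile that could be compared against typical carbon intensity at the same hours to see whether there is any slack worth shifting.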

colinsauze commented 6 months ago

Setting this up as an issue to cover WP2 in the CATSv2 project.