GreenScheduler / cats

CATS: the Climate-Aware Task Scheduler :cat2: :tiger2: :leopard:
https://greenscheduler.github.io/cats/
MIT License

Simplify configuration of job information #79

Closed tlestang closed 6 months ago

tlestang commented 6 months ago

At the moment, information on the job is passed to cats via the jobinfo CLI argument:

partition=CPU_partition,memory=8,ncpus=8,ngpus=0

Information relating to hardware is assumed to be specified in the config file, e.g.

PUE: 1.20 # > 1
partitions:
  CPU_partition:
    type: CPU # CPU or GPU
    model: "Xeon Gold 6142"
    TDP: 9.4 # in W, per core

After looking closely at carbonFootprint.py, I think the information required to estimate the carbon footprint boils down to the number of devices and their power consumption.

This PR is about simplifying the configuration file and its processing, with the intent of simplifying the carbonFootprint.py module downstream. The suggested configuration structure is

location: "EH8"
api: "carbonintensity.org.uk"
PUE: 1.20 # > 1
profiles:
  CPU_partition: # Arbitrary name for first profile. First profile is also the default profile
    cpu:
      model: "Xeon Gold 6142"
      power: 9.4 # in W, per core
      nunits: 2
  GPU_queue:
    gpu:
      model: "NVIDIA A100-SXM-80GB GPUs" 
      power: 300 
      nunits: 2
    cpu:
      model: "AMD EPYC 7763" 
      power: 4.4 
      nunits: 1

You can then specify the profile to use for the footprint estimation at the command line. The footprint estimation is activated using the --footprint flag.

$ cats -d 180 --footprint -p GPU_queue --memory 8
$ cats -d 180 --footprint --memory 8 # Use default profile, i.e. first profile in config
$ cats -d 180 --footprint --profile CPU_partition --cpu 8 --memory 8 # Override config to specify 8 cpus instead of 2

The memory footprint must be specified at the command line.
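The profile selection described above can be sketched in Python as follows. This is a minimal illustration, not the code in this PR: `select_profile` and its `overrides` argument are hypothetical names, and the config is written as a plain dict standing in for the parsed YAML file.

```python
# Hypothetical sketch of profile selection; not the actual cats implementation.
# The dict below stands in for the parsed YAML configuration shown above.
config = {
    "PUE": 1.20,
    "profiles": {
        "CPU_partition": {
            "cpu": {"model": "Xeon Gold 6142", "power": 9.4, "nunits": 2},
        },
        "GPU_queue": {
            "gpu": {"model": "NVIDIA A100-SXM-80GB GPUs", "power": 300, "nunits": 2},
            "cpu": {"model": "AMD EPYC 7763", "power": 4.4, "nunits": 1},
        },
    },
}


def select_profile(config, name=None, overrides=None):
    """Return (profile_name, profile) for the requested profile.

    If name is None, fall back to the first profile in the config.
    overrides maps a device type ("cpu", "gpu") to a unit count that
    replaces the configured nunits, mimicking a flag such as --cpu 8.
    """
    profiles = config["profiles"]
    if name is None:
        # dicts preserve insertion order, so this is the first profile listed
        name = next(iter(profiles))
    if name not in profiles:
        raise KeyError(f"unknown profile: {name!r}")
    # Copy each device entry so the loaded config is left untouched
    profile = {device: dict(spec) for device, spec in profiles[name].items()}
    for device, nunits in (overrides or {}).items():
        if device not in profile:
            raise KeyError(f"profile {name!r} has no {device!r} entry")
        profile[device]["nunits"] = nunits
    return name, profile


default_name, default_profile = select_profile(config)
_, overridden = select_profile(config, "CPU_partition", {"cpu": 8})
```

Here `default_name` is `"CPU_partition"` because it is listed first, and `overridden["cpu"]["nunits"]` is 8 while the config itself still says 2.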

The job info is processed by configure.get_job_info, which returns a list of (nunits, power) tuples with one element per power-consuming device. So if you have 2 CPUs, 4 GPUs and 8 GB of memory, the jobinfo is, assuming 0.4 W/GB for memory:

[(2, 9.4), (4, 300), (8, 0.4)]
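As a rough sketch of how such a list could be built and used, assuming a profile dict shaped like the configuration above: `build_jobinfo` and `total_power_watts` are hypothetical helpers, not functions from cats, and the 0.4 W/GB memory figure is the assumption stated in this description.

```python
# Hypothetical helpers illustrating the (nunits, power) list; not the cats API.
MEMORY_POWER_W_PER_GB = 0.4  # assumed memory power draw, as in the example above


def build_jobinfo(profile, memory_gb):
    """Return one (nunits, power) tuple per power-consuming device."""
    jobinfo = [(spec["nunits"], spec["power"]) for spec in profile.values()]
    # Memory is treated as one more device: units are GB, power is W per GB
    jobinfo.append((memory_gb, MEMORY_POWER_W_PER_GB))
    return jobinfo


def total_power_watts(jobinfo):
    """Sum units times per-unit power over all devices."""
    return sum(nunits * power for nunits, power in jobinfo)


# 2 CPUs at 9.4 W each, 4 GPUs at 300 W each, 8 GB of memory
profile = {
    "cpu": {"power": 9.4, "nunits": 2},
    "gpu": {"power": 300, "nunits": 4},
}
jobinfo = build_jobinfo(profile, memory_gb=8)
print(jobinfo)  # [(2, 9.4), (4, 300), (8, 0.4)]
print(total_power_watts(jobinfo))
```

With this representation the downstream footprint code only needs the list of tuples, whatever the mix of devices in the profile.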

Currently the jobinfo list returned by configure.get_runtime_config is not used, and args.jobinfo still is. Switching over to it is left to a subsequent PR (see #80) in order to limit the amount of changes contributed here.

tlestang commented 6 months ago

Need to update the docs before this goes in

@abhidg Are you gonna be okay to merge/rebase #78 if this is merged first?