Azure / cyclecloud-slurm

Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
MIT License

Allow local override of GPUs #206

Closed ryanhamel closed 6 months ago

ryanhamel commented 6 months ago

Per the README.md additions: for some regions and VM sizes, some subscriptions may report an incorrect number of GPUs. This value is controlled in `/opt/azure/slurm/autoscale.json`.

The default definition looks like the following:

```json
  "default_resources": [
    {
      "select": {},
      "name": "slurm_gpus",
      "value": "node.gpu_count"
    }
  ],
```

Note that this says, "for every VM size in every nodearray, create a resource called `slurm_gpus` whose value is the `gpu_count` that CycleCloud reports".

A common solution is to add a specific override for the affected VM size, in this case setting 8 GPUs. Note that ordering is critical: the blank select clause matches every VM size, so any definitions listed after it are ignored. For more information on how `default_resources` work, see the documentation for ScaleLib, the underlying library used by all CycleCloud autoscalers.

```json
  "default_resources": [
    {
      "select": {"node.vm_size": "Standard_XYZ"},
      "name": "slurm_gpus",
      "value": 8
    },
    {
      "select": {},
      "name": "slurm_gpus",
      "value": "node.gpu_count"
    }
  ],
```
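To see why the ordering matters, here is a minimal, hypothetical sketch of first-match resource resolution. This is not the actual scalelib code; the function and attribute names are illustrative only:

```python
def resolve_resource(node, definitions):
    """Return the value of the first definition whose select clause
    matches the node; all later definitions are ignored."""
    for d in definitions:
        # A blank select ({}) matches every node.
        if all(node.get(attr.removeprefix("node.")) == want
               for attr, want in d["select"].items()):
            value = d["value"]
            # String values like "node.gpu_count" refer to node attributes.
            if isinstance(value, str) and value.startswith("node."):
                return node[value.removeprefix("node.")]
            return value
    return None

definitions = [
    {"select": {"node.vm_size": "Standard_XYZ"}, "name": "slurm_gpus", "value": 8},
    {"select": {}, "name": "slurm_gpus", "value": "node.gpu_count"},
]

# Subscription misreports 1 GPU, but the override wins because it comes first.
node = {"vm_size": "Standard_XYZ", "gpu_count": 1}
print(resolve_resource(node, definitions))  # prints 8
```

Because matching stops at the first definition whose select clause applies, putting the blank select first would make the `Standard_XYZ` override unreachable.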

Simply run `azslurm scale` again for the changes to take effect. If you need to iterate on this, you can also run `azslurm partitions`, which writes the partition definitions to stdout; this output matches what ends up in `/etc/slurm/azure.conf` after `azslurm scale` is run.

Fixes: #198