Azure / cyclecloud-slurm

Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
MIT License

Allow local override of GPU count #198

Closed ryanhamel closed 6 months ago

ryanhamel commented 7 months ago

This addresses two cases: CycleCloud reporting the wrong number of GPUs because of incorrect backend data, and a user needing to override the reported number of GPUs. In both cases, allow the GPU count to be defined through scalelib's built-in default_resources mechanism. For example, by default we would simply ship with:

    "default_resources": [
        {
            "select": {},
            "name": "slurm_gpus",
            "value": "node.gpu_count"
        }
    ],

However, if a user wants or needs to define Standard_XYZ as having 8 GPUs, they can do so by editing /opt/azure/slurm/autoscale.json and adding a more specific entry ahead of the default:

    "default_resources": [
        {
            "select": {"node.vm_size": "Standard_XYZ"},
            "name": "slurm_gpus",
            "value": 8
        },
        {
            "select": {},
            "name": "slurm_gpus",
            "value": "node.gpu_count"
        }
    ],
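To make the intended behavior concrete, here is a minimal Python sketch of how such a list could be resolved for a single node. It is not scalelib's actual implementation: the function `resolve_resource`, the helper `_strip_node_prefix`, and the plain-dict node representation are hypothetical. It also assumes that entries are evaluated in order and that the first matching entry wins, which is what the ordering in the example (specific override first, catch-all second) suggests.

```python
from typing import Any, Dict, List, Optional

# Same entries as the override example; the order matters under the
# first-match-wins assumption used in this sketch.
DEFAULT_RESOURCES: List[Dict[str, Any]] = [
    {"select": {"node.vm_size": "Standard_XYZ"}, "name": "slurm_gpus", "value": 8},
    {"select": {}, "name": "slurm_gpus", "value": "node.gpu_count"},
]


def _strip_node_prefix(key: str) -> str:
    """Turn 'node.vm_size' into 'vm_size' (hypothetical helper)."""
    return key[len("node."):] if key.startswith("node.") else key


def resolve_resource(
    node: Dict[str, Any],
    default_resources: List[Dict[str, Any]],
    name: str,
) -> Optional[Any]:
    """Return the value of resource `name` for `node`, or None if nothing matches.

    Assumption: the first entry whose `select` matches the node wins.
    """
    for entry in default_resources:
        if entry["name"] != name:
            continue
        # An empty select ({}) matches every node.
        matches = all(
            node.get(_strip_node_prefix(key)) == expected
            for key, expected in entry["select"].items()
        )
        if not matches:
            continue
        value = entry["value"]
        # A string such as "node.gpu_count" refers to a property of the node.
        if isinstance(value, str) and value.startswith("node."):
            return node.get(_strip_node_prefix(value))
        return value
    return None


# Hypothetical nodes: the backend wrongly reports 0 GPUs for Standard_XYZ.
overridden = {"vm_size": "Standard_XYZ", "gpu_count": 0}
untouched = {"vm_size": "Standard_ABC", "gpu_count": 4}

print(resolve_resource(overridden, DEFAULT_RESOURCES, "slurm_gpus"))
print(resolve_resource(untouched, DEFAULT_RESOURCES, "slurm_gpus"))
```

Under these assumptions the Standard_XYZ node picks up the literal override, while every other VM size falls through to whatever gpu_count the backend reports.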