Azure / az-hop

The Azure HPC On-Demand Platform provides an HPC Cluster Ready solution
https://azure.github.io/az-hop/
MIT License

Dynamic partitions #1855

Open matt-chan opened 4 months ago

matt-chan commented 4 months ago

In what area(s)?

/area administration /area ansible /area autoscaling /area configuration /area cyclecloud /area documentation /area image /area job-scheduling /area monitoring /area ood /area remote-visualization /area user-management

Describe the feature

Do we expose the dynamic partitions that CycleCloud (CC) adds in 8.4? I think it would be useful if we could allocate smaller nodes when the job is smaller, e.g. running a 4-CPU job on HB16 rather than HB120.

matt-chan commented 4 months ago

cc @ltalirz

xpillons commented 4 months ago

I'm not sure about the exact scenario. It adds a lot of complexity, and I'm not sure of the value it provides.

ltalirz commented 4 months ago

I think what Matt is saying here is:

For those VM series where Azure provides breakdowns into different sizes (e.g. NC24ads A100 v4, NC48ads A100 v4, NC96ads A100 v4), bundle those in one partition and then, based on the number of CPUs/GPUs requested, have Slurm request the smallest one that fulfils the requirements of the job.

It does not really apply to the HB series, since the smaller sizes there only restrict the number of available CPU cores at the same price, but it would also apply to e.g. the F series.
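
To make the selection rule concrete, here is a minimal Python sketch of "pick the smallest size that fulfils the job's requirements". It is only an illustration of the idea, not how CycleCloud or Slurm implements dynamic partitions: `VmSize`, `NC_A100_V4` and `smallest_fitting_size` are hypothetical names, and the size list would in practice come from the partition definition.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class VmSize:
    """Hypothetical description of one VM size bundled in a partition."""
    name: str
    cpus: int
    gpus: int


# vCPU and GPU counts of the NC A100 v4 sizes mentioned above.
NC_A100_V4 = [
    VmSize("Standard_NC24ads_A100_v4", cpus=24, gpus=1),
    VmSize("Standard_NC48ads_A100_v4", cpus=48, gpus=2),
    VmSize("Standard_NC96ads_A100_v4", cpus=96, gpus=4),
]


def smallest_fitting_size(
    sizes: Sequence[VmSize], cpus: int, gpus: int = 0
) -> Optional[VmSize]:
    """Return the smallest size with enough CPUs and GPUs, or None if none fits."""
    candidates = [s for s in sizes if s.cpus >= cpus and s.gpus >= gpus]
    # "Smallest" is taken to mean fewest CPUs, then fewest GPUs (an assumption).
    return min(candidates, key=lambda s: (s.cpus, s.gpus), default=None)


if __name__ == "__main__":
    for cpus, gpus in [(4, 1), (30, 1), (8, 4), (200, 0)]:
        size = smallest_fitting_size(NC_A100_V4, cpus, gpus)
        print(cpus, gpus, size.name if size else "no size fits")
```

A real implementation would probably rank by price rather than core count, which is also why the rule brings nothing for series where every size costs the same.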

matt-chan commented 4 months ago

Ah, I forgot that the HB series carries the same price across all sizes. Yes, for the scenarios where you only want part of the node I think this might be useful, although under heavy load I think the cost savings will shrink or disappear. It can still provide better isolation between jobs though (e.g. one bad job can no longer fill up /tmp for the others).