Azure / az-hop

The Azure HPC On-Demand Platform provides an HPC Cluster Ready solution
https://azure.github.io/az-hop/
MIT License

Minimum config to explore azhop #1705

Open mkbane opened 1 year ago

mkbane commented 1 year ago

Hi, related to issue #1692, I wanted to step back and ask about the minimum expectation of azhop (I can't quite pin it down from the online materials/tutorials/documentation) and thus the minimum Azure config required to run azhop.

Let me put my reason for looking at azhop up front, to give some context. I'm looking for an HPC cluster to teach some parallel programming techniques, and doing so in Azure seems cheaper than buying a cluster just for teaching in one semester. So, let's say a virtual equivalent of 1 login node and 4 compute nodes, for compiling and running C codes (from the CLI) with MPI/OpenMP to allow inter-/intra-node parallelisation, with SLURM running on the login node. For teaching how to write codes, check scaling etc., each compute node could be, say, 16c, with the login node identical. That gives a total of 5 nodes of 16c, which would be 80c in total. I'd like reasonable chipsets but I'm not particular about which type of CPU (instance), since it's to show techniques and what's possible.

I have 2 labs each week and students would do some work in their own time. So the full set of nodes does not need to be running 24/7, and it'd be nice to use the cloud's auto-scaling to spin up compute nodes only when needed (which will depend on how congested the SLURM queue is becoming). Happy to expound on this / discuss alternatives. I'm no cloud expert btw.

In terms of azhop, from the azhop pre-requisites it appears the minimum Azure requirement is 10c of standard BS + 4c of standard DSv5 + (no Lustre) + code server: 10c of standard FSv2 + compute nodes: 220c of standard HC44rs + remote viz: 24c of standard NVs_v3 (although I know I don't need a code server or remote viz). This comes to 268 cores and, as far as I can tell, many of those instance families are out of scope of the free Azure trial (i.e. it's impossible to trial azhop for free using the $200 of free Azure credits available).
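The core counts above can be checked with a few lines of Python. This is only a minimal sketch of the arithmetic: the dictionary keys are informal labels for the pre-requisite entries quoted in this comment (an assumption for illustration, not official Azure quota names), and the per-node core count is the 16c figure suggested above.

```python
# vCPU counts as quoted from the azhop pre-requisites in this comment.
# The keys are informal labels, not official Azure quota identifiers.
prerequisite_vcpus = {
    "Standard BS": 10,
    "Standard DSv5": 4,
    "Standard FSv2 (code server)": 10,
    "Standard HC44rs (compute nodes)": 220,
    "Standard NVs_v3 (remote viz)": 24,
}

# The teaching cluster described above: 1 login node + 4 compute nodes,
# each with 16 cores.
teaching_nodes = 1 + 4
cores_per_node = 16

prerequisite_total = sum(prerequisite_vcpus.values())
teaching_total = teaching_nodes * cores_per_node

print(f"azhop pre-requisites: {prerequisite_total} vCPUs")
print(f"teaching cluster:     {teaching_total} cores")
```

The gap between the two totals is what the question below is about.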
I was, most naively, wondering why so many cores/nodes/instances are needed to set up a minimal azhop if all that is required (for example) is 1 login + 4 compute nodes... (I can see why an AD node might be needed, but couldn't even that service sit on a login node?) Looking forward to understanding more!

Yours, Michael

themorey commented 1 year ago

Hello Michael, there is some base Azure infrastructure required to run AZHOP, i.e. VMs and storage. However, all the compute in the queues section (of the config.yml file) can be commented out or edited to meet your needs. Based on what you outlined above, you would need something like this:

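| Role | VM Size | Qty | Total vCPUs |
| --- | --- | --- | --- |
| Jumpbox | B2ms | 1 | 2 |
| AD | B2ms | 1 | 2 |
| Ondemand | D4s_v5 | 1 | 4 |
| Scheduler | B2ms | 1 | 2 |
| CycleCloud | B2ms | 1 | 2 |
| Grafana | B2ms | 1 | 2 |
| Compute | F16s_v2 | 5 | 80 |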

This would be the base AZHOP config you would need. The Ondemand node provides a web portal using Open OnDemand and would be your login node. The Compute nodes would be autoscaled on and off as jobs are submitted. For storage (i.e. the home mount) I would use azurefiles, as it's cheaper and has a smaller initial size compared to anf (i.e. 100GiB vs 2TiB).
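Since Azure compute quotas are counted in vCPUs per VM family, it can help to tally the table above by family before requesting quota. The following is a minimal sketch, not part of azhop: the `family_of` mapping uses informal family labels (an assumption for grouping; the names shown on the Azure quota page may differ), and the per-VM vCPU counts are those of B2ms (2), D4s_v5 (4) and F16s_v2 (16).

```python
from collections import defaultdict

# (role, VM size, quantity, vCPUs per VM) for the sizing suggested above.
deployment = [
    ("Jumpbox", "B2ms", 1, 2),
    ("AD", "B2ms", 1, 2),
    ("Ondemand", "D4s_v5", 1, 4),
    ("Scheduler", "B2ms", 1, 2),
    ("CycleCloud", "B2ms", 1, 2),
    ("Grafana", "B2ms", 1, 2),
    ("Compute", "F16s_v2", 5, 16),
]

# Informal family labels, assumed here only to group the sizes.
family_of = {"B2ms": "BS", "D4s_v5": "DSv5", "F16s_v2": "FSv2"}


def vcpus_by_family(rows):
    """Sum quantity * vCPUs-per-VM for each VM family."""
    totals = defaultdict(int)
    for _role, size, qty, vcpus_each in rows:
        totals[family_of[size]] += qty * vcpus_each
    return dict(totals)


totals = vcpus_by_family(deployment)
for family, vcpus in totals.items():
    print(f"{family}: {vcpus} vCPUs")
print(f"total: {sum(totals.values())} vCPUs")
```

The FSv2 figure is the peak with all five compute nodes running; with autoscaling, it is only consumed while jobs are active.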

Alternatively, you could manually deploy CycleCloud from the Azure Marketplace and use it to create a Slurm cluster. This would remove the Jumpbox, AD, Ondemand and Grafana VMs but still give you the compute node autoscaling. It would not have the web GUI; instead you would use SSH to access the Login node. For comparison:

| Role | VM Size | Qty | Total vCPUs |
| --- | --- | --- | --- |
| Scheduler | B2ms | 1 | 2 |
| CycleCloud | B2ms | 1 | 2 |
| Login | B2ms | 1 | 2 |
| Compute | F16s_v2 | 5 | 80 |
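To make the comparison concrete, here is a small sketch that contrasts the infrastructure footprint of the two options. The dictionary and function names are made up for illustration; the per-role vCPU counts come from the two tables in this comment, and the compute peak assumes 5 F16s_v2 nodes at 16 vCPUs each.

```python
# Infrastructure VMs (role -> vCPUs) for each option.
azhop_infra = {
    "Jumpbox": 2,
    "AD": 2,
    "Ondemand": 4,
    "Scheduler": 2,
    "CycleCloud": 2,
    "Grafana": 2,
}
cyclecloud_infra = {
    "Scheduler": 2,
    "CycleCloud": 2,
    "Login": 2,
}

# Autoscaled compute at full size: 5 x F16s_v2 with 16 vCPUs each.
compute_peak_vcpus = 5 * 16


def peak_vcpus(infra):
    """Infrastructure vCPUs plus compute vCPUs when every node is running."""
    return sum(infra.values()) + compute_peak_vcpus


for name, infra in [("azhop", azhop_infra), ("CycleCloud only", cyclecloud_infra)]:
    print(
        f"{name}: {len(infra)} infrastructure VMs, "
        f"{sum(infra.values())} infrastructure vCPUs, "
        f"{peak_vcpus(infra)} vCPUs at peak"
    )
```

Either way the compute nodes dominate the peak; the difference between the options is in the infrastructure VMs.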

hth, Jerry

mkbane commented 1 year ago

Hi Jerry, that is most useful. The list you give is smaller than that given in the azhop pre-requisites. However, if B2ms = standard B series v2 (not sure why Azure uses multiple names for the same resource?!) then it appears you are saying that 5 such VMs (equating to 10 vCPUs) are required, which is just over the 4 VMs that I believe are available with the $200 free credit Azure trial/subscription:

image

Similarly, I believe the screenshot below is saying only 4 nodes (VMs) of F series v2 are available to this free subscription:

image

Yours, Michael