Azure / az-hop

The Azure HPC On-Demand Platform provides an HPC Cluster Ready solution
https://azure.github.io/az-hop/
MIT License

Test queues #1172

Status: Open · matt-chan opened 2 years ago

matt-chan commented 2 years ago

In what area(s)?

/area administration /area ansible /area autoscaling /area configuration /area cyclecloud /area documentation /area image /area job-scheduling /area monitoring /area ood /area remote-visualization /area user-management

Describe the feature

Hi Xavier,

It would be great if we could set up a few test queues in az-hop. This would let our users run quick jobs without having to wait for node spin-up time.

Currently, I'm approximating the behavior by setting a large idle time on some queues, but it would be nice to have a setting that actually keeps the nodes alive forever, using the Slurm setting described here: https://learn.microsoft.com/en-us/azure/cyclecloud/slurm?view=cyclecloud-8#excluding-a-partition. Another common feature of such test queues is a short job time limit. I don't see a way to set this from CycleCloud right now, even though it is in /etc/slurm/cyclecloud.conf.
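
For reference, the short job time limit at least corresponds to an ordinary Slurm partition setting. A minimal slurm.conf sketch, with an invented partition and node naming (on az-hop these files are generated by CycleCloud, so this is illustration only, not a supported az-hop knob):

# cap jobs on the test partition at 15 minutes
PartitionName=test Nodes=test-[1-4] MaxTime=00:15:00 State=UP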

Thanks! Matt

xpillons commented 2 years ago

@matt-chan instead of excluding nodes from the partition, I'm thinking of having a parameter to define how many cores/VMs should always be on for each queue/partition.

matt-chan commented 2 years ago

Hi Xavier, yes, I think that behavior would be best if we could achieve it, but I'm not certain it's possible. I originally tried to make a PR before filing this feature request, but I couldn't figure out how to do it. I'm not sure CycleCloud and Slurm have that functionality.

Your team is definitely better at this stuff than I am. If you can figure it out, it would be a great feature! Just to make sure we're on the same page: it's the number of idle VMs we want to keep in each queue, right? So if there are 5 jobs and the idle setting is 2 VMs, there should be 7 VMs running in total?

xpillons commented 2 years ago

@matt-chan the way it works is that it will always keep x nodes running. If they are filled by jobs, then new nodes will be added up to the quota defined for that queue/partition. I'm afraid that always keeping yy nodes on top of the allocated ones is not possible today.

ltalirz commented 1 year ago

@xpillons I have now implemented a simple solution for this. The following script runs as a cron job every 5 minutes on weekdays (I have it on the ondemand VM, but I guess it should move to the scheduler VM). I think it is self-explanatory:

#!/bin/bash
# Usage: ./warmup-queues.sh viz hb2la
set -e

# Slurm node states & state flags on az-hop:
#   idle   VM allocated and idling
#   idle~  VM not allocated from Azure
#   idle#  VM being allocated from Azure
#   idle%  VM being powered down
#   mix    some CPUs allocated, but not all

for queue in "$@"; do
  # powered-on nodes: idle or mix, excluding the ~/#/% power-state flags
  available=$(sinfo -p "$queue" --states=mix,idle --noheader | grep -v 'idle[~#%]' | wc -l)
  # nodes currently being powered up
  allocating=$(sinfo -p "$queue" --states=idle --noheader | grep 'idle#' | wc -l)

  if [[ $available -eq 0 && $allocating -eq 0 ]]; then
    echo "Allocating 1 node on queue $queue"
    # a throwaway interactive job triggers a node allocation; cancel it right away
    srun --partition "$queue" bash > /dev/null 2>&1 &
    PID=$!
    sleep 2
    set +e
    kill $PID
    set -e
  elif [[ $available -gt 0 ]]; then
    # "touch" one available node so that Slurm won't deallocate it after the idle timeout
    set +e
    srun --partition "$queue" true > /dev/null 2>&1 &
    set -e
  fi
done

The admin can set a warmup field on any queue in config.yml. These queues are passed as arguments to the cronjob.
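
For illustration, the config.yml side might look like this (a minimal sketch; warmup is the new flag, the queue names are examples, and the usual az-hop queue fields are omitted):

queues:
  - name: viz
    # ... existing queue settings ...
    warmup: true
  - name: hb2la
    # ... existing queue settings ...
    warmup: true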

Let me know if you are interested in a PR for this

P.S. This creates one extra job every 5 minutes per queue. There may be more "official" ways of doing this via the Slurm power-save config (https://slurm.schedmd.com/power_save.html#config), but it already does the job.
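
For reference, the power-save settings in question are plain slurm.conf options, roughly along these lines (a hedged sketch; the partition name is an example, and on az-hop this file is managed by CycleCloud):

SuspendTime=600       # seconds a node may sit idle before being powered down
SuspendExcParts=viz   # partitions whose nodes are never powered down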

xpillons commented 1 year ago

@ltalirz that sounds like a great start. It needs to run on the scheduler. Also, ideally it should read the config file and pick up the partition names and the number of nodes to allocate.

ltalirz commented 1 year ago

Also, ideally it should read the config file and pick up the partition names

This is already how it works; the cronjob is:

    - name: set up cronjob for queue warmup
      cron:
        name: "queue-warmup"
        job: "/usr/local/sbin/queue-warmup.sh {{ warmup_queues | map(attribute='name') | join(' ') }}"
        minute: "*/5"
        weekday: 1-5
        user: "root"
        state: "present"
      vars:
        warmup_queues: "{{ queues | selectattr('warmup', 'defined') | selectattr('warmup', 'equalto', true) }}"

Keeping more than one warm node will require some modifications (more nodes need to be "touched"), but it should be doable. In practice, one idling node at all times is already a great improvement in user experience and often all you need.
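
For completeness, here is a rough, untested sketch of what that extension could look like, assuming a hypothetical queue:count argument format and reusing the srun tricks from the script above:

#!/bin/bash
# Hypothetical sketch: keep N warm nodes per queue, e.g. ./warmup-queues.sh viz:2 hb2la:1
set -e

for spec in "$@"; do
  queue=${spec%%:*}   # partition name
  want=${spec##*:}    # target number of warm nodes

  # powered-on nodes: idle or mix, excluding the ~/#/% power-state flags
  have=$(sinfo -p "$queue" --states=mix,idle --noheader | grep -v 'idle[~#%]' | wc -l)
  # nodes already being powered up
  starting=$(sinfo -p "$queue" --states=idle --noheader | grep 'idle#' | wc -l)

  missing=$(( want - have - starting ))
  pids=()
  for (( i = 0; i < missing; i++ )); do
    # each exclusive throwaway job forces Slurm to power up one more node
    srun --exclusive --partition "$queue" bash > /dev/null 2>&1 &
    pids+=($!)
  done
  if (( ${#pids[@]} > 0 )); then
    sleep 2
    kill "${pids[@]}" 2> /dev/null || true
  fi

  if (( have > 0 )); then
    # touch a running node so the idle timeout doesn't deallocate it
    srun --partition "$queue" true > /dev/null 2>&1 &
  fi
done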