Azure / cyclecloud-pbspro

Example Azure CycleCloud PBSpro cluster type
MIT License
12 stars 21 forks source link

Azure CycleCloud OpenPBS project

OpenPBS is a highly configurable open source workload manager. See the OpenPBS project site for an overview and the PBSpro documentation for more information on using, configuring, and troubleshooting OpenPBS in general.

Versions

OpenPBS (formerly PBS Professional OSS) is released as part of version 20.0.0. PBSPro OSS is still available in CycleCloud by specifying the PBSPro OSS version.

   [[[configuration]]]
   pbspro.version = 18.1.4-0

Installing Manually

Note: When using the cluster that is shipped with CycleCloud, the autoscaler and default queues are already installed.

First, download the installer pkg from GitHub. For example, you can download the 2.0.24 release here

# Prerequisite: python3, 3.6 or newer, must be installed and in the PATH
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.24/cyclecloud-pbspro-pkg-2.0.24.tar.gz
tar xzf cyclecloud-pbspro-pkg-2.0.24.tar.gz
cd cyclecloud-pbspro
# Optional, but recommended. Adds relevant resources and enables strict placement
./initialize_pbs.sh
# Optional. Sets up workq as a colocated, MPI focused queue and creates htcq for non-MPI workloads.
./initialize_default_queues.sh

# Creates the azpbs autoscaler
./install.sh  --venv /opt/cycle/pbspro/venv

# If you have jetpack available, you may use the following:
# ./generate_autoscale_json.sh --install-dir /opt/cycle/pbspro \
#                              --username $(jetpack config cyclecloud.config.username) \
#                              --password $(jetpack config cyclecloud.config.password) \
#                              --url $(jetpack config cyclecloud.config.web_server) \
#                              --cluster-name $(jetpack config cyclecloud.cluster.name)

# Otherwise insert your username, password, url, and cluster name here.
./generate_autoscale_json.sh --install-dir /opt/cycle/pbspro \
                             --username user \
                             --password password \
                             --url https://fqdn:port \
                             --cluster-name cluster_name

# lastly, run this to understand any changes that may be required.
# For example, you typically have to add the ungrouped and group_id resources
# to the /var/spool/pbs/sched_priv/sched_priv file and restart.
## [root@scheduler cyclecloud-pbspro]# azpbs validate
## ungrouped is not defined for line 'resources:' in /var/spool/pbs/sched_priv/sched_config. Please add this and restart PBS
## group_id is not defined for line 'resources:' in /var/spool/pbs/sched_priv/sched_config. Please add this and restart PBS
azpbs validate

Autoscale and scalesets

In order to try and ensure that the correct VMs are provisioned for different types of jobs, CycleCloud treats autoscale of MPI and serial jobs differently in OpenPBS clusters.

For serial jobs, multiple VM scalesets (VMSS) are used in order to scale as quickly as possible. For MPI jobs to use the InfiniBand fabric for those instances that support it, all of the nodes allocated to the job have to be deployed in the same VMSS. CycleCloud handles this by using a PlacementGroupId that groups nodes with the same id into the same VMSS. By default, the workq appends the equivalent of -l place=scatter:group=group_id by using native queue defaults.

Hooks

Our PBS integration uses 3 different PBS hooks. autoscale does the bulk of the work required to scale the cluster up and down. All relevant log messages can be seen in /opt/cycle/pbspro/autoscale.log. cycle_sub_hook will validate jobs unless they use -l nodes syntax, in which case those jobs are held and later processed by our last hook cycle_sub_hook_periodic.

Autoscale Hook

The most important is the autoscale plugin, which runs by default on a 15 second interval. You can adjust this frequency by running

qmgr -c "set hook autoscale freq=NUM_SECONDS"

Submission Hooks

cycle_sub_hook will validate that your job has the proper placement restrictions set. If it encounters a problem, it will output a detailed message on why the job was rejected and how to resolve the issue. For example

$> echo sleep 300 | qsub -l select=2 -l place=scatter
Please do one of the following
    1) Ensure this placement is set by adding group=group_id to your -l place= statement
        Note: Queue workq's resource_defaults.place=group=group_id
    2) Add -l skipcyclesubhook=true on this job
        Note: If the resource does not exist, create it -> qmgr -c 'create resource skipcyclesubhook type=boolean'
    3) Disable this hook for this queue via queue defaults -> qmgr -c 'set queue workq resources_default.skipcyclesubhook=true'
    4) Disable this plugin - 'qmgr -c 'set hook cycle_sub_hook enabled=false'
        Note: Disabling this plugin may prevent -l nodes= style submissions from working properly.

One important note: if you are using Torque style submissions, i.e. those that uses -l nodes instead of -l select, PBS will simply convert that submission into an equivalent -l select style submission. However, the default placement defined for the queue is not respected by PBS when converting the job. To get around this, we will hold the job and our last hook, cycle_sub_hook_periodic will periodically update the job's placement and release it.

Configuring Resources

The cyclecloud-pbspro application matches PBS resources to azure cloud resources to provide rich autoscaling and cluster configuration tools. The application will be deployed automatically for clusters created via the CycleCloud UI or it can be installed on any PBS admin host on an existing cluster. For more information on defining resources in autoscale.json, see ScaleLib's documentation.

The default resources defined with the cluster template we ship with are

{"default_resources": [
   {
      "select": {},
      "name": "ncpus",
      "value": "node.vcpu_count"
   },
   {
      "select": {},
      "name": "group_id",
      "value": "node.placement_group"
   },
   {
      "select": {},
      "name": "host",
      "value": "node.hostname"
   },
   {
      "select": {},
      "name": "mem",
      "value": "node.memory"
   },
   {
      "select": {},
      "name": "vm_size",
      "value": "node.vm_size"
   },
   {
      "select": {},
      "name": "disk",
      "value": "size::20g"
   }]
}

Note that disk is currently hardcoded to size::20g because of platform limitations to determine how much disk a node will have. Here is an example of handling VM Size specific disk size

   {
      "select": {"node.vm_size": "Standard_F2"},
      "name": "disk",
      "value": "size::20g"
   },
   {
      "select": {"node.vm_size": "Standard_H44rs"},
      "name": "disk",
      "value": "size::2t"
   }

azpbs cli

The azpbs cli is the main interface for all autoscaling behavior. Note that it has a fairly powerful autocomplete capabilities. For example, typing azpbs create_nodes --vm-size and then you can tab-complete the list of possible VM Sizes. Autocomplete information is updated every azpbs autoscale cycle, but can also be refreshed manually by running azpbs refresh_autocomplete.

Command Description
autoscale End-to-end autoscale process, including creation, deletion and joining of nodes.
buckets Prints out autoscale bucket information, like limits etc
config Writes the effective autoscale config, after any preprocessing, to stdout
create_nodes Create a set of nodes given various constraints. A CLI version of the nodemanager interface.
default_output_columns Output what are the default output columns for an optional command.
delete_nodes Deletes node, including draining post delete handling
demand Dry-run version of autoscale.
initconfig Creates an initial autoscale config. Writes to stdout
jobs Writes out autoscale jobs as json. Note: Running jobs are excluded.
join_nodes Adds selected nodes to the scheduler
limits Writes a detailed set of limits for each bucket. Defaults to json due to number of fields.
nodes Query nodes
refresh_autocomplete Refreshes local autocomplete information for cluster specific resources and nodes.
remove_nodes Removes the node from the scheduler without terminating the actual instance.
retry_failed_nodes Retries all nodes in a failed state.
shell Interactive python shell with relevant objects in local scope. Use --script to run python scripts
validate Runs basic validation of the environment
validate_constraint Validates then outputs as json one or more constraints.

azpbs buckets

Use the azpbs buckets command to see which buckets of compute are available, how many are available, and what resources they have.

azpbs buckets --output-columns nodearray,placement_group,vm_size,ncpus,mem,available_count
NODEARRAY PLACEMENT_GROUP     VM_SIZE         NCPUS MEM     AVAILABLE_COUNT
execute                       Standard_F2s_v2 1     4.00g   50             
execute                       Standard_D2_v4  1     8.00g   50             
execute                       Standard_E2s_v4 1     16.00g  50             
execute                       Standard_NC6    6     56.00g  16             
execute                       Standard_A11    16    112.00g 6              
execute   Standard_F2s_v2_pg0 Standard_F2s_v2 1     4.00g   50             
execute   Standard_F2s_v2_pg1 Standard_F2s_v2 1     4.00g   50             
execute   Standard_D2_v4_pg0  Standard_D2_v4  1     8.00g   50             
execute   Standard_D2_v4_pg1  Standard_D2_v4  1     8.00g   50             
execute   Standard_E2s_v4_pg0 Standard_E2s_v4 1     16.00g  50             
execute   Standard_E2s_v4_pg1 Standard_E2s_v4 1     16.00g  50             
execute   Standard_NC6_pg0    Standard_NC6    6     56.00g  16             
execute   Standard_NC6_pg1    Standard_NC6    6     56.00g  16             
execute   Standard_A11_pg0    Standard_A11    16    112.00g 6              
execute   Standard_A11_pg1    Standard_A11    16    112.00g 6

azpbs demand

It is common that you want to test out autoscaling without actually allocating anything. azpbs demand is a dry-run version of azpbs autoscale. Here is a simple example where we allocate two machines for a simple -l select=2 submission. As you can see, job id 1 is using one ncpus on two different nodes.

azpbs demand
NAME      JOB_IDS NCPUS
execute-1 1       0/1  
execute-2 1       0/1

azpbs create_nodes

Manually creating nodes via azpbs create_nodes is also quite powerful. Note that it also has a --dry-run mode as well.

Here is an example of allocating 100 slots of mem=memory::1g or 1gb partitions. Since our nodes have 4gb each, then we expect 25 nodes to be created.

azpbs create_nodes --keep-alive --vm-size Standard_F2s_v2 --slots 100 --constraint-expr mem=memory::1g --dry-run --output-columns name,/mem
NAME       MEM        
execute-1  0.00g/4.00g
...
execute-25 0.00g/4.00g

azpbs delete_/remove_nodes

azpbs supports safely removing a node from PBS. The different between delete_nodes and remove_nodes is simply that delete_nodes, on top of removing the node from PBS, will also delete the node. You may delete by hostname or node name. Pass in * to delete/remove all nodes.

azpbs shell

azpbs shell is a more advanced command that can be quit powerful. This command fully constructs the entire in-memory structures used by azpbs autoscale to allow the user to interact with them dynamically. All of the objects are passed in to the local scope, and can be listd by calling pbsprohelp(). This is a powerful debugging tool.

[root@pbsserver ~] azpbs shell
CycleCloud Autoscale Shell
>>> pbsprohelp()
config               - dict representing autoscale configuration.
cli                  - object representing the CLI commands
pbs_env              - object that contains data structures for queues, resources etc
queues               - dict of queue name -> PBSProQueue object
jobs                 - dict of job id -> Autoscale Job
scheduler_nodes      - dict of hostname -> node objects. These represent purely what the scheduler sees without additional booting nodes / information from CycleCloud
resource_definitions - dict of resource name -> PBSProResourceDefinition objects.
default_scheduler    - PBSProScheduler object representing the default scheduler.
pbs_driver           - PBSProDriver object that interacts directly with PBS and implements PBS specific behavior for scalelib.
demand_calc          - ScaleLib DemandCalculator - pseudo-scheduler that determines the what nodes are unnecessary
node_mgr             - ScaleLib NodeManager - interacts with CycleCloud for all node related activities - creation, deletion, limits, buckets etc.
pbsprohelp            - This help function
>>> queues.workq.resources_default
{'place': 'scatter:group=group_id'}
>>> jobs["0"].node_count
2

azpbs shell can also take in as an argument --script path/to/python_file.py, allowing the user to have full access to the in-memory structures, again by passing in the objects through the local scope, to customize the autoscale behavior.

[root@pbsserver ~] cat example.py 
for bucket in node_mgr.get_buckets():
    print(bucket.nodearray, bucket.vm_size, bucket.available_count)

[root@pbsserver ~] azpbs shell -s example.py 
execute Standard_F2s_v2 50
execute Standard_D2_v4 50
execute Standard_E2s_v4 50

Timeouts

By default we set idle and boot timeouts across all nodes.

   "boot_timeout": 3600

You can also set these per nodearray.

   "boot_timeout": {"default": 3600, "nodearray1": 7200, "nodearray2": 900},

Logging

By default, azpbs will use /opt/cycle/pbspro/logging.conf, as defined in /opt/cycle/pbsspro/autoscale.json. This will create the following logs.

/opt/cycle/pbspro/autoscale.log

autoscale.log is the main log for all azpbs invocations.

/opt/cycle/pbspro/qcmd.log

qcmd.log every PBS executable invocation and the response, so you can see exactly what commands are being run.

/opt/cycle/pbspro/demand.log

Every autoscale iteration, azpbs prints out a table of all of the nodes, their resources, their assigned jobs and more. This log contains these values and nothing else.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.