ALRhub / clusterduck

clusterduck

clusterduck is a Hydra launcher plugin for running jobs in batches on a SLURM cluster. It is intended for small tasks on clusters where jobs have exclusive access to a node, such that submitting a single task to a node would be wasteful.

Installation

Install clusterduck with:

pip install .

Developers should note that Hydra plugins are not compatible with new PEP 660-style editable installs. In order to perform an editable install, either use compatibility mode:

pip install -e . --config-settings editable_mode=compat

or use strict editable mode:

pip install -e . --config-settings editable_mode=strict

Be aware that strict mode installs do not expose new files created in the project until the installation is performed again.

Examples

The example script requires a few additional dependencies. Install with:

pip install ".[examples]"

To run the example script locally, e.g. looping over both model types twice each, use:

python example/train.py --multirun model=convnet,transformer +iteration="range(2)"
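With --multirun, Hydra sweeps over the cross-product of the comma-separated override values, so the command above launches four runs. A minimal sketch of how those combinations are enumerated (illustrative only, not Hydra's actual sweeper implementation):

```python
from itertools import product

# Sweep values taken from the command line above:
# model=convnet,transformer and +iteration=range(2)
models = ["convnet", "transformer"]
iterations = range(2)

# The basic sweeper launches one run per combination of override values.
runs = [
    {"model": m, "iteration": i}
    for m, i in product(models, iterations)
]

for cfg in runs:
    print(cfg)
# 4 runs: (convnet, 0), (convnet, 1), (transformer, 0), (transformer, 1)
```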

To run the example script with the submitit backend, but locally without an actual cluster, specify the debug platform like this:

python example/train.py --multirun model=convnet,transformer +iteration="range(2)" +platform=slurm_debug

To run the example script on the HoreKa cluster, use:

python example/train.py --multirun model=convnet,transformer +iteration="range(2)" +platform=horeka

Configuration Options

This plugin is heavily inspired by the hydra-submitit-launcher plugin and provides all parameters of that original plugin. See the hydra-submitit-launcher documentation for details about those parameters.

Both plugins rely on submitit for the real heavy lifting. See the submitit documentation for more information.

Additional Parameters

We refer to a Hydra job, i.e. one execution of the Hydra main function with a set of overrides, as a run, to differentiate it from both jobs and tasks as defined by SLURM.

This plugin adds the following parameters, which appear in the config example below:

- parallel_runs_per_node: the number of runs executed in parallel within a single SLURM job.
- total_runs_per_node: the total number of runs executed within a single SLURM job.
- resources_config: the resource types (e.g. cpu, cuda, rendering, stagger) to allocate and distribute among the parallel runs.

Here is an example of a hydra/launcher config for HoreKa that uses some of the above options:

hydra:
  launcher:
    # launcher/cluster specific options
    timeout_min: 5
    partition: accelerated
    gres: gpu:4
    setup:
      # Create wandb folder in fast, job-local storage: https://www.nhr.kit.edu/userdocs/horeka/filesystems/#tmpdir
      # NOTE: wandb folder will be deleted after job completion, but by then it will have synced with server
      - export WANDB_DIR=$TMPDIR/wandb
      - mkdir -pv $WANDB_DIR
      - export WANDB_CONSOLE=off

    # clusterduck specific options
    parallel_runs_per_node: 4
    total_runs_per_node: 8
    resources_config:
      cpu:
      cuda:
      rendering:
      stagger:
        delay: 5
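With the settings above, each SLURM job executes at most total_runs_per_node runs, of which at most parallel_runs_per_node execute concurrently. A back-of-the-envelope sketch of how a sweep is split into jobs under these assumed semantics (inferred from the parameter names, not clusterduck's actual scheduling code):

```python
import math

def split_into_jobs(n_runs: int, total_runs_per_node: int):
    """Chunk a sweep of n_runs into SLURM jobs holding at most
    total_runs_per_node runs each (hypothetical helper for illustration)."""
    n_jobs = math.ceil(n_runs / total_runs_per_node)
    return [
        list(range(i * total_runs_per_node,
                   min((i + 1) * total_runs_per_node, n_runs)))
        for i in range(n_jobs)
    ]

# Example: a 20-run sweep with the config above (total_runs_per_node=8)
jobs = split_into_jobs(20, total_runs_per_node=8)
print(len(jobs))               # 3 jobs
print([len(j) for j in jobs])  # [8, 8, 4]
# Within each job, at most parallel_runs_per_node=4 runs execute at once.
```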

For more, look into the example folder, which contains a working example with multiple platform configurations.

Development

PyCUDA is a helpful tool for working with CUDA devices outside the context of a machine learning library like PyTorch. We recommend installing it with conda:

conda install pycuda

Install additional requirements for development using:

pip install ".[all]"

Other Sweepers

clusterduck plays nicely with other Hydra sweeper plugins, for example Optuna. You can find a small example of how to use clusterduck with Optuna in example/conf/optim/optuna.yaml.

To run the example, install the additional dependencies with:

pip install hydra-optuna-sweeper

To run the example with the default Hydra launcher, run:

python example/train.py +optim=optuna

To run the example with clusterduck, run:

python example/train.py +optim=optuna_clusterduck +platform=slurm_debug