alan-turing-institute / spc-hpc-pipeline

Azure batch pipeline for the SPC project

Running SPC pipeline on the cloud - Microsoft Azure

It is possible to make use of Azure cloud infrastructure to run the SPC pipeline in two ways:

In order to do these, you will need the following:

Setting up your Azure config.py

In this repo there is a file config.py containing placeholders for the various fields that you must fill in. The necessary information can be found in the Azure portal [https://portal.azure.com/#home] - perhaps the easiest way is to navigate via the "Subscriptions" icon at the top of the portal, then find the "Resources" (i.e. the Storage Account and the Batch Account).
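As a rough illustration only, the placeholders look along the lines of the sketch below; apart from POOL_NODE_COUNT (referenced later in this README), the exact variable names are assumptions, so check the actual config.py in this repo.

# Illustrative sketch of config.py placeholders - names other than POOL_NODE_COUNT are assumptions.
# Values come from the Azure portal (the Batch Account and Storage Account resources).
BATCH_ACCOUNT_NAME = ""     # your Azure Batch account name
BATCH_ACCOUNT_KEY = ""      # key from the Batch account "Keys" blade
BATCH_ACCOUNT_URL = ""      # e.g. https://<batch-account-name>.<region>.batch.azure.com
STORAGE_ACCOUNT_NAME = ""   # your Azure Storage account name
STORAGE_ACCOUNT_KEY = ""    # key from the Storage account "Access keys" blade
POOL_NODE_COUNT = 4         # must be at least the number of LADs run in parallel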

Once you have populated the fields in config.py, then do

pip install -r requirements.txt

from your preferred environment in the top level directory of this repo.

Preparing submodules

Run

git submodule update --init --recursive

to pull submodules required for these scripts. After this you will need to prepare a zip file with the required modules:

zip -r submodules.zip submodules/

Setting up your NOMIS API key

The SPC pipeline uses an API provided by Nomisweb which allows relatively easy programmatic access to the data. Nomisweb currently hosts the ONS principal NPP data for the UK, the SNPP data for England, and all of the MYE data.

You need to obtain a NOMIS API key and add it to the scripts/scp/NOMIS_API_KEY.txt file before running any SPC jobs.

Running the SPC pipeline on batch

The script spc-hpc-client.py is designed to create a batch Pool and assign parallel tasks that run a given script individually for different LADs. The script has several options, which can be understood by running:

python spc-hpc-client.py --help

which returns:

options:
  -h, --help            show this help message and exit
  --upload_files UPLOAD_FILES
                        Path to files to be uploaded to batch container and used to run the script.
  --submodules SUBMODULES
                        Path where submodules are stored which are used by scripts
  --script_file_name SCRIPT_FILE_NAME
                        Name of bash script to be ran on jobs, should exist in the path provided by '--upload_files'
  --lads [ALIST ...]    LADs codes to be ran in parallel, one code per task. Examples: --lads E06000001 E06000002 E06000003 E06000004
  --lads_file LADS_FILE
                        Path to CSV file containing the LAD codes to be used, under a column names "LAD20CD"

Quickstart

  1. For example, to run the SPC pipeline on 4 LADs in parallel you can run the following:

python spc-hpc-client.py --upload_files scripts/scp --script_file_name SPENSER_HPC_setup.sh --submodules submodules --lads E06000001 E06000002 E06000003 E06000004

  2. If you want to run the SPC pipeline on all the LADs in parallel you can run the following:

python spc-hpc-client.py --upload_files scripts/scp --script_file_name SPENSER_HPC_setup.sh --submodules submodules --lads_file data/new_lad_list.csv

For each case you have to make sure that the POOL_NODE_COUNT variable in the config.py file is at least the number of LADs you plan to run in parallel, and that your quota allows it (in case 1, POOL_NODE_COUNT=4).
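For --lads_file, the help text above says the LAD codes are taken from a column named "LAD20CD". A minimal sketch of that read is shown below; the actual parsing inside spc-hpc-client.py may differ.

# Minimal sketch: read LAD codes from a --lads_file CSV with a "LAD20CD" column.
import csv

def read_lad_codes(path):
    with open(path, newline="") as f:
        return [row["LAD20CD"] for row in csv.DictReader(f)]

# e.g. read_lad_codes("data/new_lad_list.csv")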

SPC pipeline output

Note that using Azure storage as detailed above is a prerequisite for using Azure batch.

For each job a time-stamped container is created, in which you can find the following:

All tasks of one job save their outputs to the same container.

Checking the status of your job on Azure batch

To be added.

Downloading data from Azure storage when it is ready

For a given submission, all files created by each task will be stored in the storage container belonging to the storage account defined in the config.py file. The container is named something like scp-TIMESTAMP-JOB-SUBMISSION. The files produced for each LAD are stored in subdirectories named with the LAD code, as described above.
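For illustration, the downloaded container therefore looks roughly like this (the container name and LAD codes are examples):

scp-2022-12-08-13-04-44/
    E06000001/    (output files for this LAD)
    E06000002/
    E06000003/
    E06000004/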

To download the contents of a container we use AzCopy, a command-line tool that moves data into and out of Azure Storage. First you must download the tool to your machine as described here and authenticate using the command azcopy login.

Once you have logged in you can download a directory in the following way:

azcopy copy 'https://<storage-account-name>.blob.core.windows.net/<container-name>/<directory-path>' '<local-directory-path>' --recursive

Example in our case:

azcopy copy 'https://scpoutputs.blob.core.windows.net/scp-2022-12-08-13-04-44/' 'WalesMicrosimulation/' --recursive

You can check whether the required files for every LAD have been produced and downloaded by running a simple file-counting script on the downloaded container directory. With the current version of this code, 69 files should be produced for each LAD.

In the example above you can do:

cd WalesMicrosimulation/scp-2022-12-08-13-04-44/
source count_files.sh 69 

The script will return a warning for any subdirectory (named after the LAD code, e.g. W06000002) that contains a different number of files from the one given as an argument (69 in the example). A LAD subdirectory with an unexpected file count means that there was an issue with the microsimulation run for that LAD, which needs to be investigated.
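If you prefer Python, a minimal sketch of an equivalent per-LAD file count check (not the repo's count_files.sh itself) is:

# Minimal sketch of a per-LAD file count check, equivalent in spirit to count_files.sh.
# Assumes the downloaded container directory contains one subdirectory per LAD code.
import sys
from pathlib import Path

expected = int(sys.argv[1]) if len(sys.argv) > 1 else 69
root = Path(sys.argv[2]) if len(sys.argv) > 2 else Path(".")

for lad_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    n_files = sum(1 for f in lad_dir.rglob("*") if f.is_file())
    if n_files != expected:
        print(f"WARNING: {lad_dir.name} has {n_files} files, expected {expected}")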

What is going on "under the hood" when running on Azure batch?

(This section is only necessary if you are interested in knowing more about how this works - if you just want to run the jobs, the instructions above should suffice.)

When you run the command

python spc-hpc-client.py --upload_files scripts/scp --script_file_name SPENSER_HPC_setup.sh --submodules submodules --lads E06000001 E06000002 E06000003 E06000004

The batch functionality is implemented at the LAD level and follows these steps:

For each Task, the process is then:

What does SPENSER_HPC_setup.sh do ?

The execution of a single Task on a batch node (which is an Ubuntu-X.X VM) is governed by the shell script SPENSER_HPC_setup.sh, which has the following input arguments:

  1. The LAD to be simulated.

For this pipeline, the command run on a given task is the following:

/bin/bash SPENSER_HPC_setup.sh E06000001
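For illustration, one Batch task per LAD carrying this command line could be built with the azure-batch SDK roughly as sketched below; the task ids and the commented add_collection call are illustrative, and the actual task construction in spc-hpc-client.py may differ.

# Sketch only: building one Batch task per LAD with the azure-batch SDK.
import azure.batch.models as batchmodels

def make_tasks(lads):
    return [
        batchmodels.TaskAddParameter(
            id="task-{}".format(lad),
            command_line="/bin/bash SPENSER_HPC_setup.sh {}".format(lad),
        )
        for lad in lads
    ]

# batch_client.task.add_collection(JOB_ID, make_tasks(["E06000001", "E06000002"]))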

The basic flow of the SPENSER_HPC_setup.sh script is:

What happens when all tasks are submitted?

The script spc-hpc-client.py will submit all the tasks and wait for them to finish. Once the tasks have finished, it asks the user on the command line whether they want to delete the Pool and Jobs.
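For reference, the usual azure-batch SDK pattern for this wait-and-prompt step looks roughly like the sketch below; function and variable names are illustrative, and the actual code in spc-hpc-client.py may differ.

# Sketch of the typical azure-batch wait-then-clean-up pattern; illustrative only.
import time
import azure.batch.models as batchmodels

def wait_then_offer_cleanup(batch_client, job_id, pool_id):
    # Poll until every task in the job reports the 'completed' state.
    while True:
        tasks = list(batch_client.task.list(job_id))
        if tasks and all(t.state == batchmodels.TaskState.completed for t in tasks):
            break
        time.sleep(30)
    # Offer to delete the Batch resources, as spc-hpc-client.py does.
    if input("Delete job and pool? [y/N] ").strip().lower() == "y":
        batch_client.job.delete(job_id)
        batch_client.pool.delete(pool_id)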

Running locally

Because the target system for this pipeline is Ubuntu 20.04, running on anything else may require tweaks and modifications. A Dockerfile is provided to enable running the pipeline locally with a setup consistent with Azure.

To build the image run

docker build -t "dyme-spc:Dockerfile" .

then start the container:

docker run --name dyme -d -t "dyme-spc:Dockerfile"

With the container running, you can then start a bash shell inside it:

docker exec -it dyme bash 

Here, run the SPENSER script with a list of LADs or a single LAD. In the example below, the full set of Welsh LADs will be run (the awk pipeline extracts the LAD codes from the first column of new_lad_list_Wales.csv, dropping the leading quote character and the header row):

./SPENSER_HPC_setup.sh `awk -F "\"*,\"*" '{print substr($1,2)}' new_lad_list_Wales.csv | awk 'NR!=1 {print}'`