ecohealthalliance / slurm-containerized-hpc-environment

The goal of this project is to outline the strategies for integrating Simple Linux Utility for Resource Management(SLURM) into our existing High-Performance Computing(HPC) platform.
0 stars 0 forks source link

slurm-containerized-hpc-environment

Docker Build and Push Docker Images

The goal of this project is to outline the strategies for integrating Simple Linux Utility for Resource Management(SLURM) into our existing High-Performance Computing(HPC) platform. Steps

Contents and context

  1. base - Slurm base image from which other components are derived
  2. controller + database - Slurm controller (head-node) definition and database for accounting information)
  3. worker - Slurm worker (compute-node) definition

Container Overview

An example docker-compose.yml file is provided that builds and deploys

Listing of participating containers with FQDNs and their function within the cluster.

Container Function FQDN
controller Slurm Primary Controller controller.local.dev
worker01 Slurm Worker worker01.local.dev
worker02 Slurm Worker worker02.local.dev
worker03 Slurm Worker worker03.local.dev
worker04 Slurm Worker worker04.local.dev

All images defined in docker-compose.yml will be built from the espirado/slurm.base:23.02 base image

Usage

An example docker-compose.yml file is provided that builds and deploys (-d is used to daemonize the call).

docker-compose up -d

Four containers should be observed running when completed

$ docker ps
CONTAINER ID        IMAGE                                COMMAND                  CREATED             STATUS              PORTS                                              NAMES
995183e9391e        espirado/slurm.worker:23.02       "/usr/local/bin/tini…"   10 seconds ago      Up 30 seconds       22/tcp, 3306/tcp, 6817-6819/tcp, 60001-63000/tcp   8787/tcp worker01
a8382a486989        espirado/slurm.worker:23.02       "/usr/local/bin/tini…"   10 seconds ago      Up 30 seconds       22/tcp, 3306/tcp, 6817-6819/tcp, 60001-63000/tcp   8787/tcp worker02
24e951854109        espirado/slurm.controller:23.02   "/usr/local/bin/tini…"   11 seconds ago      Up 31 seconds       22/tcp, 3306/tcp, 6817-6819/tcp,    0.0.0.0:8787->8787/tcp controller

Examples using Slurm

The examples make use of the following commands.

controller

Visit localhost:8787 RSTUDIO then log to Rstudio interface with rstudio:rstudio as username and password

Use the docker exec call to gain a shell on the controller container.

$ docker exec -ti controller /bin/bash
[root@controller /]#

Issue an sinfo call

# sinfo -lN
Wed Apr 11 21:15:35 2018
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
worker01       1   docker*        idle    1    1:1:1   1998        0      1   (null) none
worker02       1   docker*        idle    1    1:1:1   1998        0      1   (null) none

Create a rstudio account and rstudio user in Slurm

# sacctmgr -i add account rstudio description="test account" Organization=eco-health
 Adding Account(s)
  rstudio
 Settings
  Description     = test account
  Organization    = eco-health
 Associations
  A = worker     C = snowflake
 Settings
  Parent        = root

# sacctmgr -i create user rstudio account=test adminlevel=None
 Adding User(s)
  rstudio
 Settings =
  Admin Level     = None
 Associations =
  U = rstudio    A = test     C = snowflake
 Non Default Settings

database

Use the docker exec call to gain a MariaDB/MySQL shell on the database container.

$ docker exec -ti database mysql -uslurm -ppassword -hdatabase.local.dev
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 9
Server version: 5.5.56-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]>

Checkout the slurm_acct_db database and it's tables

MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> show tables;
+-----------------------------------+
| Tables_in_slurm_acct_db           |
+-----------------------------------+
| acct_coord_table                  |
| acct_table                        |
| clus_res_table                    |
| cluster_table                     |
| convert_version_table             |
| federation_table                  |
| qos_table                         |
| res_table                         |
| snowflake_assoc_table             |
| snowflake_assoc_usage_day_table   |
| snowflake_assoc_usage_hour_table  |
| snowflake_assoc_usage_month_table |
| snowflake_event_table             |
| snowflake_job_table               |
| snowflake_last_ran_table          |
| snowflake_resv_table              |
| snowflake_step_table              |
| snowflake_suspend_table           |
| snowflake_usage_day_table         |
| snowflake_usage_hour_table        |
| snowflake_usage_month_table       |
| snowflake_wckey_table             |
| snowflake_wckey_usage_day_table   |
| snowflake_wckey_usage_hour_table  |
| snowflake_wckey_usage_month_table |
| table_defs_table                  |
| tres_table                        |
| txn_table                         |
| user_table                        |
+-----------------------------------+
29 rows in set (0.00 sec)

Validate that the rstudio user was entered into the database

MariaDB [slurm_acct_db]> select * from user_table;
+---------------+------------+---------+--------+-------------+
| creation_time | mod_time   | deleted | name   | admin_level |
+---------------+------------+---------+--------+-------------+
|    1523481120 | 1523481120 |       0 | root   |           3 |
|    1523481795 | 1523481795 |       0 | rstudio|           1 |
+---------------+------------+---------+--------+-------------+
2 rows in set (0.00 sec)

worker01 and worker02

Use the docker exec call to gain a shell on either the worker01 or worker02or worker03 or worker04 container and become the user worker.

$ docker exec -ti -u worker worker01 /bin/bash
[worker@worker01 /]$ cd ~
[worker@worker01 ~]$ pwd
/home/worker

Test password-less ssh between containers

[worker@worker01 ~]$ hostname
worker01.local.dev
[worker@worker01 ~]$ ssh worker02
[worker@worker02 ~]$ hostname
worker02.local.dev
[worker@worker02 ~]$ ssh controller
[worker@controller ~]$ hostname
controller.local.dev

Slurm commands

All commands are issued as the user worker from the controller node

$ docker exec -ti -u worker controller /bin/bash
[worker@controller /]$ cd ~
[worker@controller ~]$ pwd
/home/worker

Test the sacct and srun calls

$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
$ srun -N 2 hostname
worker01.local.dev
worker02.local.dev
$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2              hostname     docker     worker          2  COMPLETED      0:0

Test the sbatch call

Make a job file named: slurm_test.job

#!/bin/bash

#SBATCH --job-name=SLURM_TEST
#SBATCH --output=SLURM_TEST.out
#SBATCH --error=SLURM_TEST.err
#SBATCH --partition=docker

srun hostname | sort

Run the job using sbatch

$ sbatch -N 2 slurm_test.job
Submitted batch job 3

Check the sacct output

$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2              hostname     docker     worker          2  COMPLETED      0:0
3            SLURM_TEST     docker     worker          2  COMPLETED      0:0
3.batch           batch                worker          1  COMPLETED      0:0
3.0            hostname                worker          2  COMPLETED      0:0

Check the output files

$ ls -1
SLURM_TEST.err
SLURM_TEST.out
slurm_test.job
$ cat SLURM_TEST.out
worker01.local.dev
worker02.local.dev

Test the sbatch --array and squeue calls

Make a job file named array_test.job:

#!/bin/bash

#SBATCH -N 1
#SBATCH -c 1
#SBATCH -t 24:00:00
###################
## %A == SLURM_ARRAY_JOB_ID
## %a == SLURM_ARRAY_TASK_ID (or index)
## %N == SLURMD_NODENAME (directories made ahead of time)
#SBATCH -o %N/%A_%a_out.txt
#SBATCH -e %N/%A_%a_err.txt

snooze=$(( ( RANDOM % 10 )  + 1 ))
echo "$(hostname) is snoozing for ${snooze} seconds..."

sleep $snooze

This job defines output directories as being %N which reflect the SLURMD_NODENAME variable. The output directories will need to exist ahead of time in this particular case, and can be determined by finding all available nodes in the NODELIST and creating the directories.

$ sinfo -N
NODELIST   NODES PARTITION STATE
worker01       1   docker* idle
worker02       1   docker* idle
$ mkdir worker01 worker02

The job when run will direct it's output files to the directory defined by the node on which it is running. Each iteration will sleep from 1 to 10 seconds randomly before moving onto the next run in the array.

We will run an array of 20 jobs, 2 at a time, until the array is completed. The status can be found using the squeue command.

$ sbatch --array=1-20%2 array_test.job
Submitted batch job 4
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        4_[3-20%2]    docker array_te   worker PD       0:00      1 (JobArrayTaskLimit)
               4_1    docker array_te   worker  R       0:01      1 worker01
               4_2    docker array_te   worker  R       0:01      1 worker02
...
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          4_[20%2]    docker array_te   worker PD       0:00      1 (JobArrayTaskLimit)
              4_19    docker array_te   worker  R       0:04      1 worker02
              4_18    docker array_te   worker  R       0:10      1 worker01
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Looking into each of the worker01 and worker02 directories we can see which jobs were run on each node.

$ ls
SLURM_TEST.err  array_test.job  worker01
SLURM_TEST.out  slurm_test.job  worker02
$ ls worker01
4_11_err.txt  4_16_err.txt  4_1_err.txt   4_3_err.txt  4_7_err.txt
4_11_out.txt  4_16_out.txt  4_1_out.txt   4_3_out.txt  4_7_out.txt
4_14_err.txt  4_18_err.txt  4_20_err.txt  4_5_err.txt  4_9_err.txt
4_14_out.txt  4_18_out.txt  4_20_out.txt  4_5_out.txt  4_9_out.txt
$ ls worker02
4_10_err.txt  4_13_err.txt  4_17_err.txt  4_2_err.txt  4_6_err.txt
4_10_out.txt  4_13_out.txt  4_17_out.txt  4_2_out.txt  4_6_out.txt
4_12_err.txt  4_15_err.txt  4_19_err.txt  4_4_err.txt  4_8_err.txt
4_12_out.txt  4_15_out.txt  4_19_out.txt  4_4_out.txt  4_8_out.txt

And looking at each *_out.txt file view the output

$ cat worker01/4_14_out.txt
worker01.local.dev is snoozing for 10 seconds...
$ cat worker02/4_6_out.txt
worker02.local.dev is snoozing for 7 seconds...

Using the sacct call we can see when each job in the array was executed

$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2              hostname     docker     worker          2  COMPLETED      0:0
3            SLURM_TEST     docker     worker          2  COMPLETED      0:0
3.batch           batch                worker          1  COMPLETED      0:0
3.0            hostname                worker          2  COMPLETED      0:0
4_20         array_tes+     docker     worker          1  COMPLETED      0:0
4_20.batch        batch                worker          1  COMPLETED      0:0
4_1          array_tes+     docker     worker          1  COMPLETED      0:0
4_1.batch         batch                worker          1  COMPLETED      0:0
4_2          array_tes+     docker     worker          1  COMPLETED      0:0
4_2.batch         batch                worker          1  COMPLETED      0:0
4_3          array_tes+     docker     worker          1  COMPLETED      0:0
4_3.batch         batch                worker          1  COMPLETED      0:0
4_4          array_tes+     docker     worker          1  COMPLETED      0:0
4_4.batch         batch                worker          1  COMPLETED      0:0
4_5          array_tes+     docker     worker          1  COMPLETED      0:0
4_5.batch         batch                worker          1  COMPLETED      0:0
4_6          array_tes+     docker     worker          1  COMPLETED      0:0
4_6.batch         batch                worker          1  COMPLETED      0:0
4_7          array_tes+     docker     worker          1  COMPLETED      0:0
4_7.batch         batch                worker          1  COMPLETED      0:0
4_8          array_tes+     docker     worker          1  COMPLETED      0:0
4_8.batch         batch                worker          1  COMPLETED      0:0
4_9          array_tes+     docker     worker          1  COMPLETED      0:0
4_9.batch         batch                worker          1  COMPLETED      0:0
4_10         array_tes+     docker     worker          1  COMPLETED      0:0
4_10.batch        batch                worker          1  COMPLETED      0:0
4_11         array_tes+     docker     worker          1  COMPLETED      0:0
4_11.batch        batch                worker          1  COMPLETED      0:0
4_12         array_tes+     docker     worker          1  COMPLETED      0:0
4_12.batch        batch                worker          1  COMPLETED      0:0
4_13         array_tes+     docker     worker          1  COMPLETED      0:0
4_13.batch        batch                worker          1  COMPLETED      0:0
4_14         array_tes+     docker     worker          1  COMPLETED      0:0
4_14.batch        batch                worker          1  COMPLETED      0:0
4_15         array_tes+     docker     worker          1  COMPLETED      0:0
4_15.batch        batch                worker          1  COMPLETED      0:0
4_16         array_tes+     docker     worker          1  COMPLETED      0:0
4_16.batch        batch                worker          1  COMPLETED      0:0
4_17         array_tes+     docker     worker          1  COMPLETED      0:0
4_17.batch        batch                worker          1  COMPLETED      0:0
4_18         array_tes+     docker     worker          1  COMPLETED      0:0
4_18.batch        batch                worker          1  COMPLETED      0:0
4_19         array_tes+     docker     worker          1  COMPLETED      0:0
4_19.batch        batch                worker          1  COMPLETED      0:0

Examples using MPI

The examples make use of the following commands.

controller

All commands are issued as the user worker or rstudio from the controller node

$ docker exec -ti -u worker controller /bin/bash
[worker@controller /]$ cd ~
[worker@controller ~]$ pwd
/home/worker

Available implementions of MPI

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: openmpi

About Open MPI

$ ompi_info
                 Package: Open MPI root@a6fd2549e449 Distribution
                Open MPI: 3.0.1
  Open MPI repo revision: v3.0.1
   Open MPI release date: Mar 29, 2018
                Open RTE: 3.0.1
  Open RTE repo revision: v3.0.1
   Open RTE release date: Mar 29, 2018
                    OPAL: 3.0.1
      OPAL repo revision: v3.0.1
       OPAL release date: Mar 29, 2018
                 MPI API: 3.1.0
            Ident string: 3.0.1
                  Prefix: /usr
 Configured architecture: x86_64-redhat-linux-gnu
          Configure host: a6fd2549e449
           Configured by: root
           Configured on: Fri Apr 13 02:32:11 UTC 2018
          Configure host: a6fd2549e449
  Configure command line: '--build=x86_64-redhat-linux-gnu'
                          '--host=x86_64-redhat-linux-gnu'
                          '--program-prefix=' '--disable-dependency-tracking'
                          '--prefix=/usr' '--exec-prefix=/usr'
                          '--bindir=/usr/bin' '--sbindir=/usr/sbin'
                          '--sysconfdir=/etc' '--datadir=/usr/share'
                          '--includedir=/usr/include' '--libdir=/usr/lib64'
                          '--libexecdir=/usr/libexec' '--localstatedir=/var'
                          '--sharedstatedir=/var/lib'
                          '--mandir=/usr/share/man'
                          '--infodir=/usr/share/info' '--with-slurm'
                          '--with-pmi' '--with-libfabric='
                          'LDFLAGS=-Wl,--build-id -Wl,-rpath -Wl,/lib64
                          -Wl,--enable-new-dtags'
...

Hello world using mpi_hello.out

Create a new file called mpi_hello.c in /home/worker and compile it:

/******************************************************************************
 * * FILE: mpi_hello.c
 * * DESCRIPTION:
 * *   MPI tutorial example code: Simple hello world program
 * * AUTHOR: Blaise Barney
 * * LAST REVISED: 03/05/10
 * ******************************************************************************/
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#define  MASTER 0

int main (int argc, char *argv[]) {
   int   numtasks, taskid, len;
   char hostname[MPI_MAX_PROCESSOR_NAME];

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
   MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
   MPI_Get_processor_name(hostname, &len);

   printf ("Hello from task %d on %s!\n", taskid, hostname);

   if (taskid == MASTER)
      printf("MASTER: Number of MPI tasks is: %d\n",numtasks);

   //while(1) {}

   MPI_Finalize();
}
$ mpicc mpi_hello.c -o mpi_hello.out
$ ls | grep mpi
mpi_hello.c
mpi_hello.out

Test mpi_hello.out using the MPI versions avalaible on the system with srun

Run a batch array with a sleep to observe the queue

Create a file named mpi_batch.job in /home/worker (similar to the script used for the sbatch --array example from above, and make an output directory named mpi_out)

file mpi_batch.job:

#!/bin/bash

#SBATCH -N 1
#SBATCH -c 1
#SBATCH -t 24:00:00
###################
## %A == SLURM_ARRAY_JOB_ID
## %a == SLURM_ARRAY_TASK_ID (or index)
#SBATCH -o mpi_out/%A_%a_out.txt
#SBATCH -e mpi_out/%A_%a_err.txt

snooze=$(( ( RANDOM % 10 )  + 1 ))
sleep $snooze

srun -N 2 --mpi=openmpi mpi_hello.out

Make directory mpi_out

$ mkdir mpi_out

Run an sbatch array of 5 jobs, one at a time, using both nodes.

$ sbatch -N 2 --array=1-5%1 mpi_batch.job
Submitted batch job 10
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        10_[2-5%1]    docker mpi_batc   worker PD       0:00      2 (JobArrayTaskLimit)
              10_1    docker mpi_batc   worker  R       0:03      2 worker[01-02]
$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
...
10_[2-5%1]   mpi_batch+     docker     worker          2    PENDING      0:0
10_1         mpi_batch+     docker     worker          2  COMPLETED      0:0
10_1.batch        batch                worker          1  COMPLETED      0:0
10_1.0       mpi_hello+                worker          2  COMPLETED      0:0

...

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        10_[4-5%1]    docker mpi_batc   worker PD       0:00      2 (JobArrayTaskLimit)
              10_3    docker mpi_batc   worker  R       0:05      2 worker[01-02]
$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
...
10_[4-5%1]   mpi_batch+     docker     worker          2    PENDING      0:0
10_1         mpi_batch+     docker     worker          2  COMPLETED      0:0
10_1.batch        batch                worker          1  COMPLETED      0:0
10_1.0       mpi_hello+                worker          2  COMPLETED      0:0
10_2         mpi_batch+     docker     worker          2  COMPLETED      0:0
10_2.batch        batch                worker          1  COMPLETED      0:0
10_2.0       mpi_hello+                worker          2  COMPLETED      0:0
10_3         mpi_batch+     docker     worker          2  COMPLETED      0:0
10_3.batch        batch                worker          1  COMPLETED      0:0
10_3.0       mpi_hello+                worker          2  COMPLETED      0:0

...

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
...
10_5         mpi_batch+     docker     worker          2  COMPLETED      0:0
10_5.batch        batch                worker          1  COMPLETED      0:0
10_5.0       mpi_hello+                worker          2  COMPLETED      0:0
10_1         mpi_batch+     docker     worker          2  COMPLETED      0:0
10_1.batch        batch                worker          1  COMPLETED      0:0
10_1.0       mpi_hello+                worker          2  COMPLETED      0:0
10_2         mpi_batch+     docker     worker          2  COMPLETED      0:0
10_2.batch        batch                worker          1  COMPLETED      0:0
10_2.0       mpi_hello+                worker          2  COMPLETED      0:0
10_3         mpi_batch+     docker     worker          2  COMPLETED      0:0
10_3.batch        batch                worker          1  COMPLETED      0:0
10_3.0       mpi_hello+                worker          2  COMPLETED      0:0
10_4         mpi_batch+     docker     worker          2  COMPLETED      0:0
10_4.batch        batch                worker          1  COMPLETED      0:0
10_4.0       mpi_hello+                worker          2  COMPLETED      0:0

Check the mpi_out output directory

$ ls mpi_out/
10_1_err.txt  10_2_err.txt  10_3_err.txt  10_4_err.txt  10_5_err.txt
10_1_out.txt  10_2_out.txt  10_3_out.txt  10_4_out.txt  10_5_out.txt
$ cat mpi_out/10_3_out.txt
Hello from task 1 on worker02.local.dev!
Hello from task 0 on worker01.local.dev!
MASTER: Number of MPI tasks is: 2

Tear down

The containers, networks, and volumes associated with the cluster can be torn down by simply running:

./teardown.sh

Each step of this teardown may also be run individually:

The containers can be stopped and removed using docker-compose

$ docker-compose stop
Stopping worker01   ... done
Stopping database   ... done
Stopping worker02   ... done
Stopping controller ... done
$ docker-compose rm -f
Going to remove worker01, database, worker02, controller
Removing worker01   ... done
Removing database   ... done
Removing worker02   ... done
Removing controller ... done

The network and volumes can be removed using their representative docker commands

References

Slurm workload manager: https://www.schedmd.com/index.php

Docker: https://www.docker.com

OpenMPI: https://www.open-mpi.org

Lmod: http://lmod.readthedocs.io/en/latest/index.html