RS-DAT / JupyterDaskOnSLURM


JupyterDask Containerised #66

Closed fnattino closed 7 months ago

fnattino commented 7 months ago

Can we create a Dask cluster and a JupyterLab session running on a SLURM system using containers? Can we maintain the "adaptive" behaviour (being able to scale workers via SLURM)? If Dask-Jobqueue allows for the customisation of the commands used to start the scheduler and the workers, we could include the call to Singularity there.

Maybe a good starting point is the example of a (local) container-based Dask deployment (dask-docker): can we get a similar setup running on SLURM?
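
For reference, a sketch (not from the thread) of what such a customisation could look like, using dask-jobqueue's python argument, which sets the executable used to launch the workers; the image name, queue, and resources are illustrative:

from dask_jobqueue import SLURMCluster

# launch each worker inside the container instead of with the host Python;
# image name, queue, and resources below are illustrative
cluster = SLURMCluster(
    cores=2,
    memory="8GB",
    walltime="01:00:00",
    queue="normal",
    python="apptainer exec test-jupyterdask-image.sif python",
)
cluster.scale(jobs=2)  # submit two SLURM jobs, each starting a worker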

fnattino commented 7 months ago

Useful material for this task:

fnattino commented 7 months ago

Running Jupyter (and Dask LocalClusters) in a Singularity container works fine, following these steps:

  1. Download an image from Docker Hub or GitHub Packages, converting it to Singularity (now Apptainer) format:
apptainer build test-jupyterdask-image.sif docker://ghcr.io/fnattino/test-jupyterdask-image:latest
  2. Start JupyterLab in a container on a compute node:
sbatch jupyter.slurm ./test-jupyterdask-image.sif

The file jupyter.slurm can look like:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=2
#SBATCH --partition=normal

# container image passed as the first argument to sbatch
CONTAINER=$1

# pick a random port for the Jupyter server
PORT=$(shuf -i 8400-9400 -n 1)

# report the compute node and the SSH tunnel command to reach it
NODE=$(hostname -s)
echo "Run the following on your local machine:"
echo "ssh -i /path/to/ssh/key -N -L 8889:${NODE}:${PORT} ${USER}@spider.surf.nl"

# start JupyterLab inside the container (-d enables debug output)
apptainer -d exec \
        "$CONTAINER" \
        jupyter lab --no-browser --port="${PORT}" --ip=0.0.0.0
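
For completeness, a minimal check (not from the thread) that a Dask LocalCluster starts and computes inside the containerised JupyterLab session; worker counts and array sizes are illustrative:

from dask.distributed import Client, LocalCluster
import dask.array as da

# start a local cluster within the container's allocated resources
cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)

# a trivial computation: the mean of a large random array
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())  # should print a value close to 0.5
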
fnattino commented 7 months ago

However, the hack from https://gist.github.com/willirath/2176a9fa792577b269cb393995f43dda to use SLURM commands (e.g. sbatch) from within the container does not work on Spider, presumably because SSH access to the compute nodes is blocked. This means that Dask-Jobqueue cannot be used from within the container. Will confirm with somebody from the Spider team.

fnattino commented 7 months ago

Adding workers manually (as done in https://github.com/pbranson/pangeo-hpc-singularity/tree/master) works. One can add a worker by submitting a job that starts a worker process:

# getting scheduler address from the container running Jupyter (and the Dask scheduler)
sbatch dask-worker.slurm ./test-jupyterdask-image.sif tcp://10.0.0.41:33831

Where dask-worker.slurm looks like:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=2
#SBATCH --partition=normal

# container image and scheduler address passed as arguments to sbatch
CONTAINER=$1
SCHEDULER_ADDRESS=$2

# set the worker memory limit to 98% of the memory allocated to the task
mempcpu=$SLURM_MEM_PER_CPU
memlim=$(echo "$SLURM_CPUS_PER_TASK*$mempcpu*0.98" | bc)

# start the worker inside the container and connect it to the scheduler
apptainer -d exec \
        "$CONTAINER" \
        dask worker "$SCHEDULER_ADDRESS" --nthreads "$SLURM_CPUS_PER_TASK" --memory-limit "${memlim}M" --nanny --death-timeout 600 --local-directory "$TMPDIR"
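
The scheduler address passed to dask-worker.slurm above can be read off from the Jupyter session. A sketch (not from the thread), assuming the scheduler is started as a worker-less LocalCluster in the container:

from dask.distributed import Client, LocalCluster

# scheduler only: workers are added externally via dask-worker.slurm
cluster = LocalCluster(n_workers=0)
client = Client(cluster)
print(cluster.scheduler_address)  # e.g. tcp://10.0.0.41:33831

# once the sbatch jobs have started, check that the workers have joined
print(client.scheduler_info()["workers"])
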
fnattino commented 7 months ago

Conclusion: Jupyter and Dask can easily be run on a SLURM system (Spider) using containers. However, Dask-Jobqueue, which would allow one to start the cluster from the Jupyter interface, does not work, since one cannot SSH from the container back to the host (presumably because SSH access to the compute nodes is blocked).

fnattino commented 7 months ago

Material now uploaded to https://github.com/RS-DAT/JupyterDask-Singularity, with follow-up tasks defined in issues therein.