JupyterDask Containerised #66

fnattino commented 7 months ago

Can we create a Dask cluster and a JupyterLab session running on a SLURM system using containers? Can we maintain the "adaptive" behaviour (being able to scale workers through SLURM)? If Dask Jobqueue allows customizing the commands that start the scheduler and workers, we could include the call to Singularity there.

A good starting point may be the example of a (local) container-based Dask deployment (dask-docker). Can we get a similar setup running on SLURM?
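The idea of including the container call in the start-up commands can be sketched as a small shell helper that prefixes any command with `apptainer exec <image>`. This is only an illustration: the helper name `wrap_in_container` and the scheduler address are hypothetical, and whether Dask Jobqueue offers a suitable hook for such a prefix is exactly the open question above.

```shell
# Prefix an arbitrary command with the container runtime call.
# The wrapped command is printed rather than executed, so it can be inspected.
wrap_in_container() {
    image=$1
    shift
    echo "apptainer exec $image $*"
}

# Hypothetical scheduler address, for illustration only.
wrap_in_container ./test-jupyterdask-image.sif dask worker tcp://scheduler-host:8786
```

The printed line is the command a worker job would need to run instead of the bare `dask worker` call.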

fnattino commented 7 months ago

Useful material for this task:

fnattino commented 7 months ago

Running Jupyter (and a Dask LocalCluster) in a Singularity container works fine with the following steps:

  1. Download an image from Docker Hub or GitHub Packages, converting it to Singularity (now Apptainer) format:
apptainer build test-jupyterdask-image.sif docker://
  2. Start JupyterLab in a container on a compute node:
sbatch jupyter.slurm ./test-jupyterdask-image.sif

The file jupyter.slurm can look like:

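#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=2
#SBATCH --partition=normal

# container image, passed as the first argument after the script name
CONTAINER=$1

# pick a random port for JupyterLab
PORT=$(shuf -i 8400-9400 -n 1)

# print the SSH tunnel command to the job output
NODE=$(hostname -s)
echo "Run the following on your local machine: "
echo "ssh -i /path/to/ssh/key -N -L 8889:${NODE}:${PORT} ${USER}"

# start JupyterLab inside the container
apptainer -d exec \
        "$CONTAINER" \
        jupyter lab --no-browser --port=${PORT} --ip=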
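The tunnel command printed by the job forwards a local port to the JupyterLab port on the compute node. A minimal sketch of how that command is assembled, with a hypothetical helper `tunnel_cmd` and placeholder node, port and login values (none of them taken from Spider), and without the `-i` key option:

```shell
# Build the SSH port-forwarding command for a JupyterLab session on a compute node.
# Arguments: local port, compute node, remote port, SSH login (user@host).
tunnel_cmd() {
    echo "ssh -N -L $1:$2:$3 $4"
}

# Placeholder values, for illustration only.
tunnel_cmd 8889 node042 8765 user@login.example.org
```

Once the tunnel is up, JupyterLab is reachable on the local machine at port 8889 of localhost.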
fnattino commented 7 months ago

However, the hack for using SLURM commands (e.g. sbatch) from within the container does not work on Spider, presumably because SSH access to the compute nodes is blocked. Thus, using Dask Jobqueue from within the container is not possible. I will confirm this with somebody from the Spider team.
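The thread does not spell the hack out. One common form, assumed here purely for illustration, is a wrapper inside the container that forwards each SLURM command to the host over SSH. The function name `slurm_via_ssh`, the `DRY_RUN` switch and the node name below are all hypothetical:

```shell
# Hypothetical wrapper: run a SLURM command on the host by SSH-ing out of the container.
# With DRY_RUN set to a non-empty value, the command is printed instead of executed.
slurm_via_ssh() {
    host=$1
    shift
    ${DRY_RUN:+echo} ssh "$host" "$@"
}

# Dry run with a placeholder node name.
DRY_RUN=1
slurm_via_ssh node01 sbatch dask-worker.slurm
```

A wrapper of this kind can only work if the container is allowed to open an SSH connection to the node, which appears not to be the case on Spider.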

fnattino commented 7 months ago

Adding workers manually works. One can add a worker by submitting a job that starts it on a compute node:

# getting scheduler address from the container running Jupyter (and the Dask scheduler)
sbatch dask-worker.slurm ./test-jupyterdask-image.sif tcp://

Where dask-worker.slurm looks like:

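#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=2
#SBATCH --partition=normal

# container image and scheduler address, passed as arguments after the script name
CONTAINER=$1
SCHEDULER_ADDRESS=$2

# calculate task memory limit (98% of CPUs x memory per CPU, in MB)
# note: mempcpu (memory per CPU, in MB) is not set in this snippet and must be defined first
memlim=$(echo "$SLURM_CPUS_PER_TASK*$mempcpu*0.98" | bc)

apptainer -d exec \
        "$CONTAINER" \
        dask worker "$SCHEDULER_ADDRESS" \
        --nthreads "$SLURM_CPUS_PER_TASK" \
        --memory-limit "${memlim}M" \
        --nanny \
        --death-timeout 600 \
        --local-directory "$TMPDIR"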
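The memory limit is the number of CPUs times the memory per CPU, scaled by 0.98 to leave a small margin. A worked sketch of the same arithmetic using shell integer arithmetic instead of `bc`; the helper name `worker_memlim_mb` and the 8000 MB per CPU figure are assumptions for illustration, not values from Spider:

```shell
# Memory limit per worker in MB: CPUs x memory-per-CPU x 0.98.
# Integer arithmetic: multiply by 98, then divide by 100 (any fraction is truncated).
worker_memlim_mb() {
    echo $(( $1 * $2 * 98 / 100 ))
}

# Assumed example: 2 CPUs with 8000 MB each.
worker_memlim_mb 2 8000   # prints 15680
```

With these numbers the worker would be started with `--memory-limit 15680M`.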
fnattino commented 7 months ago

Conclusion: Jupyter and Dask can easily be run on a SLURM system (Spider) using containers. However, Dask Jobqueue, which allows one to start the cluster from the Jupyter interface, does not work, because one cannot SSH from the container to the host (presumably because SSH access to the compute nodes is blocked).

fnattino commented 7 months ago

Material now uploaded to a repository, with follow-up tasks defined in the issues therein.