Closed fnattino closed 7 months ago
Running Jupyter (and Dask `LocalCluster`s) in a Singularity (Apptainer) container works fine, following these steps:
```shell
apptainer build test-jupyterdask-image.sif docker://ghcr.io/fnattino/test-jupyterdask-image:latest
sbatch jupyter.slurm ./test-jupyterdask-image.sif
```
The file `jupyter.slurm` can look like:
```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=2
#SBATCH --partition=normal

CONTAINER=$1

# pick a random port for the Jupyter server
PORT=$(shuf -i 8400-9400 -n 1)
NODE=$(hostname -s)

echo "Run the following on your local machine: "
echo "ssh -i /path/to/ssh/key -N -L 8889:${NODE}:${PORT} ${USER}@spider.surf.nl"

apptainer -d exec \
    "$CONTAINER" \
    jupyter lab --no-browser --port="${PORT}" --ip=0.0.0.0
```
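Once the job is running, the tunnel command echoed by the script can be recovered from the job's stdout file (by default `slurm-<jobid>.out`; `sbatch` prints the job ID on submission). A sketch, with a hypothetical job ID, node name, and port:

```shell
# Recover the tunnel command from the job's stdout file.
# The job ID below is hypothetical; use the one printed by sbatch.
jobid=1234567
outfile="slurm-${jobid}.out"

# Simulated content for illustration (on Spider these lines are
# written by jupyter.slurm itself; node name and port are made up).
printf '%s\n' \
  "Run the following on your local machine: " \
  "ssh -i /path/to/ssh/key -N -L 8889:wn-ca-01:8765 user@spider.surf.nl" \
  > "$outfile"

# Extract the ssh command to run on the local machine
grep '^ssh ' "$outfile"
```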
However, the hack from https://gist.github.com/willirath/2176a9fa792577b269cb393995f43dda to use SLURM commands (e.g. `sbatch`) from within the container does not work on Spider, presumably because SSH access to the compute nodes is blocked. Thus, using Dask-Jobqueue from within the container is not possible. Will confirm with somebody from the Spider team.
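For context, the gist's trick is roughly the following: inside the container, a wrapper script named `sbatch` forwards the call over SSH to the host, where the real SLURM client is available. A minimal sketch (the `HOST_NODE` variable and the wrapper location are illustrative); on Spider, the SSH hop is exactly what appears to be blocked:

```shell
# Wrapper that masquerades as sbatch inside the container and forwards
# the call to the host over SSH (sketch of the approach in the gist).
mkdir -p "$HOME/fake-slurm-bin"
cat > "$HOME/fake-slurm-bin/sbatch" <<'EOF'
#!/bin/bash
# HOST_NODE would be the node running the container; this SSH
# connection is what fails on Spider.
ssh "${USER}@${HOST_NODE}" sbatch "$@"
EOF
chmod +x "$HOME/fake-slurm-bin/sbatch"

# Put the wrapper first on PATH inside the container:
export PATH="$HOME/fake-slurm-bin:$PATH"
```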
Adding workers manually (as carried out in https://github.com/pbranson/pangeo-hpc-singularity/tree/master) works. One can add a worker by submitting a job that starts a worker process:
```shell
# scheduler address obtained from the container running Jupyter (and the Dask scheduler)
sbatch dask-worker.slurm ./test-jupyterdask-image.sif tcp://10.0.0.41:33831
```
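The scheduler address has the form `tcp://HOST:PORT`; it can be split in the shell, e.g. to check connectivity from a compute node before submitting workers (the address below is the one from the example above):

```shell
# Split a Dask scheduler address into host and port using
# shell parameter expansion.
SCHEDULER_ADDRESS="tcp://10.0.0.41:33831"
hostport=${SCHEDULER_ADDRESS#tcp://}   # drop the scheme
host=${hostport%:*}                    # 10.0.0.41
port=${hostport##*:}                   # 33831
echo "$host $port"
# e.g. check reachability with: nc -z "$host" "$port"
```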
where `dask-worker.slurm` looks like:
```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=2
#SBATCH --partition=normal

CONTAINER=$1
SCHEDULER_ADDRESS=$2

# calculate the task memory limit (98% of the memory allocated to the task, in MB)
mempcpu=$SLURM_MEM_PER_CPU
memlim=$(echo "$SLURM_CPUS_PER_TASK*$mempcpu*0.98" | bc)

apptainer -d exec \
    "$CONTAINER" \
    dask worker "$SCHEDULER_ADDRESS" \
        --nthreads "$SLURM_CPUS_PER_TASK" \
        --memory-limit "${memlim}M" \
        --nanny --death-timeout 600 \
        --local-directory "$TMPDIR"
```
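The memory-limit arithmetic can be checked in isolation. With assumed values for the SLURM environment variables (here 2 CPUs at 4000 MB each; at runtime SLURM sets these itself), 98% of the task memory comes out as follows (using `awk` instead of `bc` to round to an integer number of MB):

```shell
# Reproduce the memory-limit calculation with assumed SLURM values.
SLURM_CPUS_PER_TASK=2      # assumed; set by SLURM in a real job
SLURM_MEM_PER_CPU=4000     # MB per CPU; assumed, set by SLURM

memlim=$(awk -v c="$SLURM_CPUS_PER_TASK" -v m="$SLURM_MEM_PER_CPU" \
    'BEGIN { printf "%.0f", c * m * 0.98 }')
echo "--memory-limit ${memlim}M"   # -> --memory-limit 7840M
```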
Conclusion: Jupyter and Dask can easily be run on a SLURM system (Spider) using containers. However, Dask-Jobqueue, which would allow one to start the cluster from the Jupyter interface, does not work, because one cannot SSH from the container back to the host (presumably because SSH access to the compute nodes is blocked).
Material is now uploaded to https://github.com/RS-DAT/JupyterDask-Singularity, and follow-up tasks are defined in the issues therein.
Can we create a Dask cluster and a Jupyter Lab session running on a SLURM system using containers? Can we maintain the "adaptive" behaviour (i.e. being able to scale workers via SLURM)? If Dask-Jobqueue allows for customizing the commands used to start the scheduler and the workers, we could include the call to Singularity there.
Maybe a good starting point is the example of a (local) container-based Dask deployment (dask-docker): can we get a similar setup running on SLURM?