RS-DAT / JupyterDaskCloud2Cluster

Material to deploy Jupyter and Dask on SRC and a SLURM cluster
Apache License 2.0

Cross-system Dask #1

Open fnattino opened 8 months ago

fnattino commented 8 months ago

Can we run Dask Jobqueue outside the SLURM system (e.g. on SRC) and have workers submitted to SLURM? Dask Jobqueue uses sbatch/scancel to manage jobs; can one provide custom commands that connect to the remote system to submit/delete jobs? In the worst case, could one create aliases for sbatch/scancel that do the job?
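Since dask-jobqueue shells out to plain `sbatch`/`scancel` commands, wrapper scripts that appear earlier on `PATH` can transparently redirect those calls over SSH. A minimal self-contained sketch of the interception mechanism (the wrapper body here is only a stand-in; a real wrapper would forward the call over SSH, as in the scripts later in this thread):

```shell
# Minimal sketch: a wrapper script named `sbatch`, placed first on PATH,
# shadows any local SLURM binary. The body here is a stand-in that just
# reports it was called; a real wrapper would run ssh <user>@<host> sbatch.
mkdir -p "$HOME/bin"
cat > "$HOME/bin/sbatch" << 'EOF'
#!/bin/bash
# Stand-in body; a real wrapper would forward: ssh <USERNAME>@<host> sbatch "$@"
echo "wrapper sbatch called with: $@"
EOF
chmod +x "$HOME/bin/sbatch"
export PATH="$HOME/bin:$PATH"

command -v sbatch   # now resolves to the wrapper, not a local SLURM binary
```

Because dask-jobqueue only sees whatever `sbatch` resolves to on `PATH`, no change to the library itself is needed.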

Questions:

fnattino commented 7 months ago

Ideas to set up SSH connections to the SLURM system from SRC:
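One option (a sketch under assumptions: the host alias, socket path, and timeout are illustrative, not taken from the thread) is SSH connection multiplexing, so that every sbatch/squeue/scancel wrapper invocation reuses a single authenticated connection instead of opening a new one each time:

```shell
# Illustrative ssh_config entry: multiplex all connections to the SLURM
# login node over one master connection. Alias, socket path, and
# ControlPersist timeout are assumptions.
mkdir -p ~/.ssh/sockets
cat >> ~/.ssh/config << 'EOF'
Host slurm
    HostName spider.surf.nl
    User <USERNAME>
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 10m
EOF
chmod 600 ~/.ssh/config
```

With this in place, `ssh slurm sbatch ...` in the wrappers would authenticate once and then reuse the open connection for subsequent calls.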

fnattino commented 7 months ago

Two issues in the "standard" setup with Dask Jobqueue running on SRC (with a local scheduler) and workers running on the SLURM system:

fnattino commented 7 months ago

Possible solution to the first issue above: in the sbatch alias, copy the temporary file to the SLURM system and submit it there. Seems to work (jobs get submitted).

fnattino commented 7 months ago

Possible solution to the second issue above (tested):
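The tested fix is not spelled out in the comment, but one way to make a scheduler that only listens on localhost reachable from the SLURM side is a reverse SSH tunnel. A sketch, assuming the scheduler listens on local port 8786 and should be reachable as port 8787 on the login node (matching the contact_address used in the cluster configuration further down in this thread); hostname and username are placeholders:

```shell
# Illustrative reverse tunnel: remote port 8787 on the SLURM login node
# forwards back to localhost:8786, where the local scheduler listens.
# The real (blocking) invocation would be:
#
#   ssh -N -R 8787:localhost:8786 <USERNAME>@spider.surf.nl
#
# `ssh -G` below only evaluates the options without connecting, which is
# enough to inspect the resulting reverse-forward configuration:
ssh -G -R 8787:localhost:8786 someuser@spider.surf.nl | grep -i remoteforward
```

Run in the background (e.g. with `-f -N`), the tunnel stays up while workers dial back to the scheduler through the login node.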

fnattino commented 7 months ago

Update on the above: now also working on compute nodes, but hacky and likely over-complicated:

fnattino commented 7 months ago

Useful resource for SSH tunnels: https://iximiuz.com/en/posts/ssh-tunnels/

fnattino commented 7 months ago

Apart from the SSH forwarding madness, the following configures most of the elements for Dask Jobqueue:

from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(
    scheduler_options={
        'port': 8786, 
        'host': 'localhost',
        'contact_address': 'tcp://ui-01:8787', 
    },
    cores=1,
    memory='8GiB',
    queue='normal',
    processes=1,
    death_timeout=600,
    local_directory='$TMPDIR',
    walltime='1:00:00',
    job_script_prologue=[
        'APPTAINER_TMPDIR=${TMPDIR}',
    ],
    python='apptainer exec oras://ghcr.io/fnattino/test-jupyterdask-image-apptainer:latest python',
)

with bin/sbatch:

#!/bin/bash

# Read the job script that Dask Jobqueue wrote to a local temporary file
content=$(cat "$@")

# Recreate the job script in a temporary file on the SLURM system and
# submit it there; mktemp is escaped so that it runs on the remote side
ssh -t <USERNAME>@spider.surf.nl "
TMPFILE=\$(mktemp)
cat << 'EOF' > \$TMPFILE
${content}
EOF
sbatch \$TMPFILE
"

bin/scancel:

#!/bin/bash 

ssh -q -t <USERNAME>@spider.surf.nl "scancel $@"

and bin/squeue:

#!/bin/bash 

ssh -q -t <USERNAME>@spider.surf.nl "squeue $@"

Fields that I cannot seem to set directly via the Dask Jobqueue interface (but that could be set via worker_extra_args) are: