We are also limited by the Slurm configuration on the LSST cluster, which doesn't allow multiple processes per node anyway.
However, we have been promised an early crack at the HTCondor cluster they are setting up. I haven't heard anything about that recently, but I pinged again about it today.
I was able to get multiple processes per node working and could see the dask cluster, but dask was having issues with tasks dying. There is one other thing I need to try with regard to explicitly setting the dask Nanny port. If that works, I'll post my Slurm code here.
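For context, "explicitly setting the nanny port" just means launching each worker with a fixed nanny port alongside its fixed worker port. A rough illustration of what I mean (the 29100-based nanny ports are placeholder values I picked, not ports I know to be open on the cluster):
# illustration only: one single-process worker per srun task, with both ports pinned
for i in {1..12}; do
  srun --exclusive -N 1 -n 1 dask-worker tcp://141.142.237.49:29000 \
      --nthreads 4 --nprocs 1 \
      --worker-port $((29000 + i)) --nanny-port $((29100 + i)) &
done
wait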
Workaround that does not use dask-jobqueue. First start the scheduler by hand:
dask-scheduler --port 29000 --dashboard-address :20500
and then use a startcluster.sh script like this:
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p normal
#SBATCH --nodes=2
#SBATCH --ntasks=12
#SBATCH --cpus-per-task=4
#SBATCH --mem=120G
#SBATCH -t 00:30:00
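# Launch 12 single-process dask workers (4 threads each) across the 2 allocated nodes,
# one fixed port per worker in 29001-29012, all connecting to the scheduler started above.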
for i in {29001..29012}; do
  srun --exclusive -N 1 -n 1 \
      /software/lsstsw/stack_20200220/python/miniconda3-4.7.12/envs/lsst-scipipe/bin/python -m distributed.cli.dask_worker \
      tcp://141.142.237.49:29000 \
      --nthreads 4 --nprocs 1 --memory-limit 20GB \
      --name ${SLURM_JOB_ID}${i} --no-nanny --death-timeout 60 \
      --worker-port ${i} >& slurm-${SLURM_JOB_ID}-task-${i}.out &
done
wait
and then submit it with
sbatch startcluster.sh
and then connect the client and launch the dashboard:
from dask.distributed import Client
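# assumes this is run on the same machine as dask-scheduler (hence 127.0.0.1);
# otherwise point the Client at the scheduler's address instead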
client = Client('127.0.0.1:29000')
from lsst_dashboard.gui import dashboard
dashboard.render().show(port=20000)
This is fixed in https://github.com/dask/distributed/pull/3704 and should be available in the next release of distributed.
In the meantime you can install it with
pip install --user git+https://github.com/jrbourbeau/distributed.git@port-range
and use my branch https://github.com/Quansight/lsst_dashboard/tree/precompute-dataframes/lsst_dashboard for the changes that need to be made in __main__.py
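With that version of distributed, --worker-port accepts a port range as well as a single port, so, for example, the per-port loop in the startcluster.sh workaround above can hand every worker the same range and let each one grab a free port from it. An untested sketch of the changed worker launch (range syntax as in the linked PR):
for i in {1..12}; do
  # each worker binds a free port from 29001-29012 instead of being assigned ${i}
  srun --exclusive -N 1 -n 1 dask-worker tcp://141.142.237.49:29000 \
      --nthreads 4 --nprocs 1 --memory-limit 20GB --no-nanny --death-timeout 60 \
      --worker-port 29001:29012 &
done
wait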
It looks like the dashboard performance is being hurt by running 1 dask worker (process) with 24 threads per compute node; a lot of the computations are probably being blocked by the GIL.
Unfortunately, fixing this looks like it might be a bit involved, since we are required to run the Dask Scheduler and Workers with specific port ranges on lsst-dev, and the combination of multiple processes per compute node and specific port ranges is currently not supported by dask-jobqueue.
Relevant issues:
https://github.com/dask/distributed/issues/899
https://github.com/dask/distributed/issues/2471
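Once that fix is released, the per-node worker launch we would want dask-jobqueue to put into its generated job script looks roughly like the line below. This is only a sketch: the 29101:29112 nanny range is a placeholder of mine, I am assuming --nanny-port takes a range the same way --worker-port does, and the nanny has to come back because --nprocs greater than 1 cannot be combined with --no-nanny.
# 6 processes x 4 threads = 24 cores per node, all drawing ports from fixed ranges
srun -N 1 -n 1 dask-worker tcp://141.142.237.49:29000 \
    --nprocs 6 --nthreads 4 --memory-limit 20GB --death-timeout 60 \
    --worker-port 29001:29012 --nanny-port 29101:29112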