Quansight / lsst_dashboard

LSST Dashboard https://quansight.github.io/lsst_dashboard/
BSD 3-Clause "New" or "Revised" License

Performance: use multiple dask 'processes' per compute node. #159

Closed dharhas closed 4 years ago

dharhas commented 4 years ago

It looks like dashboard performance is being hurt by running 1 dask worker (i.e. one process) with 24 threads per compute node. A lot of the computations likely hold the GIL, so the 24 threads end up running mostly serially.
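
For context, roughly how the two layouts differ with dask.distributed's LocalCluster (an illustrative sketch, not code from the dashboard; the worker/thread counts just mirror the 24-core nodes):

from dask.distributed import Client, LocalCluster

# Current layout: one worker process with 24 threads. Any task that
# holds the GIL blocks the other 23 threads.
# cluster = LocalCluster(n_workers=1, threads_per_worker=24)

# Desired layout: several worker processes with fewer threads each,
# so GIL-holding tasks can run in parallel across processes.
cluster = LocalCluster(n_workers=6, threads_per_worker=4, processes=True)
client = Client(cluster)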

Unfortunately, fixing this looks like it might be a bit involved: on lsst-dev we are required to run the Dask scheduler and workers on specific port ranges, and the combination of multiple processes per compute node with specific port ranges is not currently supported by dask-jobqueue (sketched after the links below).

Relevant issues:

https://github.com/dask/distributed/issues/899
https://github.com/dask/distributed/issues/2471
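
For reference, the working single-process-per-node setup with pinned ports looks roughly like this (a sketch under assumptions: the queue and memory values are placeholders, and the keyword for extra worker arguments is `extra` in dask-jobqueue releases of that era, `worker_extra_args` in newer ones):

from dask_jobqueue import SLURMCluster

# With processes=1 a fixed --worker-port works, since there is only one
# worker per node. With processes > 1, each process would need its own
# port, which dask-jobqueue cannot express with a single fixed value.
cluster = SLURMCluster(
    queue="normal",      # placeholder partition
    cores=24,
    processes=1,         # one dask worker process per job
    memory="120GB",
    scheduler_options={"port": 29000, "dashboard_address": ":20500"},
    extra=["--worker-port", "29001"],
)
cluster.scale(jobs=2)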

timothydmorton commented 4 years ago

We are also limited by the Slurm configuration on the LSST cluster, which doesn't allow multiple processes per node anyway.

However, we have been promised an early crack at the HTCondor cluster they are setting up. I haven't heard anything about that recently, but I pinged about it again today.

dharhas commented 4 years ago

I was able to get multiple processes per node working and see the dask cluster, but dask was having issues with tasks dying. There is one other thing I need to try with regard to explicitly setting the dask Nanny port. If that works, I'll post my slurm code here.

dharhas commented 4 years ago

Workaround that does not use dask-jobqueue. First, start the scheduler:

dask-scheduler --port 29000 --dashboard-address :20500

and then use a startcluster.sh script:

#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p normal
#SBATCH --nodes=2
#SBATCH --ntasks=12
#SBATCH --cpus-per-task=4
#SBATCH --mem=120G
#SBATCH -t 00:30:00

# Launch 12 workers (6 per node), each bound to its own port in 29001-29012.
for i in {29001..29012}; do
    srun --exclusive -N 1 -n 1 \
        /software/lsstsw/stack_20200220/python/miniconda3-4.7.12/envs/lsst-scipipe/bin/python \
        -m distributed.cli.dask_worker tcp://141.142.237.49:29000 \
        --nthreads 4 --nprocs 1 --memory-limit 20GB \
        --name ${SLURM_JOB_ID}${i} --no-nanny --death-timeout 60 \
        --worker-port ${i} >& slurm-${SLURM_JOB_ID}-task-${i}.out &
done
wait

and then

sbatch startcluster.sh

and then connect the client:

from dask.distributed import Client
client = Client('127.0.0.1:29000') 
from lsst_dashboard.gui import dashboard
dashboard.render().show(port=20000)
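
Before rendering, it can help to confirm that all 12 workers actually joined; a small check using the client from above (wait_for_workers is standard distributed API, the count just matches the batch script):

# Block until all 12 workers from the batch job have connected.
client.wait_for_workers(12)
print(list(client.scheduler_info()["workers"]))
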
dharhas commented 4 years ago

This is fixed in https://github.com/dask/distributed/pull/3704 and should be available in the next release of distributed.

In the meantime, you can install it with:

pip install --user git+https://github.com/jrbourbeau/distributed.git@port-range

and use my branch https://github.com/Quansight/lsst_dashboard/tree/precompute-dataframes/lsst_dashboard for the changes that need to be made in __main__.py.
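
With port ranges available, multiple processes per node become workable because each worker draws a free port from a range instead of needing one fixed value. Roughly the idea (a sketch, not the exact contents of the branch's __main__.py; keyword names again depend on the dask-jobqueue version):

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=24,
    processes=6,         # six worker processes per node, 4 threads each
    memory="120GB",
    scheduler_options={"port": 29000, "dashboard_address": ":20500"},
    # Each process binds a free port from the range (dask/distributed#3704).
    extra=["--worker-port", "29001:29100", "--nanny-port", "29101:29200"],
)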