Status: Open. GenevieveBuckley opened this issue 3 years ago.
These resources don't necessarily answer the question about how to choose good settings, but might be good to link to:
It'd be good to collect other, non-SLURM links too.
Thanks @GenevieveBuckley for starting the discussion.
In my experience, what users find hardest to understand is how to configure the JobQueueCluster (be it PBS, Slurm or whatever) correctly, and what the kwargs mean. More specifically, for processes, cores, and memory: how do they change the job configuration and the dask-worker configuration?
I think one reason this is particularly confusing is that settings often need to be defined in multiple locations, and people are confused about how they interact
Related to this, there is also the dask-config YAML file versus the kwargs: which should be used, and when?
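To make the two questions above concrete, here is a minimal sketch in plain Python (it does not import dask-jobqueue). It models my understanding of the behaviour: a kwarg passed explicitly takes precedence over the value in the YAML config, cores and memory describe the whole job, and each worker process gets cores // processes threads and memory / processes as its memory limit. The names YAML_DEFAULTS, resolve, and split_job are hypothetical, and the numbers are made up.

```python
# Hypothetical stand-in for values a jobqueue.yaml file might hold for one
# cluster type. The numbers are invented for illustration only.
YAML_DEFAULTS = {"cores": 24, "memory_gb": 96, "processes": 1}


def resolve(yaml_defaults, **kwargs):
    """Model the precedence: a kwarg that is not None overrides the YAML value."""
    settings = dict(yaml_defaults)
    settings.update({k: v for k, v in kwargs.items() if v is not None})
    return settings


def split_job(settings):
    """Show what one job requests and what each worker process inside it gets."""
    cores = settings["cores"]
    memory_gb = settings["memory_gb"]
    processes = settings["processes"]
    return {
        # What the scheduler (PBS, Slurm, ...) is asked for, per job.
        "job": {"cpus": cores, "memory_gb": memory_gb},
        # What each dask-worker process inside that job is started with.
        "worker": {
            "count": processes,
            "threads": cores // processes,
            "memory_limit_gb": memory_gb / processes,
        },
    }


# Roughly equivalent in spirit to SLURMCluster(processes=4) with cores and
# memory left to the YAML config.
settings = resolve(YAML_DEFAULTS, processes=4, memory_gb=None)
plan = split_job(settings)
print(plan)
```

In this model, one job still requests 24 cores and 96 GB, but it is cut into 4 worker processes. With a real cluster object, printing cluster.job_script() is the way to check what is actually submitted.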
For example, someone might submit a job to SLURM with sbatch, which then runs a python program involving Dask, and want to know how that fits together.
I agree. We also need to describe the different possibilities and the "big picture":
JobQueueCluster on the login/front node.
JobQueueCluster.
We could also list improvements still to be made, or point to https://blog.dask.org/2019/06/12/dask-on-hpc which presents a lot of things that are still true. And maybe try to develop point 7, at the end of the post.
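For the sbatch case raised in the quote above, one possible way the pieces fit together is sketched below. This is an assumption on my part, not something the thread settles: the Python program runs inside the allocation and sizes a local Dask cluster from the environment variables Slurm sets for the job. The helper local_cluster_kwargs and its n_workers parameter are hypothetical. It assumes the job was submitted with --cpus-per-task and --mem, so that SLURM_CPUS_PER_TASK and SLURM_MEM_PER_NODE (in megabytes) are defined.

```python
import os


def local_cluster_kwargs(env=os.environ, n_workers=1):
    """Derive keyword arguments for a local Dask cluster from a Slurm allocation."""
    # Falls back to a single CPU when the variable is missing (e.g. outside Slurm).
    cpus = int(env.get("SLURM_CPUS_PER_TASK", "1"))
    kwargs = {
        "n_workers": n_workers,
        "threads_per_worker": max(1, cpus // n_workers),
    }
    mem_mb = env.get("SLURM_MEM_PER_NODE")
    if mem_mb is not None:
        # memory_limit is per worker, so divide the job's memory between workers.
        kwargs["memory_limit"] = f"{int(mem_mb) // n_workers}MB"
    return kwargs


# Example environment, as it might look for: sbatch --cpus-per-task=8 --mem=16000
example_env = {"SLURM_CPUS_PER_TASK": "8", "SLURM_MEM_PER_NODE": "16000"}
kwargs = local_cluster_kwargs(example_env, n_workers=2)
print(kwargs)

# Inside the real job one would then do something like:
#   from dask.distributed import Client, LocalCluster
#   client = Client(LocalCluster(**kwargs))
```

Here the resources are fixed once by sbatch, and Dask only divides them up. This differs from running a JobQueueCluster on the login node, where Dask itself submits the jobs.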
That is an excellent and thorough summary @guillaumeeb!
We also might add:
how to configure the JobQueueCluster (be it PBS, Slurm or whatever) correctly, and what do the kwargs mean.
Building on "what do the kwargs mean", it would be good if we could not only explain each concept, but also map it to the words used for the same concept in other places. Suggesting this because it's the type of question I get - someone has read all the beginner documentation and asks "Is $foo the same as $bar? Does that mean I should set these values to the same thing?"
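As a starting point for such a mapping, here is a small sketch for the Slurm case. The rows reflect my understanding of what SLURMCluster writes into the job script and what it passes to dask-worker. They should be checked against the output of cluster.job_script() for the installed version. The names TERMS and same_concept are hypothetical.

```python
# Each row: (dask-jobqueue kwarg, sbatch option, dask-worker option).
# None means there is no direct counterpart in that place.
TERMS = [
    ("cores", "--cpus-per-task", "--nthreads"),    # nthreads = cores // processes
    ("memory", "--mem", "--memory-limit"),         # limit = memory / processes
    ("processes", None, "--nworkers"),             # --nprocs in older releases
    ("walltime", "-t", None),
    ("queue", "-p", None),
]


def same_concept(term_a, term_b):
    """Answer "is $foo related to $bar?" by checking whether they share a row."""
    return any(term_a in row and term_b in row for row in TERMS)


print(same_concept("cores", "--cpus-per-task"))
print(same_concept("memory", "--nthreads"))
```

Sharing a row does not mean the values are equal. With processes greater than 1, --nthreads and --memory-limit are the per-process share of cores and memory, not the job totals.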
It'd be good to have a blogpost about how to choose good settings for Dask on HPC. Users are often confused about this.
I think one reason this is particularly confusing is that settings often need to be defined in multiple locations, and people are confused about how they interact. For example, someone might submit a job to SLURM with sbatch, which then runs a python program involving Dask, and want to know how that fits together.
https://github.com/dask/dask-blog/issues/116#issuecomment-947370655
@guillaumeeb has kindly agreed to help put this together https://github.com/dask/dask-blog/issues/116#issuecomment-947955079