Dask on HPC, what works and what doesn't

mrocklin commented 5 years ago

Hi All,

I'd like for a group of us to write a blogpost about using Dask on supercomputers, including why we like it today, and highlighting improvements that could be done in the near future to improve usability. My goal for this post is to show it around to various HPC groups, and to show it to my employer to motivate work in this area. I think that now is a good time for this community to have some impact by sharing its recent experience.

cc'ing some notable users today @guillaumeeb @jhamman @kmpaul @lesteve @dharhas @josephhardinee @jakirkham

To start conversation, if we were to structure the post as five reasons we use Dask on HPC and five things that could be better, what would be those five things? I think it'd be good to get a five-item list from a few people cc'ed above, then maybe we talk about those lists and I (or anyone else if interested) composes an initial draft that we can then all iterate on?

kmpaul commented 5 years ago

@jhamman can probably amend or append to this...

5 Reasons We Use Dask at NCAR

In my mind, these are the 5 best things that we get from Dask, here at NCAR:

Cost: Dask is free and open source, which means we at NCAR do not have to rebalance our budget and staff to address the new immediate need of data analysis tools that we have failed to invest in over the last decade as Big Data problems have reared their ugly head.
Scalability: Dask provides the scalability we need to deal with the massively large datasets that our existing (primarily serial) data analysis tools cannot.
Simplicity: By abstracting the parallelism away from the user/developer, our analysis tools can be written by computer science non-experts, such as the scientists themselves, meaning that our software engineers can take on a more supporting role than a leadership role.
Inspiration: Dask is more than just a tool to NCAR; it is a gateway to thinking about a completely different way of providing computing infrastructure to our users. Dask opens up the door to cloud computing technologies (such as elastic scaling and object storage) and makes us rethink what an HPC center should really look like!
Collaboration: Dask, and members of the Dask development team, have been critical collaborators with NCAR staff, helping us learn a new way of conducting business in the open source software world.

3 Things That Could Be Better

I probably need some help developing this list, as this list is short and very self-centered. In fact, some of these things could actually already have solutions...which I would be happy to hear about.

Batch Launching: For many at NCAR, launching a parallel job means 2 things: (1) writing a PBS or SLURM batch script that (2) runs your parallel application with mpirun or mpiexec. While Dask is fantastic for interactive jobs with Jupyter Lab/Notebooks, a great deal of our workflows at NCAR are still (and probably will be for the foreseeable future) chained batch-job workflows. To fit into those workflows (without extra modifications), a single batch job must be submitted that launches the Scheduler, Workers, and the Client (and application, obviously) together.
Coarse-Grained Diagnostics: Dask provides a number of profiling tools that give you diagnostics at the individual task-level, but (as far as I know), there is no way of analyzing or profiling your Dask application at the (for lack of a better term) "unchunked" task-level (or the "operator-level", if that makes sense). The best way I know of getting coarse-grained diagnostics is to scale your application down to a much smaller size so that you can run it serially. However, some applications just don't show problems, or at least do not indicate the truly problematic step, when scaled back.
Scheduler Profiling: When large graphs get created, there can be long delays before any sign of execution shows up on the dashboard. I can only assume that this is because of work that the Scheduler is doing that is not being displayed. It would be nice to get some information about what the Scheduler is doing at these moments. Can this be done today?

...@jhamman, is there anything you would add?

mrocklin commented 5 years ago

Thanks for getting this started @kmpaul !

kmpaul commented 5 years ago

Oh, and I just was reminded of another "Thing That Could Be Better"...

Non-Dashboard Diagnostics: Is there a way to get the same information displayed on the dashboard dumped to a file(s) that can be processed after the job? That would be really useful.

guillaumeeb commented 5 years ago

My contribution, just some quick thoughts:

Reasons we use Dask at CNES

Ease of use: extension of Numpy or Pandas, well known API for scientists and engineers, or other simple ones for doing multinode python multiprocessing.
Versatility of APIs: Arrays, dataframes, bag, delayed, futures, everthing is possible. More accessible, and more possibilities than with Spark.
Smooth HPC integration with dask-jobqueue (and dask-mpi): no need of any boilerplate bash code with qsub, interactive analysis at scale and auto scaling is just awesome.
Infrastructure agnostic: compatible with laptop, server, HPC cluster, and even Cloud computing if needed, environment may change with very little code adaptations, and great scalability.
Aimed for scientific processing, compatible with several scientific data format (thanks to Xarray and other libraries): NetCDf, Parquet, remote sensing incomming with RasterIO... Not only text analysis, Integrated into Scientific Python stack

Things that could be better

Heterogeneous resources handling: workers with low or high memory, workers with GPUS, all in the same Dask cluster, multiple node pools... See https://github.com/dask/distributed/issues/2118
Scheduler in a job: Separating the scheduler from the notebook/ipython console/user process. See https://github.com/dask/distributed/issues/2235
Link with Deep Learning: how to feed data retrieved and organized by Dask into DL algorithm, interface between Dask and TensorFlow/PyTorch, could we launch Distributed TensorFlow from Dask?
Task and submission history (maybe this is already possible?), one feature of Spark that Dask miss for processing post analysis.
Remote sensing data (Jpeg2000, GeoTiff) integration to improve, and probably other scientific format (Fits ...)? But not directly related do Dask.

Also 👍 to Scheduler Profiling from @kmpaul, ~~but 👎 on Batch Launching that is already covered by Dask IMO~~

guillaumeeb commented 5 years ago

And I'm also very interested in contributing into the blog post.

There may also be ideas to take in @jhamman post that you already relayed on dask-blog: https://blog.dask.org/2018/10/08/Dask-Jobqueue.

kmpaul commented 5 years ago

@guillaumeeb I'd be interested to hear how you solve the Batch Launching problem. And, if you feel it is a solved problem, I'm obviously happy to take it off the list (and learn something myself!).

guillaumeeb commented 5 years ago

Maybe I'm missunderstanding your point, but isn't dask-mpi just there for batch launching dask applications? Happy to discuss on gitter about this in order to not pollute this issue.

kmpaul commented 5 years ago

Sure. Let's move to gitter.

guillaumeeb commented 5 years ago

@kmpaul just convinced me that dask-mpi was not yet sufficient to implement correctly batch launching, so in the end 👍 to improved batch launching too!

mrocklin commented 5 years ago

I'm happy to start writing up a skeleton draft of this if people are interested.

Alternatively, it would still be good to get thoughts from others. I wonder if now that AGU is over folks like @rabernat or @jhamman have time. I'm also interested in thoughts from non-earth-scientists like @lesteve and @ogrisel if they have time to list some general thoughts.

guillaumeeb commented 5 years ago

ping also @willirath and @apatlpo.

dharhas commented 5 years ago

Apologies on the delayed response. @guillaumeeb list is actually pretty spot on. Very interested on the heterogeneous resource launching and improved batch launching.

willirath commented 5 years ago

Reasons we use Dask at GEOMAR

(In addition to virtually all of the above)

Easy to teach: When training people in using Dask, it's very easy to expose them to exactly the fraction of the API that is necessary for the task at hand.

Things that could be better

Heterogeneous clusters: Both, making them easier to launch and having a simple way of associating built-in methods of Dask arrays with resources would be great.

mrocklin commented 5 years ago

I've added a quick draft here: https://github.com/dask/dask-blog/pull/6

Help filling things in there would be welcome.

dask / dask-blog