A Pythonic introduction to methods for scaling your data science and machine learning work to larger datasets and larger models, using the tools and APIs you know and love from the PyData stack (such as numpy, pandas, and scikit-learn).
Good Dask notes from @adbreind that I would like to include in this tutorial:
About Dask
Dask was created in 2014 as part of the Blaze project, a DARPA-funded project at Continuum/Anaconda. It has since grown into a multi-institution community project with developers from projects including NumPy, pandas, Jupyter, and scikit-learn. Many of the core Dask maintainers are employed to work on the project by companies including Continuum/Anaconda, Prefect, NVIDIA, Capital One, Saturn Cloud, and Coiled.
Fundamentally, Dask enables a variety of parallel workflows using existing Python constructs, patterns, and libraries, including dataframes (scaling out pandas), arrays (scaling out NumPy), bags (an unordered collection construct, a bit like Counter), and concurrent.futures.
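To make that concrete, here is a minimal sketch of the core Dask collections in action (it assumes a standard `dask` install; the tiny example data is made up for illustration):

```python
# A minimal sketch of Dask's drop-in collections (assumes a standard `dask` install).
import pandas as pd
import dask.array as da
import dask.bag as db
import dask.dataframe as dd

# Array: numpy-like, split into 1000x1000 chunks; everything is lazy until .compute()
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())

# DataFrame: pandas-like, partitioned; here built from a small in-memory pandas frame
pdf = pd.DataFrame({"key": ["a", "b"] * 4, "value": range(8)})
df = dd.from_pandas(pdf, npartitions=2)
print(df.groupby("key").value.mean().compute())

# Bag: an unordered collection with map/filter/fold-style operations
b = db.from_sequence(range(10))
print(b.map(lambda n: n ** 2).sum().compute())
```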
In addition to working in conjunction with Python ecosystem tools, Dask's extremely low scheduling overhead (nanoseconds in some cases) allows it to work well even on single machines, and to scale up smoothly.
Dask supports a variety of use cases in industry and research: https://stories.dask.org/en/latest/
With its recent 2.x releases and integration with other projects (e.g., RAPIDS for GPU computation), many commercial enterprises are paying attention and jumping into parallel Python with Dask.
Dask Ecosystem
In addition to the core Dask library and its Distributed scheduler, the Dask ecosystem connects several additional initiatives, including...
Dask-ML - parallel machine learning with a scikit-learn-style API (a short sketch follows this list)
Dask-kubernetes
Dask-XGBoost
Dask-YARN
Dask-image
Dask-cuDF
... and some others
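As a taste of the Dask-ML API mentioned above, here is a minimal, hedged sketch (it assumes `dask-ml` is installed and uses a synthetic dataset purely for illustration):

```python
# A minimal Dask-ML sketch (assumes the `dask-ml` package is installed).
from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

# Synthetic dask-array-backed dataset, split into chunks of 10,000 rows
X, y = make_classification(n_samples=100_000, chunks=10_000)

clf = LogisticRegression()   # scikit-learn-style estimator
clf.fit(X, y)                # training operates on the dask arrays chunk by chunk
print(clf.score(X, y))
```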
What's Not Part of Dask?
There is a lot of functionality that integrates with Dask but is not part of the core Dask ecosystem, including...
a SQL engine
data storage
data catalog
visualization
coarse-grained scheduling / orchestration
streaming
... although there are typically other Python packages that fill these needs (e.g., Kartothek or Intake for a data catalog).
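For example, here is a hedged sketch of using Intake as a data catalog alongside Dask; the catalog file name and the `taxi_trips` entry below are hypothetical placeholders:

```python
# A minimal Intake sketch; "catalog.yml" and "taxi_trips" are hypothetical names.
import intake

cat = intake.open_catalog("catalog.yml")   # load a YAML data catalog
print(list(cat))                           # list the data sources it describes

# Read one catalogued source; with a dask-aware driver, .to_dask() returns
# a lazy dask collection rather than an in-memory pandas object.
df = cat.taxi_trips.to_dask()
print(df.head())
```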
How Do We Set Up and/or Deploy Dask?
The easiest way to install Dask is with Anaconda: `conda install dask`
Schedulers and Clustering
Dask has a simple default scheduler called the "single machine scheduler" -- this is the scheduler that's used if you `import dask` and start running code without explicitly creating a `Client` object. It can be handy for quick-and-dirty testing, but I would (warning! opinion!) suggest that a best practice is to use the newer "distributed scheduler" even for single-machine workloads.
The distributed scheduler can work with
threads (although that is often not a great idea due to the GIL) in one process
multiple processes on one machine
multiple processes on multiple machines
The distributed scheduler has additional useful features including data locality awareness and realtime graphical dashboards.
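Here is a minimal sketch of spinning up the distributed scheduler on a single machine (it requires the `distributed` package, which is installed alongside Dask; the dashboard address is the default one):

```python
# A minimal sketch: the distributed scheduler on a single machine.
from dask.distributed import Client

# With no arguments, Client() starts a local cluster of worker processes
# and serves the diagnostic dashboard (by default at http://localhost:8787).
client = Client()
print(client)

# Threads-only variant (a single process), e.g. for GIL-releasing numeric code:
# client = Client(processes=False)
```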