A Pythonic introduction to methods for scaling your data science and machine learning work to larger datasets and larger models, using the tools and APIs you know and love from the PyData stack (such as numpy, pandas, and scikit-learn).
Good Dask notes from @adbreind that I would like to include in this tutorial:
About Dask
Dask was created in 2014 as part of the Blaze project, a DARPA-funded project at Continuum/Anaconda. It has since grown into a multi-institution community project with developers from projects including NumPy, pandas, Jupyter, and scikit-learn. Many of the core Dask maintainers are employed to work on the project by companies including Continuum/Anaconda, Prefect, NVIDIA, Capital One, Saturn Cloud, and Coiled.
Fundamentally, Dask enables a variety of parallel workflows using existing Python constructs, patterns, and libraries, including dataframes (scaling out pandas), arrays (scaling out NumPy), bags (an unordered collection construct, a bit like Counter), and concurrent.futures.
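To make that concrete, here is a minimal sketch of the core Dask collections in action (it assumes a standard `dask` install; the tiny example data is made up for illustration):

```python
# A minimal sketch of Dask's drop-in collections (assumes a standard `dask` install).
import pandas as pd
import dask.array as da
import dask.bag as db
import dask.dataframe as dd

# Array: numpy-like, split into 1000x1000 chunks; everything is lazy until .compute()
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())

# DataFrame: pandas-like, partitioned; here built from a small in-memory pandas frame
pdf = pd.DataFrame({"key": ["a", "b"] * 4, "value": range(8)})
df = dd.from_pandas(pdf, npartitions=2)
print(df.groupby("key").value.mean().compute())

# Bag: an unordered collection with map/filter/fold-style operations
b = db.from_sequence(range(10))
print(b.map(lambda n: n ** 2).sum().compute())
```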
In addition to working in conjunction with Python ecosystem tools, Dask's extremely low scheduling overhead (nanoseconds in some cases) allows it to work well even on single machines, and to scale up smoothly.
Dask supports a variety of use cases in industry and research: https://stories.dask.org/en/latest/
With its recent 2.x releases and integration with other projects (e.g., RAPIDS for GPU computation), many commercial enterprises are paying attention and jumping into parallel Python with Dask.
Dask Ecosystem
In addition to the core Dask library and its Distributed scheduler, the Dask ecosystem connects several additional initiatives, including...
Dask-ML - parallel machine learning with a scikit-learn-style API (a short sketch follows this list)
Dask-kubernetes
Dask-XGBoost
Dask-YARN
Dask-image
Dask-cuDF
... and some others
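As a taste of the Dask-ML API mentioned above, here is a minimal, hedged sketch (it assumes `dask-ml` is installed and uses a synthetic dataset purely for illustration):

```python
# A minimal Dask-ML sketch (assumes the `dask-ml` package is installed).
from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

# Synthetic dask-array-backed dataset, split into chunks of 10,000 rows
X, y = make_classification(n_samples=100_000, chunks=10_000)

clf = LogisticRegression()   # scikit-learn-style estimator
clf.fit(X, y)                # training operates on the dask arrays chunk by chunk
print(clf.score(X, y))
```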
What's Not Part of Dask?
There is a lot of functionality that integrates with Dask but is not part of the core Dask ecosystem, including...
a SQL engine
data storage
data catalog
visualization
coarse-grained scheduling / orchestration
streaming
... although there are typically other Python packages that fill these needs (e.g., Kartothek or Intake for a data catalog).
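For example, here is a hedged sketch of using Intake as a data catalog alongside Dask; the catalog file name and the `taxi_trips` entry below are hypothetical placeholders:

```python
# A minimal Intake sketch; "catalog.yml" and "taxi_trips" are hypothetical names.
import intake

cat = intake.open_catalog("catalog.yml")   # load a YAML data catalog
print(list(cat))                           # list the data sources it describes

# Read one catalogued source; with a dask-aware driver, .to_dask() returns
# a lazy dask collection rather than an in-memory pandas object.
df = cat.taxi_trips.to_dask()
print(df.head())
```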
How Do We Set Up and/or Deploy Dask?
The easiest way to install Dask is with Anaconda: `conda install dask`
Schedulers and Clustering
Dask has a simple default scheduler called the "single machine scheduler" -- this is the scheduler that's used if you `import dask` and start running code without explicitly creating a `Client` object. It can be handy for quick-and-dirty testing, but I would (warning! opinion!) suggest that a best practice is to use the newer "distributed scheduler" even for single-machine workloads.
The distributed scheduler can work with
threads (although that is often not a great idea due to the GIL) in one process
multiple processes on one machine
multiple processes on multiple machines
The distributed scheduler has additional useful features including data locality awareness and realtime graphical dashboards.
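Here is a minimal sketch of spinning up the distributed scheduler on a single machine (it requires the `distributed` package, which is installed alongside Dask; the dashboard address is the default one):

```python
# A minimal sketch: the distributed scheduler on a single machine.
from dask.distributed import Client

# With no arguments, Client() starts a local cluster of worker processes
# and serves the diagnostic dashboard (by default at http://localhost:8787).
client = Client()
print(client)

# Threads-only variant (a single process), e.g. for GIL-releasing numeric code:
# client = Client(processes=False)
```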