coiled / data-science-at-scale

A Pythonic introduction to methods for scaling your data science and machine learning work to larger datasets and larger models, using the tools and APIs you know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

Dask notes to include #14

Closed hugobowne closed 4 years ago

hugobowne commented 4 years ago

Good Dask notes from @adbreind that I would like to include in this tutorial:

About Dask

Dask was created in 2014 as part of the Blaze project, a DARPA-funded project at Continuum/Anaconda. It has since grown into a multi-institution community project with developers from projects including NumPy, Pandas, Jupyter, and Scikit-Learn. Many of the core Dask maintainers are employed to work on the project by companies including Continuum/Anaconda, Prefect, NVIDIA, Capital One, Saturn Cloud, and Coiled.

Fundamentally, Dask allows a variety of parallel workflows using existing Python constructs, patterns, or libraries, including dataframes (scaling out pandas), arrays (scaling out NumPy), bags (an unordered collection construct, a bit like Counter), and concurrent.futures.
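
For instance, a minimal sketch (not from the original notes) of what those collections look like in practice; the data and chunk/partition sizes here are made up for illustration:

```python
import dask.array as da
import dask.bag as db
import dask.dataframe as dd
import pandas as pd

# Arrays: NumPy-like, split into chunks that can be computed in parallel
x = da.ones((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())

# Dataframes: pandas-like, partitioned into many smaller pandas dataframes
pdf = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)
print(ddf.groupby("key").value.sum().compute())

# Bags: an unordered collection, handy for semi-structured or messy data
bag = db.from_sequence(range(10), npartitions=2)
print(bag.map(lambda n: n ** 2).sum().compute())
```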

In addition to working in conjunction with Python ecosystem tools, Dask's extremely low scheduling overhead (nanoseconds in some cases) allows it to work well even on single machines and to scale up smoothly.
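
As an illustration (a sketch, not part of the original notes), the snippet below builds a small task graph with dask.delayed and runs it with the default local scheduler, with no setup beyond importing dask:

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(a, b):
    return a + b

total = add(inc(1), inc(2))   # builds a tiny task graph; nothing runs yet
print(total.compute())        # executes the graph on the local scheduler -> 5
```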

Dask supports a variety of use cases for industry and research: https://stories.dask.org/en/latest/

With its recent 2.x releases, and integration with other projects (e.g., RAPIDS for GPU computation), many commercial enterprises are paying attention and jumping into parallel Python with Dask.

Dask Ecosystem

In addition to the core Dask library and its Distributed scheduler, the Dask ecosystem connects several additional initiatives, including...

What's Not Part of Dask?

There are lots of tools and features that integrate with Dask but are not represented in the core Dask ecosystem, including...

... although there are typically other Python packages that fill these needs (e.g., Kartothek or Intake for a data catalog).

How Do We Set Up and/or Deploy Dask?

The easiest way to install Dask is with Anaconda: `conda install dask`
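
Dask can also be installed with pip (e.g., `pip install "dask[complete]"` to pull in the dataframe, array, and distributed extras). A quick sanity check from Python:

```python
# Confirm that Dask is installed and importable
import dask
import dask.dataframe  # raises ImportError if the dataframe extras are missing

print(dask.__version__)
```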

Schedulers and Clustering

Dask has a simple default scheduler called the "single machine scheduler" -- this is the scheduler that's used if you import dask and start running code without explicitly creating a Client object. It can be handy for quick-and-dirty testing, but I would (warning! opinion!) suggest that a best practice is to use the newer "distributed scheduler" even for single-machine workloads.

The distributed scheduler can work with...

The distributed scheduler has additional useful features, including data-locality awareness and real-time graphical dashboards.
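
Following the suggestion above, here is a minimal sketch (not from the original notes) of using the distributed scheduler on a single machine; the worker and thread counts are arbitrary examples, and creating a Client also starts the diagnostic dashboard:

```python
from dask.distributed import Client, LocalCluster
import dask.array as da

if __name__ == "__main__":
    # Start a scheduler and workers as local processes
    cluster = LocalCluster(n_workers=2, threads_per_worker=2)
    client = Client(cluster)
    print(client.dashboard_link)   # URL of the real-time dashboard

    # Any Dask collection computed now runs on the distributed scheduler
    x = da.random.random((5_000, 5_000), chunks=(1_000, 1_000))
    print(x.std().compute())

    client.close()
    cluster.close()
```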

hugobowne commented 4 years ago

mostly done