dask-contrib / dask-databricks

Cluster tools for running Dask on Databricks
https://pypi.org/project/dask-databricks/
BSD 3-Clause "New" or "Revised" License

High-level plan and scope #1

Closed jacobtomlinson closed 1 year ago

jacobtomlinson commented 1 year ago

This issue outlines the high-level plan and scope of dask-databricks.

Overview

Running Dask clusters on Databricks is possible today but is not well supported. The goal of dask-databricks is to provide folks with useful tools to make launching and using Dask clusters on Databricks easier.

Specifically, this tool is intended to launch a Dask cluster alongside a Databricks multi-node Spark cluster. Spark's architecture is similar to Dask's in that it has a driver node that coordinates work and many worker nodes that carry it out. When using Databricks notebooks, the cells of the notebook are executed on the driver node. When using PySpark or other libraries, tasks are submitted to the driver, which distributes them to the workers.

Via init scripts it is possible to run arbitrary commands on each node as it starts up, and each node knows its role and the driver node's network information via environment variables. Therefore we can use an init script to start Dask scheduler and worker processes alongside the Spark driver and worker processes.
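
As a rough illustration, a process started from an init script could work out its role and the scheduler address from environment variables along these lines; the variable names (DB_IS_DRIVER, DB_DRIVER_IP) and port are assumptions, not confirmed Databricks or dask-databricks behaviour:

import os

# Determine whether this node is the driver and where the driver lives.
# DB_IS_DRIVER and DB_DRIVER_IP are assumed names for Databricks-provided variables.
is_driver = os.environ.get("DB_IS_DRIVER", "FALSE").upper() == "TRUE"
scheduler_address = f"tcp://{os.environ.get('DB_DRIVER_IP', '127.0.0.1')}:8786"

role = "scheduler" if is_driver else "worker"
print(f"This node should run a Dask {role}; scheduler expected at {scheduler_address}")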

From the user's perspective in a notebook environment, they can then choose to use either PySpark or Dask and utilize the same set of hardware.

Outline

There are two key technical aspects to using Dask on Databricks: 1) starting the cluster and 2) using the cluster. These functions will be provided by two separate parts of dask-databricks.

Cluster startup tool

To simplify launching the Dask cluster components, a CLI tool will be provided. Users should be able to create a minimal init script with something like this:

#!/bin/bash

pip install dask-databricks
dask databricks run

It would be great if that's all that is needed to launch the Dask cluster.
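
For illustration, a run command along these lines could start the right process on each node; the environment variable names, port, and process-launching approach below are assumptions rather than the actual implementation:

import os
import subprocess


def run():
    # Assumed Databricks-provided environment variables (names are an assumption)
    driver_ip = os.environ.get("DB_DRIVER_IP", "127.0.0.1")
    is_driver = os.environ.get("DB_IS_DRIVER", "FALSE").upper() == "TRUE"

    if is_driver:
        # On the driver node, run a Dask scheduler alongside the Spark driver
        subprocess.Popen(["dask", "scheduler", "--host", driver_ip])
    else:
        # On worker nodes, run a Dask worker pointed at the driver's scheduler
        subprocess.Popen(["dask", "worker", f"tcp://{driver_ip}:8786"])


if __name__ == "__main__":
    run()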

Cluster manager

Then from the notebook side it should be easy for the user to create a Dask Client and connect to the cluster.

It could be nice to implement a DatabricksCluster object that provides convenience methods to the user. This would be similar to HelmCluster in Dask Kubernetes, which, unlike most cluster managers, doesn't create the cluster but instead finds and connects to an existing cluster and then provides the Cluster API.

from dask_databricks import DatabricksCluster
from dask.distributed import Client

# If `dask databricks run` was successful this will connect to the scheduler,
# otherwise it will raise a useful error to help folks troubleshoot their cluster
cluster = DatabricksCluster()

# Connect the client
client = Client(cluster)

# Easily view logs from scheduler and workers in the notebook
cluster.get_logs()
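
For illustration, a cluster manager along these lines could locate the scheduler via the driver's address and fail with a helpful message if it can't be reached; the class name, environment variable, port, and error handling below are assumptions, not the actual dask-databricks implementation:

import os

from distributed import Client


class ExampleDatabricksCluster:
    """Connects to a Dask scheduler assumed to already be running on the driver node."""

    def __init__(self, scheduler_port=8786, timeout=10):
        driver_ip = os.environ.get("DB_DRIVER_IP", "127.0.0.1")  # assumed variable name
        self.scheduler_address = f"tcp://{driver_ip}:{scheduler_port}"
        try:
            # Probe the scheduler so a missing cluster fails fast with a clear message
            with Client(self.scheduler_address, timeout=timeout):
                pass
        except (OSError, TimeoutError) as e:
            raise RuntimeError(
                f"Could not reach a Dask scheduler at {self.scheduler_address}. "
                "Did the `dask databricks run` init script complete successfully?"
            ) from e

    def get_logs(self):
        # Gather recent logs from the scheduler and all workers
        with Client(self.scheduler_address) as client:
            return {
                "scheduler": client.get_scheduler_logs(),
                "workers": client.get_worker_logs(),
            }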
jacobtomlinson commented 1 year ago

Looks like everything in here is all done.