dask / dask-kubernetes

Native Kubernetes integration for Dask
https://kubernetes.dask.org
BSD 3-Clause "New" or "Revised" License

Improve advanced KubeCluster customization API #899

Open jacobtomlinson opened 3 months ago

jacobtomlinson commented 3 months ago

Background

When creating a new KubeCluster the __init__()/_start() process happens in a couple of steps.

  1. Generate a dictionary of the Kubernetes spec from provided kwargs using make_cluster_spec()
  2. Create a DaskCluster class (which was created by kr8s.asyncio.objects.make_class()) from that dictionary
  3. Call DaskCluster.create()

If users want to customize their cluster spec before creating it, they can call make_cluster_spec() directly. It takes the same kwargs as KubeCluster and returns the dictionary generated in step 1; they can then modify the dict and pass it straight to step 2 via KubeCluster(custom_cluster_spec=...).

Problem statement

We are very conservative about adding new kwargs to make_cluster_spec() because it is easy for that method to become extremely overwhelming. However, as highlighted in #898, there are customizations users may want to make regularly, such as setting the memory limit in the Dask worker CLI args, which is not very ergonomic to do by modifying the dict directly. It would be nice if the object returned from make_cluster_spec() were easier to customize than a plain dictionary.

For example

```python
from dask_kubernetes.operator import KubeCluster, make_cluster_spec

cluster_spec = make_cluster_spec("foo", ...)

# Currently we do this
cluster_spec["spec"]["worker"]["spec"]["containers"][0]["args"] += ["--memory-limit", "4GB"]
# It would be nicer if we could do something like this instead
# cluster_spec.set_worker_memory_limit("4GB")

cluster = KubeCluster(custom_cluster_spec=cluster_spec)
```

Option 1

Instead of returning a dict from make_cluster_spec() we could return the DaskCluster object from step 2. This object is dict-like so any existing code that modifies the dict should work as expected (although this will need to be verified).
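As an illustration of why that should hold, here is a minimal dict-like wrapper (this is not kr8s's actual implementation; the class name and structure are hypothetical) showing that existing dict-mutation code keeps working against an object that implements the mapping protocol:

```python
from collections.abc import MutableMapping


class DictLikeResource(MutableMapping):
    """Hypothetical sketch of a resource object that proxies mapping
    access to an underlying raw dict, as a kr8s-style object might."""

    def __init__(self, raw):
        self.raw = raw

    def __getitem__(self, key):
        return self.raw[key]

    def __setitem__(self, key, value):
        self.raw[key] = value

    def __delitem__(self, key):
        del self.raw[key]

    def __iter__(self):
        return iter(self.raw)

    def __len__(self):
        return len(self.raw)


# The existing dict-style customization from the example above still works:
spec = DictLikeResource(
    {"spec": {"worker": {"spec": {"containers": [{"args": ["dask-worker"]}]}}}}
)
spec["spec"]["worker"]["spec"]["containers"][0]["args"] += ["--memory-limit", "4GB"]
```

Whether kr8s objects implement enough of the mapping protocol for this to hold in practice is exactly what would need verifying.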

Pros

The existing DaskCluster class is a convenient place to add methods to simplify some of the common customizations. It will be created anyway, we would just be changing the step at which the user can intervene from 1 to 2.

Cons

The DaskCluster object returned by make_cluster_spec() would be an async kr8s object, even though KubeCluster may be used in a sync context, and we would then be extending that async object with sync methods for modifying things.

Returning the DaskCluster object leaks the implementation detail that we are using kr8s to interact with Kubernetes.

It also means in theory a user could call await cluster_spec.create() which would create the Dask cluster, but it wouldn't handle any of the port forwarding, log retrieval, etc and couldn't be passed to dask.distributed.Client.

Having KubeCluster and DaskCluster, which serve different purposes but have similar names, may cause confusion with users.

Option 2

We could introduce some new intermediate object with a name like KubeClusterSpec which would effectively be a subclass of dict but with whatever convenience methods we want. We would return this from make_cluster_spec().

We would need to ensure that DaskCluster can be passed a KubeClusterSpec instead of a dict. There is already support in kr8s for passing a few different things, so we should be able to just conform to one of those.
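A minimal sketch of what this could look like (KubeClusterSpec and set_worker_memory_limit are the hypothetical names from this proposal, not an existing API):

```python
class KubeClusterSpec(dict):
    """Hypothetical dict subclass: existing dict-based code keeps
    working, while common customizations become one-line method calls."""

    def set_worker_memory_limit(self, limit):
        # Equivalent to the manual nested-dict edit shown earlier.
        container = self["spec"]["worker"]["spec"]["containers"][0]
        container.setdefault("args", []).extend(["--memory-limit", limit])


# Usage, assuming make_cluster_spec() returned this class:
spec = KubeClusterSpec(
    {"spec": {"worker": {"spec": {"containers": [{"args": ["dask-worker"]}]}}}}
)
spec.set_worker_memory_limit("4GB")
```

Because it subclasses dict, any existing code that indexes into the spec directly would continue to work unchanged.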

Pros

Adding a new class would avoid confusion, as users would continue to never interact with kr8s resources directly.

The purpose of the new class is well scoped: it is a dictionary of the spec with some utility methods to simplify modifying it.

Cons

Introducing a new class into the chain increases complexity.

droctothorpe commented 3 months ago

As a single data point, I prefer option 2. Internally, we have a DaskClusterConfig class with a spec method that just returns a dictionary in the format that custom_cluster_spec expects.

This is nice because it lets you configure your cluster in an object oriented way instead of manipulating a complex, nested dictionary. It provides sensible defaults but you can easily override them. Everything is a property so you can instantiate first then set specific properties later. It pretty prints your config and even creates an interactive ipywidget when executed within a Jupyter notebook. It was inspired by Dask Gateway's Options class.
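The internal class isn't shown, but the pattern described might look roughly like this (all names, defaults, and the abbreviated spec here are illustrative, not the actual DaskClusterConfig):

```python
class DaskClusterConfig:
    """Illustrative sketch of an object-oriented cluster config with
    sensible defaults, settable properties, and a spec() method that
    returns a dict in the format custom_cluster_spec expects."""

    def __init__(self, name, worker_memory="2GB"):
        self.name = name
        self.worker_memory = worker_memory  # default, can be overridden later

    @property
    def worker_memory(self):
        return self._worker_memory

    @worker_memory.setter
    def worker_memory(self, value):
        self._worker_memory = value

    def spec(self):
        # Heavily abbreviated; a real spec also carries image,
        # scheduler config, resources, etc.
        return {
            "metadata": {"name": self.name},
            "spec": {"worker": {"spec": {"containers": [{
                "name": "worker",
                "args": ["dask-worker", "--memory-limit", self.worker_memory],
            }]}}},
        }


# Instantiate first, set specific properties later:
config = DaskClusterConfig("foo")
config.worker_memory = "4GB"
```

The resulting dict would then be handed off as KubeCluster(custom_cluster_spec=config.spec()).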

jacobtomlinson commented 3 months ago

Yeah I think I'm leaning that way too.