PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0
15.79k stars 1.55k forks source link

Allow `cluster_kwargs` to be provided via config #5659

Open multimeric opened 2 years ago

multimeric commented 2 years ago

Current behavior

Currently it is not possible to configure the dask cluster via the config file

Proposed behavior

We should be able to configure it in this manner:

[engine.executor.dask]
cluster_class = "dask_jobqueue.SLURMCluster"
cluster_kwargs = {cores = 1, memory = "1 GB"}

Example

This would allow more of the technicalities of execution, which are mostly an environment-specific concern, to be abstracted out from the flow definition.

For example I want to be able to define a flow once, and for it to run locally using the LocalDaskExecutor, but also on HPC using the DaskExecutor + SLURMCluster, solely using configuration. However at present this isn't really possible since we can't properly configure the cluster without actually changing the flow itself.

Context

Note how we allow the cluster class to be defined in the config, but not its kwargs: https://github.com/PrefectHQ/prefect/blob/40e2e2bd1c1de5651bc7fb484c823e43cfc69877/src/prefect/executors/dask.py#L166-L173

zanieb commented 2 years ago

Note this is already possible since we support use of an arbitrary callable as the cluster_class. For example, you can set the cluster class to a callable that inspects the config/env for the settings you want then returns an instance of a cluster class. Does that make sense?

Since we do allow configuration of the cluster class via the config, I'm leaning towards allowing the kwargs to be set there as well. My only concern is that the supported datatypes will be limited.

multimeric commented 2 years ago

Note this is already possible since we support use of an arbitrary callable as the cluster_class. For example, you can set the cluster class to a callable that inspects the config/env for the settings you want then returns an instance of a cluster class. Does that make sense?

This is a neat workaround. I guess you mean that I should first make a class that looks up cluster_kwargs (or whatever I choose to call it), and then set the cluster_class in the config to a string containing the full module path to that class. I prefer this, because it saves me having to instantiate the executor myself (which I want to avoid).

Since we do allow configuration of the cluster class via the config, I'm leaning towards allowing the kwargs to be set there as well. My only concern is that the supported datatypes will be limited.

This would be great! Of course TOML can't support everything, but this happens to be one of the classes where almost all the args are primitive typed anyway.

multimeric commented 2 years ago

It kind of feels like there should be a blanket system for translating config into keyword args for any of the nested orchestration classes like the executor, cluster etc. But perhaps that's too ambitious.

zanieb commented 2 years ago

Yeah basically. You can set the cluster class to a callable that returns a cluster, it doesn't actually need to be a class

def dynamic_cluster():
   # lookup cluster class and cluster kwargs from anywhere e.g.
   cluster_class = import_object(os.environ["CLUSTER_CLASS"])
   kwargs = ...
   return cluster_class(**kwargs)

It kind of feels like there should be a blanket system for translating config into keyword args for any of the nested orchestration classes like the executor, cluster etc. But perhaps that's too ambitious.

Yeah, this sounds nice but ends up creating confusing expectations for users on merging / precedence. Perhaps something we can consider addressing in v2 since there aren't strong expectations yet.