Dask can be deployed to Kubernetes; the template is shown here. Allowing this would help users a lot and enable writing really short tasks. Coupled with cluster re-use (coming later) or cluster gateways (daskhub), and support for a Coiled task in the future, this would enable users to use Dask more effectively and make Flyte + Dask work together.
```python
@task(
    config=Dask(
        workers=4,
        worker_resources=...,
        # worker_pod_template=...  # optional; also, the command should
        # probably be hardcoded client side?
    ),
    resources=Resources(...),  # Driver resources
)
def my_dask_program():
    pass
```
This was discussed a little on Slack: https://app.slack.com/client/TN89P6GGK/CNMKCU6FR/thread/CNMKCU6FR-1648660418.322249
Currently we do not use Dask, nor do we use Coiled's hosted platform for Dask. However, both are really interesting to us in terms of migrating away from our current, home-rolled workflow orchestration solution and having someone else run our workloads for us.
The primary interest in Dask is its drop-in nature w.r.t. dataframes and numpy arrays. We currently employ a streaming solution (using mmap) that lets us do this work on one node; being able to scale up to multiple nodes without needing code changes in dependent areas of our project would be an instant win for us.
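For concreteness, here is a minimal sketch of that drop-in property (illustrative only, not our actual code): the same reduction written once with plain numpy and once with dask.array.

```python
import numpy as np
import dask.array as da

# Single-node numpy version (what an mmap-based streaming setup feeds today).
x = np.random.random((4_000, 4_000))
result = (x + x.T).mean(axis=0)

# Dask version: the expression is identical, but the array is split into
# chunks the scheduler can spread across many workers, so no code changes
# are needed in dependent areas.
dx = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))
dresult = (dx + dx.T).mean(axis=0).compute()
```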
From an integration point of view, I found the Dask + Prefect video informative; a couple of things in it stood out to me as desirable from any integration.
Unfortunately I don't have much more to add other than opinions at the moment; however, as we rework our stack to work with Flyte, there's a high chance we find ourselves going down this path, in which case we'll keep you apprised.
This is a great summary; let me chalk out the effort and see when we can accommodate this.
Also, we can always start with a flytekit plugin, which should be simple.
@kumare3 Quick update on this: we are working on a flytekit plugin and have a working prototype. We will test it out in the upcoming week(s) and, if we see it's working, we will open a PR to add it.
I've also looked into creating a backend plugin and have a working prototype capable of managing the cluster lifecycle. Currently, this is waiting on https://github.com/dask/dask-kubernetes/issues/483 (basically a job CRD, similar to the SparkApplication CRD).
@bstadlbauer this is awesome. Please let us know how we can help. There is some momentum now in adding Flyte + Ray support. We will also be working on reusing Ray clusters across multiple tasks in a Flyte workflow. Once you have your dask plugin, we will start modifying things towards this common way of reusing clusters.
@kumare3 Great, thank you! Reusing clusters would be super helpful!
I've looked at the spark plugin as an example and wanted to confirm something: as far as I understood it, the plugin will watch one K8s object, requiring that this object also include the client code (e.g. a pod interacting with the cluster), right? In the Spark case, this would be the SparkApplication CRD, which submits the Spark job.
I'm asking because at the moment, the dask operator only offers setting up the cluster (scheduler + workers) through a CRD, but does not run any user code yet. This will be added in https://github.com/dask/dask-kubernetes/issues/483, but that would mean the Flyte backend plugin would have to wait for that to be ready, right?
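For reference, what the operator offers today looks roughly like this (a sketch; parameter names follow the dask-kubernetes operator API and may drift between releases):

```python
from dask_kubernetes.operator import KubeCluster
from distributed import Client

# The operator creates only the cluster (scheduler + workers) via its
# DaskCluster CRD; the user code connecting below still runs elsewhere.
cluster = KubeCluster(
    name="my-cluster",  # becomes the DaskCluster resource name
    image="ghcr.io/dask/dask:latest",
    n_workers=4,
)

# This client-side piece is exactly what the job CRD from
# https://github.com/dask/dask-kubernetes/issues/483 would also run in-cluster.
client = Client(cluster)
print(client.submit(sum, [1, 2, 3]).result())
```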
@bstadlbauer the backend plugin is flexible. Spark is peculiar because it starts the cluster and runs the app. We actually prefer that you can run a separate driver, as that can speed up Flyte even more and give fantastic control (a lesson learnt through many issues with Spark).
Flyte can run the user code as a separate pod and then monitor it. This also helps with reuse.
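A hedged sketch of that separate-driver pattern from the task's point of view: assume Flyte provisions the cluster, runs the code below in its own pod, and injects the scheduler address (DASK_SCHEDULER_ADDRESS is the convention distributed's Client honors, but treat the wiring here as illustrative):

```python
import os

from distributed import Client


def my_dask_program() -> int:
    # Connect to the externally managed scheduler instead of starting one.
    client = Client(os.environ["DASK_SCHEDULER_ADDRESS"])
    total = client.submit(sum, range(100)).result()
    # The driver pod exits when done; the cluster can be torn down or handed
    # to the next task independently, which is what enables reuse.
    return total
```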
@kumare3 Oh that's nice! Is there a plugin that does this already?
Not today, but we are working on a Ray plugin. Let me add you to a Slack thread.
Quick update: the DaskJob resource (similar to the SparkApplication resource) will be used to deploy the cluster and run the code.
Quick update from my end. I had some time this weekend to finish things. Sorry for this taking so long; the last few weeks have been quite busy.
Overall, this would be the order in which the PRs need to go in:
@kumare3 @hamersaw
All PRs are in, closing this task. Thanks again for this awesome contribution @bstadlbauer! 🚀
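For anyone landing here later, usage of the merged plugin looks roughly like this (names per flytekitplugins-dask at the time of writing; check the plugin docs for the current API):

```python
from flytekit import Resources, task
from flytekitplugins.dask import Dask, Scheduler, WorkerGroup


@task(
    task_config=Dask(
        scheduler=Scheduler(limits=Resources(cpu="1", mem="2Gi")),
        workers=WorkerGroup(
            number_of_workers=4,
            limits=Resources(cpu="4", mem="8Gi"),
        ),
    ),
)
def my_dask_program() -> int:
    from distributed import Client

    # Inside the task, Client() picks up the provisioned scheduler
    # (typically via DASK_SCHEDULER_ADDRESS set in the job pod).
    client = Client()
    return client.submit(sum, range(100)).result()
```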
Why would this plugin be helpful to the Flyte community
Users could write very short-running distributed array jobs using Dask. This makes it possible to have very small runtime jobs multiplexed onto the same set of nodes.
Type of Plugin
Can you help us with the implementation?
Additional context
This would really help express some ideas that do not fit Spark and are not as heavyweight as Flyte batch jobs.