flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0

[Backend][Plugin]Support for Dask clustered tasks in Flyte #427

Closed kumare3 closed 1 year ago

kumare3 commented 4 years ago

Why would this plugin be helpful to the Flyte community? Users could write very short-running distributed array jobs using Dask. This makes it possible to multiplex many jobs with very small runtimes onto the same set of nodes.
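As a rough illustration of the "array job" pattern described above (this is not Flyte or Dask code, just a single-machine stand-in using only the standard library): the same short-running function is mapped over many small inputs, and one shared pool of workers is multiplexed across all of them.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for a very short-running task body.
    return sum(chunk)

def run_array_job(chunks, max_workers=4):
    # One shared pool of workers is reused across many tiny jobs,
    # analogous to multiplexing short tasks onto the same set of nodes
    # instead of paying per-job startup cost each time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_chunk, chunks))

print(run_array_job([[1, 2], [3, 4], [5, 6]]))  # [3, 7, 11]
```

With Dask the same shape of work would be distributed across a cluster of nodes rather than threads on one machine.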

Type of Plugin

Can you help us with the implementation?

Additional context: This would really help express workloads that are neither a fit for Spark nor heavyweight like Flyte batch jobs.

kumare3 commented 2 years ago

Dask can be deployed to Kubernetes; the template is shown here. Allowing this would help users a lot and enable writing really short tasks. This, coupled with cluster reuse (coming later) or cluster gateways (daskhub), plus support for a Coiled task in the future, would enable users to use Dask more effectively and make Flyte + Dask work well together.

@task(
    config=Dask(
        workers=4,
        worker_resources=....,
        # Optional: worker_pod_template=...
        # Also, the command should probably be hard-coded client side?
    ),
    resources=Resources(....),  # Driver resources
)
def my_dask_program():
    pass
ossareh commented 2 years ago

This was discussed a little on slack: https://app.slack.com/client/TN89P6GGK/CNMKCU6FR/thread/CNMKCU6FR-1648660418.322249

Currently we do not use Dask, nor do we use Coiled's hosted platform for Dask. However, both are really interesting to us in terms of migrating away from our current, home-rolled workflow orchestration solution and having someone else run our workloads for us.

The primary interest in Dask is its drop-in nature with respect to dataframes and NumPy arrays. We currently employ a streaming solution (using mmap) that lets us do this work on one node; being able to scale up to multiple nodes without needing code changes in dependent areas of our project would be an instant win for us.
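A toy sketch of the single-node streaming idea mentioned above, under the assumption that the data is a flat file of little-endian 64-bit floats (the file format and function name are illustrative, not from the original comment): the file is memory-mapped and consumed record by record, so the whole array is never materialized in memory at once.

```python
import mmap
import struct
import tempfile

def stream_sum(path, itemsize=8):
    # Memory-map the file and stream over fixed-size records; the OS
    # pages data in on demand, so memory use stays flat regardless of
    # file size.
    total = 0.0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, len(mm), itemsize):
                (value,) = struct.unpack_from("<d", mm, offset)
                total += value
    return total

# Write three float64 values, then stream-sum them.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(struct.pack("<3d", 1.0, 2.0, 3.0))
print(stream_sum(tmp.name))  # 6.0
```

The appeal of Dask here is that the same dataframe/array code could instead be partitioned across nodes, with no mmap plumbing in the application code.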

From an integration point of view I found the Dask + Prefect video informative; a couple of things in it stand out to me as desirable from any integration.

Unfortunately I don't have much more to add other than opinions currently; however, as we rework our stack to work with Flyte, there's a high chance we find ourselves going down this path, in which case we'll keep you apprised.

kumare3 commented 2 years ago

This is a great summary. Let me scope out the effort and see when we can accommodate this.

kumare3 commented 2 years ago

Also, we can always start with a flytekit plugin, which should be simple.

bstadlbauer commented 2 years ago

@kumare3 Quick update on this: we are working on a flytekit plugin and have a working prototype. We will test it out in the upcoming week(s), and if it works well we will open a PR to add it.

I've also looked into creating a backend plugin and have a working prototype capable of managing the cluster lifecycle. Currently, this is waiting on https://github.com/dask/dask-kubernetes/issues/483 (basically a job CRD, similar to the SparkApplication CRD).

kumare3 commented 2 years ago

@bstadlbauer this is awesome. Please let us know how we can help. There is some momentum now in adding Flyte + Ray support, and we will also be working on reusing a Ray cluster across multiple tasks in a Flyte workflow. Once you have your Dask plugin, we will start modifying things towards this common way of reusing clusters.

bstadlbauer commented 2 years ago

@kumare3 Great, thank you! Reusing clusters would be super helpful!

I've looked at the Spark plugin as an example and wanted to confirm something: as far as I understood it, the plugin watches one K8s object, which requires that this object also include the client code (e.g. a pod interacting with the cluster), right? In the Spark case this is the SparkApplication CRD, which submits the Spark job. I'm asking because at the moment the dask operator only offers to set up the cluster (scheduler + workers) through a CRD, but does not run any user code yet. That will be added in https://github.com/dask/dask-kubernetes/issues/483, but it would mean the Flyte backend plugin has to wait on that to be ready, right?
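For context, a cluster-only manifest for the dask operator looks roughly like the sketch below. The field names here are assumptions for illustration, not the authoritative schema; check the dask-kubernetes operator docs for the real one.

```yaml
# Illustrative only -- field names are assumptions; see the
# dask-kubernetes operator docs for the actual DaskCluster schema.
apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: example-cluster
spec:
  worker:
    replicas: 4
    spec: {}        # worker pod spec goes here
  scheduler:
    spec: {}        # scheduler pod spec goes here
# Note: nothing in this object runs user code -- that is exactly the
# gap the job CRD proposed in dask-kubernetes#483 is meant to fill.
```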

kumare3 commented 2 years ago

@bstadlbauer the backend plugin is flexible. Spark is peculiar because it starts the cluster and runs the app. We actually prefer that you run a separate driver, as that can speed up Flyte even more and give fantastic control (a lesson learned through many issues with Spark).

Flyte can run the user code as a separate pod and then monitor it. This also helps with reuse.

bstadlbauer commented 2 years ago

@kumare3 Oh that's nice! Is there a plugin that does this already?

kumare3 commented 2 years ago

Not today, but we are working on a Ray plugin. Let me add you to a Slack thread.

bstadlbauer commented 2 years ago

Quick update:

bstadlbauer commented 1 year ago

Quick update from my end: I had some time this weekend to finish things. Sorry this has taken so long; the last few weeks have been quite busy.

Overall, this would be the order in which the PRs need to go in:

  1. https://github.com/flyteorg/flyteidl/pull/339
  2. https://github.com/flyteorg/flyteplugins/pull/275
  3. https://github.com/flyteorg/flytepropeller/pull/508
  4. https://github.com/flyteorg/flytekit/pull/1366
  5. https://github.com/flyteorg/flyte/pull/3145
  6. https://github.com/flyteorg/flytesnacks/pull/929

@kumare3 @hamersaw

cosmicBboy commented 1 year ago

All PRs are in, closing this task. Thanks again for this awesome contribution @bstadlbauer ! 🚀