flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.41k stars 580 forks source link

Add support for Multiple Architecture types #1539

Open akhurana001 opened 2 years ago

akhurana001 commented 2 years ago

Motivation: Why do you think this is important?

Multiple cloud providers (AWS etc.) have extensive support for ARM based instances now and we want to explore running Flyte workloads on ARM instances due to multiple benefits over x86. In general, the goal is to enable platform admins to configure multiple architecture types as well as Flyte users to select between them based on which one they want to use/image support etc.

Goal: What should the final outcome look like, ideally? Platform admins be able to configure multiple architecture support within the same flyte installation and users being able to specify which one to execute on.

[Optional] Propose: Link/Inline OR Additional context The goal is to enable Platform admins to maintain both ARM and x86 ASGs on each Kubernetes cluster and to intelligently route workloads to the right ASG based on which architecture the user image was built/tested/registered for.

This can be done with adding taints to the ASGs and tolerations/node-selector to pods something Flyte already does for Interruptible enabled Tasks. However, for this use-case, it might be easier for users to tag the whole repo/sha with ARM rather than tagging individual tasks/nodes.

I was thinking we can re-use labels support at Launch-Plan level to implement this: Users can set architecture label either at project level via flyte-config or directly at a launch-plan level which will then be routed correctly to the Flyte backend for those workflow versions/sha and corresponding tolerations and node-selector be added to those pods by FlytePropeller/plugins (similar to how Interruptible works).

Let me know what your thoughts are on this.

kumare3 commented 2 years ago

Quick comment, it's flytekit tested on arm? Also let's not as anything to flytekit config, but to launchplan or workflow spec.

akhurana001 commented 2 years ago

Quick comment, it's flytekit tested on arm? Also let's not as anything to flytekit config, but to launchplan or workflow spec.

Haven't tested extensively but I did run a basic spark/python job on ARM (flytekit=0.16) and it worked.

akhurana001 commented 2 years ago

Synced-up with Haytham. Some notes:

  1. The architecture field is closely related to the image/version and is probably a better fit as part of Container IDL . It's better to have this strongly typed here as opposed to labels on launch-plans (thou overrides probably would still be label based)
  2. For UX, flyte-config is probably a good solution to default at project level.
  3. We would probably need to support overrides at launch-plan levels for larger user/projects. Ideally this should be strongly typed but label based override is ok.

For backend implementation, two potential options are:

  1. Letting FlytePropeller/FlytePlugins add tolerations/node-selector based on the architecture field similar to how interruptible works today.
  2. Let FlytePropeller/Plugins add a label to each pod and a mutating webhook handle adding tolerations/node-selector. This can even work with plugins which don't have tolerations/node-selector support exposed in the CRD.
akhurana001 commented 2 years ago

Hi @kumare3 @EngHabu @anandswaminathan , based on our discussions:

I am leaning towards adding an Explicit Architecture field in Container IDL and letting FlytePropeller/FlytePlugins to add tolerations/node-selector based on the architecture field similar to how interruptible works today.

For the UX, this field can be defaulted by users as part of register command (pyflyte register) which is preferable over the flyte-config default approach (@kumare3 ?) . Additionally, users can override this at a task level if need be for any container task.

We can also add a mutating webhook but that seems like an overkill since even for the webhook, FlytePropeller/Plugins will need to understand the arch-field and add a label to be consumed by the webhook and if plugins own this part, I rather have them add the tolerations/node-selector directly (@EngHabu ?)

Let me know if there are any concerns here else I will start on the changes.

EngHabu commented 2 years ago

I think that's a great start.. my main concern with plugins doing that (also applies to interruptible btw) is that plugins that create other CRDs that then create Pods (e.g. Distributed PyTorch) do not expose the full PodSpec but they do expose labels and it seems to be a common convention that created pods get a copy of the CRD labels (if not annotations as well)... hence why I think the WebHook model works better...

That said, I think it's ok to start this way and move to something else later on... as long as we are happy with the Idl representation and UX..

The one thing that's not clear to me from the summary is what will go into that architecture field in Container IDL? is it a single architecture or all supported architectures?

message Container {
   // An ordered list of architectures supported by this container image. The plugins that will run this container will attempt to match this list against supported machine types in the cluster and pick the first available one.
   repeated string architecture = X;

Is that what you meant? or did you mean they will just set a single architecture and the task fails if that architecture isn't available in the cluster? (they can still do that even if it's a list by setting a single item there)...

akhurana001 commented 2 years ago

Is that what you meant? or did you mean they will just set a single architecture and the task fails if that architecture isn't available in the cluster? (they can still do that even if it's a list by setting a single item there)...

I was thinking of an enum i.e. users only specify a single architecture and fail if its not present.

If we let them specify multiple, then it gets more complex into how does the Platform decide on one and what happens if the workflows fail on one but maybe succeed on another so I would prefer if we let users choose one and use that. Users would need to work with the Platform admin to understand which architectures are supported and choose.

github-actions[bot] commented 1 year ago

Hello πŸ‘‹, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! πŸ™

github-actions[bot] commented 1 year ago

Hello πŸ‘‹, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! πŸ™

github-actions[bot] commented 1 month ago

Hello πŸ‘‹, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. Thank you for your contribution and understanding! πŸ™