Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0

[argo] support for @parallel #1927

Closed valayDave closed 2 months ago

valayDave commented 2 months ago

Supported:

  • foreach + @parallel
  • @parallel with Argo
  • dynamically set worker counts
  • should work with @timeout, @project, @card, etc.
  • retries work with native Argo
  • fully self-contained JobSet with Argo support; requires JobSet v0.6.0 [kubernetes-sigs/jobset#523]

Not supported:

  • @catch

Notes

  • {{retries}} is not used directly the way it is in container templates; instead, it is passed down as an inputs.parameters entry so that it is accessible in the JobSet manifest.
  • Temporary tweak to the boto dependency to ensure that boto install failures don't fail the deployment.
  • Instead of relying on the kubernetes object, a fresh object is created in the ArgoContainer templates.
  • Code is written in the same style as the kubernetes/argo integrations, with explicit filling of variables and decoupled abstractions.
  • Annotations are set explicitly, as they won't be passed down from the WorkflowTemplate level.
  • Support for JobSet-native success conditions (requires JobSet v0.6 on the controller).
  • Refactors that have gone into this commit:
  • [argo][feedback] refactor dag template parameter/output setting (just move the conditional block around)
  • [argo][feedback] refactor references to task_id_base to task_id_entropy (these are set/used in the Argo outputs and variable names)
  • [argo][feedback] refactor references to task-id-base to task-id-entropy (these are used as Argo parameter names)
  • [argo][feedback] refactor to match code style
  • [argo][feedback] refactor to match code style (refactor some conditionals)
  • [argo][feedback] remove k8s client and make KubernetesArgoJobSet directly use kubernetes_sdk
  • [argo][feedback] added environment_variables_from_selectors for code simplification
  • [argo][feedback] fix comment
  • [argo][feedback] refactor condition for readability
  • [argo][feedback] rollback temp boto3 installation change in metaflow env
  • [argo][feedback] remove rogue type hint
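
To illustrate the note about passing {{retries}} down as an inputs.parameters entry, here is a minimal sketch in plain Python dicts. It is not the code in this PR. The parameter name retryCount, the annotation key example.com/retry-count, and both helper functions are hypothetical, chosen only for illustration. The PR also does not say where {{retries}} gets resolved, so the sketch just assumes a caller that has it in scope.

```python
import json

# Hypothetical parameter name; the PR does not state the name it uses.
RETRY_PARAM = "retryCount"


def jobset_resource_template(name):
    """Build an Argo resource template whose JobSet manifest reads the retry
    count from an input parameter instead of referencing {{retries}} itself."""
    manifest = {
        "apiVersion": "jobset.x-k8s.io/v1alpha2",
        "kind": "JobSet",
        "metadata": {
            "generateName": name + "-",
            # Hypothetical annotation, used only to show where the value lands.
            "annotations": {
                "example.com/retry-count": "{{inputs.parameters.%s}}" % RETRY_PARAM
            },
        },
    }
    return {
        "name": name,
        # Declare the parameter so the manifest below can reference it.
        "inputs": {"parameters": [{"name": RETRY_PARAM}]},
        # Argo expects the manifest as a string; JSON is also valid YAML.
        "resource": {"action": "create", "manifest": json.dumps(manifest)},
    }


def calling_task(task_name, template_name):
    """Build the calling task, which forwards {{retries}} as an argument.
    Assumes the caller is somewhere {{retries}} is in scope."""
    return {
        "name": task_name,
        "template": template_name,
        "arguments": {
            "parameters": [{"name": RETRY_PARAM, "value": "{{retries}}"}]
        },
    }


if __name__ == "__main__":
    print(json.dumps(jobset_resource_template("train"), indent=2))
    print(json.dumps(calling_task("train-task", "train"), indent=2))
```

The point of the indirection is that the manifest only ever refers to its own declared input, so the same JobSet template can be filled in by whichever caller knows the current retry count.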