argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Support Step/Task output caching and referencing in other workflows #2157

Open sarabala1979 opened 4 years ago

sarabala1979 commented 4 years ago

Summary

Cached step/task outputs (parameters or artifacts) can be referenced from another workflow to avoid re-executing the same step, which saves time and resources.

Similar issue #944

Motivation

In ETL and ML use cases, some steps/tasks produce the same output across workflows when given the same input. If Argo could cache the output of those steps, it could be referenced from another workflow: execution of a cached step/task would be skipped and the cached output used instead.

Proposal

  1. The template will have a cachable flag:

     - name: gen-number-list
       cachable: true
       script:
         image: python:alpine3.6
         command: [python]
         source: |
           import json
           import sys
           json.dump([i for i in range(20, 31)], sys.stdout)
  2. Create a new CRD which will hold the node status of the latest succeeded template execution:

     apiVersion: argoproj.io/v1alpha1
     kind: CachedNodeStatus
     metadata:
       name: retry-to-completion
       namespace: argo
       labels:
         lastExecution: "02/04/2020 19:30"
     spec:
       boundaryID: steps-6c4tm
       displayName: hello1
       finishedAt: "2020-02-04T06:22:28Z"
       id: steps-6c4tm-1651667224
       inputs:
         parameters:
         - name: message
           value: hello1
       message: 'failed to save outputs: Failed to establish pod watch: unknown (get pods)'
       name: steps-6c4tm[0].hello1
       phase: Error
       startedAt: "2020-02-04T06:22:09Z"
       templateName: whalesay
       type: Pod
  3. Cache reference: the consuming template sets a fetchFromCache flag (a combined sketch follows this list):

     - name: gen-number-list
       fetchFromCache: true
       script:
         image: python:alpine3.6
         command: [python]
         source: |
           import json
           import sys
           json.dump([i for i in range(20, 31)], sys.stdout)
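Putting the proposal together, here is a minimal end-to-end sketch of a producer and a consumer workflow, assuming the proposed cachable and fetchFromCache fields and a controller that matches cached nodes on template name plus inputs. None of these fields exist in Argo today; all names are illustrative.

  # Producer workflow: runs the template; under the proposal the controller
  # would persist this node's status/outputs in a CachedNodeStatus object.
  apiVersion: argoproj.io/v1alpha1
  kind: Workflow
  metadata:
    generateName: gen-numbers-producer-
  spec:
    entrypoint: gen-number-list
    templates:
    - name: gen-number-list
      cachable: true                 # proposed flag, not existing API
      script:
        image: python:alpine3.6
        command: [python]
        source: |
          import json
          import sys
          json.dump([i for i in range(20, 31)], sys.stdout)
  ---
  # Consumer workflow: under the proposal the controller would skip execution
  # and substitute the cached outputs when a matching CachedNodeStatus exists.
  apiVersion: argoproj.io/v1alpha1
  kind: Workflow
  metadata:
    generateName: gen-numbers-consumer-
  spec:
    entrypoint: gen-number-list
    templates:
    - name: gen-number-list
      fetchFromCache: true           # proposed flag, not existing API
      script:
        image: python:alpine3.6
        command: [python]
        source: |
          import json
          import sys
          json.dump([i for i in range(20, 31)], sys.stdout)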
TekTimmy commented 4 years ago

I agree it's not a bad idea, but with Argo you are responsible for the data flow: you copy the results of a step into S3 and let all dependent steps copy the data back. I use Amazon EKS, where EBS volumes are restricted to "ReadWriteOnce" access, which means a volume can be mounted on one node only.

What could be possible technically is separation and aggregation of artifacts. Separation would mean copying data from one volume to many other volumes, and aggregation would mean copying from many volumes to one. This would allow a single step to produce results that are processed in parallel by the next step without using S3 buckets in between. With EKS there is still the restriction that an EBS volume is tied to its Availability Zone (AZ), which is a problem for aggregation when the volumes have been created in different AZs.
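For reference, a minimal sketch of the S3 hand-off described above, using Argo's existing input/output artifacts. The endpoint, bucket, and key are placeholders, and credentials are omitted (they would normally come from accessKeySecret/secretKeySecret or the cluster's default artifact repository).

  apiVersion: argoproj.io/v1alpha1
  kind: Workflow
  metadata:
    generateName: s3-handoff-
  spec:
    entrypoint: main
    templates:
    - name: main
      steps:
      - - name: produce
          template: produce
      - - name: consume
          template: consume
    - name: produce
      script:
        image: python:alpine3.6
        command: [python]
        source: |
          import json
          json.dump([i for i in range(20, 31)], open("/tmp/result.json", "w"))
      outputs:
        artifacts:
        - name: result
          path: /tmp/result.json
          s3:
            endpoint: s3.amazonaws.com            # placeholder endpoint
            bucket: my-artifact-bucket            # placeholder bucket
            key: cache/gen-number-list/result.json
    - name: consume
      inputs:
        artifacts:
        - name: result
          path: /tmp/result.json
          s3:
            endpoint: s3.amazonaws.com
            bucket: my-artifact-bucket
            key: cache/gen-number-list/result.json
      script:
        image: python:alpine3.6
        command: [python]
        source: |
          import json
          print(json.load(open("/tmp/result.json")))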