argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Feature Request: dedupe / memoize steps #1054

Closed bryanlarsen closed 4 years ago

bryanlarsen commented 6 years ago

Feature Request

If the inputs, container, command, etc. for a workflow step are all identical to those of a step performed in a previous workflow, the step should be skipped and the output from the previous step used instead.
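As a sketch, a memoized step might be declared something like this. The `memoize` field and the cache name below are hypothetical, not existing Argo syntax; the point is that the cache key is derived from the step's inputs:

```yaml
# Hypothetical syntax: the step declares a cache key derived from its
# inputs; on a key hit the pod is skipped and cached outputs are reused.
- name: expensive-step
  inputs:
    parameters:
      - name: message
  memoize:                               # hypothetical field
    key: "{{inputs.parameters.message}}"
    cache:
      configMap:
        name: expensive-step-cache       # placeholder cache name
  container:
    image: docker/whalesay:latest
    command: [sh, -c]
    args: ["cowsay {{inputs.parameters.message}} > /tmp/out.txt"]
  outputs:
    parameters:
      - name: result
        valueFrom:
          path: /tmp/out.txt
```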

I asked for this feature in slack (https://argoproj.slack.com/messages/C8J6SGN12/convo/C8J6SGN12-1539958408.000100/) and the response from Ed Lee was that I could add a shim to our step to do this. While possible, this is definitely suboptimal: the inputs are often tens, and possibly even hundreds, of megabytes that would have to be downloaded and fingerprinted just to do no work.
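For contrast, the shim approach pushed into the workflow itself would look roughly like the following. Template names and the hit/miss protocol are invented for illustration; a cheap check step reports whether the work was already done, and the expensive step is gated with a `when` expression (for a script template, Argo exposes stdout as `outputs.result`):

```yaml
# Sketch of the shim workaround: a cheap "check" step prints "hit" or
# "miss", and the expensive step only runs on a miss. The check-cache
# template itself (fingerprinting the inputs) is left out here.
- name: main
  steps:
    - - name: check-cache
        template: check-cache            # hypothetical template
    - - name: do-work
        template: expensive-step
        when: "{{steps.check-cache.outputs.result}} == miss"
```

Even then, a content-based fingerprint still forces the check step to download the inputs, which is exactly the cost we want to avoid.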

If such a feature would be welcome in Argo, we would be interested in developing it and opening a PR. Before we start, though, we'd like to know whether such a PR would be accepted. Perhaps more significantly, does the architecture of Argo make this a difficult task? For instance, if something related but seemingly much more trivial, such as https://github.com/argoproj/argo/issues/990, is hard to do, then our request may be as well. Are we better off just writing our own Argo-lite?

edlee2121 commented 6 years ago

This would be a fantastic feature! Would make it much simpler to implement dynamic programming workflows.

andreimc commented 6 years ago

@bryanlarsen I am doing something similar and I ran into some issues: https://github.com/argoproj/argo/issues/1073. My use case was being able to retry failed steps while carrying the old successful steps over. This might help: https://github.com/argoproj/argo/blob/master/workflow/util/util.go#L326. I am using some of the util methods from there in https://github.com/kubebuild/agent/blob/master/pkg/schedulers/build_scheduler.go#L174 to do this, and it works quite well for retrying failed jobs. Only some of the metadata gets lost for DAGs; I'm not sure exactly why yet.

alexlatchford commented 4 years ago

Hey @bryanlarsen, did you come to any conclusions on the viability of this in Argo? We're trying to investigate a similar issue, albeit 18 months later!

We're looking at the cost of adopting Kubeflow (vs. Metaflow/Flyte, both of which natively support memoization) and this looks like the likely blocker (Kubeflow uses Argo under the hood for ML workflow scheduling). Allowing caching of long-running tasks (think ETL done on Spark, for example) would give us significant speed-ups in data engineer & scientist velocity for obvious reasons, but I definitely agree it's not a trivial problem to solve!

alexec commented 4 years ago

@mukulikak ☝️

talebzeghmi commented 4 years ago

Related https://github.com/kubeflow/pipelines/issues/1509

foobarbecue commented 4 years ago

Is there any way to do this "work avoidance" pattern when using an artifact repository as opposed to a volume? I can't figure out a way to check whether an artifact exists.

alexec commented 4 years ago

See #3066

alexec commented 4 years ago

> Is there any way to do this "work avoidance" pattern when using an artifact repository as opposed to a volume? I can't figure out a way to check whether an artifact exists.

That should be possible. I'm hoping someone will contribute an example.
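As a rough sketch, assuming an S3-backed artifact repository: a script step can probe the bucket directly and emit hit/miss, and downstream steps can gate on its result. Bucket, key, and image are placeholders, and credentials are assumed to be available to the pod:

```yaml
# Sketch only: probe an S3 artifact repository for a prior output.
# Bucket name, object key, and image tag are placeholders; the pod is
# assumed to have S3 credentials (IAM role or mounted secret).
- name: artifact-exists
  script:
    image: amazon/aws-cli:latest
    command: [sh]
    source: |
      if aws s3api head-object --bucket my-artifact-bucket \
          --key outputs/expensive-step.tgz >/dev/null 2>&1; then
        printf hit    # artifact already present, work can be skipped
      else
        printf miss
      fi
```

A later step could then use `when: "{{steps.artifact-exists.outputs.result}} == miss"`, mirroring the volume-based work-avoidance pattern.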

jessesuen commented 4 years ago

Duping this to https://github.com/argoproj/argo/issues/944, which we'll be starting work on. Please send any 👍 to that issue.