Netflix / metaflow

:rocket: Build and manage real-life ML, AI, and data science projects with ease!
https://metaflow.org
Apache License 2.0
8.02k stars, 752 forks

Resume from failed step on AWS step functions #480

Open zexuan-zhou opened 3 years ago

zexuan-zhou commented 3 years ago

I know we can resume a job (run by AWS Step Functions) from a failed step with a CLI command on a local machine. But is there a way to resume a failed step directly on AWS Step Functions, without going through the CLI on a local machine?

savingoyal commented 3 years ago

Hi @zexuan-zhou, Unfortunately, AWS Step Functions doesn't provide any native capability to restart a failed execution. However, theoretically, it's feasible to create a parameterized flow (which takes in run_id and step) that copies over the state from run_id for all steps prior to step and executes the remainder of the flow as usual.

zexuan-zhou commented 3 years ago

Hi @savingoyal, thank you. Would you mind explaining a bit more? Are you saying that I should create another Metaflow flow and deploy it to AWS Step Functions, where it simply copies the state of a failed flow and restarts it? I think I understand the logic, but I can't picture what that flow would look like.

savingoyal commented 3 years ago

@zexuan-zhou Apologies for the delayed response. It depends on your exact use case, but you can access values from previous executions and assign those to your step using the client API. It will be a bit clunky tbh.
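A minimal sketch of what "copy the state from a previous run" could look like with the client API. The helper names, the hand-maintained `step_order` list, and the artifact names are all hypothetical; it assumes a linear flow and that the deployed run is addressable as `FlowName/run_id` (on Step Functions the run ID typically starts with `sfn-`). It is not an official resume mechanism.

```python
def load_run(flow_name, run_id):
    """Fetch a past run through the Metaflow client API."""
    # Imported lazily so the helper below can be tried without Metaflow installed.
    from metaflow import Run

    return Run(f"{flow_name}/{run_id}")


_MISSING = object()


def artifacts_before(run, step_order, resume_step, names):
    """Collect the named artifacts from every step that precedes resume_step.

    step_order is a hypothetical, hand-maintained list of step names for a
    linear flow. Values from later steps overwrite earlier ones, mirroring
    how artifacts on `self` evolve from step to step.
    """
    if resume_step not in step_order:
        raise ValueError(f"unknown step: {resume_step}")
    state = {}
    for step_name in step_order[: step_order.index(resume_step)]:
        # run[step_name].task.data exposes the artifacts of a finished task.
        data = run[step_name].task.data
        for name in names:
            value = getattr(data, name, _MISSING)
            if value is not _MISSING:
                state[name] = value
    return state
```

Inside the parameterized flow, the first step could call `artifacts_before(load_run(...), ...)` with the `run_id` and `step` parameters and assign each returned value to `self` before continuing with the remaining steps.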

tarunrao541 commented 1 year ago

Hi @zexuan-zhou, how do you resume a job (run by AWS Step Functions) from a failed step using a CLI command? Could you share the command with me?

Start -> Activity1 -> Activity2 -> Activity3 -> Activity4 -> Stop

When an execution fails during some activity, let's say Activity2, the execution is marked as failed.

Now, is there any way to resume this failed execution from the activity (Activity2) during which it failed earlier?
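For a Metaflow flow, the local command in question is `resume`, which takes an optional step name and an `--origin-run-id` option. A small sketch that assembles and launches it from Python; the file name, step name, and run ID in the usage line are placeholders, and the resumed run executes from the machine where you invoke it, not on Step Functions.

```python
import subprocess
import sys


def build_resume_command(flow_file, step=None, origin_run_id=None):
    """Build the argv list for `python <flow_file> resume [step] [--origin-run-id ID]`."""
    cmd = [sys.executable, flow_file, "resume"]
    if step:
        # Without a step name, resume restarts from the step that failed.
        cmd.append(step)
    if origin_run_id:
        # Without this option, resume uses the latest local run as the origin.
        cmd += ["--origin-run-id", origin_run_id]
    return cmd


def resume(flow_file, step=None, origin_run_id=None):
    """Run the resume command and raise if it exits with a non-zero status."""
    subprocess.run(build_resume_command(flow_file, step, origin_run_id), check=True)
```

Usage, with placeholder values: `resume("myflow.py", step="activity2", origin_run_id="sfn-<execution-name>")`.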

seanv507 commented 1 year ago

You can look at this: https://aws.amazon.com/blogs/compute/resume-aws-step-functions-from-any-state/

It will resume from the last successful state. Note that if Activity2 was a parallel job and one branch out of, e.g., 100 failed, then all 100 will still be rerun.
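The first part of that approach, locating where the execution stopped, can be sketched from the execution history. This is a simplified illustration, not the blog post's script: the function names are my own, and it assumes the events are in chronological order (the default for `get_execution_history`) and does not try to pinpoint a single failing branch inside Parallel or Map states.

```python
def fetch_history(execution_arn):
    """Return all history events for one execution, oldest first."""
    # Imported lazily so find_failed_state can be tried without boto3 installed.
    import boto3

    client = boto3.client("stepfunctions")
    paginator = client.get_paginator("get_execution_history")
    events = []
    for page in paginator.paginate(executionArn=execution_arn):
        events.extend(page["events"])
    return events


def find_failed_state(events):
    """Return the most recently entered state that never exited, or None."""
    open_states = []
    for event in events:
        entered = event.get("stateEnteredEventDetails")
        exited = event.get("stateExitedEventDetails")
        if entered:
            open_states.append(entered["name"])
        elif exited and exited["name"] in open_states:
            open_states.remove(exited["name"])
    return open_states[-1] if open_states else None
```

A new state machine that jumps straight to the returned state can then be started with the failed state's input, which is why a failed parallel state is rerun in full.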

tuulos commented 1 year ago

here's more discussion re: resuming production runs https://outerbounds-community.slack.com/archives/C02116BBNTU/p1675706066639959?thread_ts=1673332947.917959&cid=C02116BBNTU