Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0
8.26k stars 774 forks source link

[Ready for Review] Fix issue where resuming on successful run will fail. #1956

Closed darinyu closed 3 months ago

darinyu commented 3 months ago

when running resume with python flow.py resume {step_name}, user may want to rerun the "step" and continue with the execution. In current code, we will blindly copy everything as long as the step was successful.

The proposed change will look at step in topological order and skip cloning the ones on and after the specified steps (and then rerun everything if possible).

savingoyal commented 3 months ago

added a tiny window dressing PR related to this in #1963