Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0

metaflow failed to resume a flow #1965

Closed (xujiboy closed 2 months ago)

xujiboy commented 3 months ago

Hi, recently I noticed that Metaflow is having trouble resuming a previously failed run. My call pattern looks like this:

python my_flow resume my_step --origin-run-id 12345678

Then I got this error message, no matter which step I chose to resume:

Data store error:
    No completed attempts of the task was found for task 'MyFlow/1723829756519226/infer_train_and_test_dates/41'

From the execution log, I can see that Metaflow is not cloning the previously successful steps from the start.

This issue was observed with Metaflow version 2.12.11, but it goes away if I downgrade to 2.12.5.
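
For anyone who wants to check which side of this they are on, here is a minimal sketch that compares the installed Metaflow version against the two versions mentioned above. The helper names (`parse_version`, `resume_status`, `installed_metaflow_status`) are hypothetical and not part of Metaflow. The sketch only knows about 2.12.11 and 2.12.5, so every other version is reported as "unknown" rather than guessed.

```python
from importlib.metadata import PackageNotFoundError, version

# The only two data points from this report; other versions are unknown here.
REPORTED_BROKEN = (2, 12, 11)
REPORTED_WORKING = (2, 12, 5)


def parse_version(text):
    """Turn '2.12.11' into (2, 12, 11); only plain dotted integers are handled."""
    return tuple(int(part) for part in text.split("."))


def resume_status(text):
    """Classify a version string using only the versions named in this report."""
    parsed = parse_version(text)
    if parsed == REPORTED_BROKEN:
        return "reported broken"
    if parsed == REPORTED_WORKING:
        return "reported working"
    return "unknown"


def installed_metaflow_status():
    """Look up the installed metaflow package and classify its version."""
    try:
        return resume_status(version("metaflow"))
    except PackageNotFoundError:
        return "not installed"
    except ValueError:
        # Version strings with suffixes (e.g. pre-releases) are not parsed here.
        return "unknown"


if __name__ == "__main__":
    print(installed_metaflow_status())
```

If this prints "reported broken", pinning the version I downgraded to (for example `pip install "metaflow==2.12.5"`) is the workaround that worked for me.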

savingoyal commented 3 months ago

#1956 is addressing this bug :)