I am starting to use Metaflow and I'm generally enjoying it. However, I am having a problem with the resume command, which makes my local development really slow.
Unfortunately, I'm having trouble narrowing this down to a minimal example, but I hope my generic explanation is sufficient.
    from metaflow import FlowSpec, step

    class DebugFlow(FlowSpec):

        @step
        def start(self):
            self.next(self.a, self.b)

        @step
        def a(self):
            self.x = 1
            self.next(self.join)

        @step
        def b(self):
            self.x = int('2fail')
            self.next(self.join)

        @step
        def join(self, inputs):
            print('a is %s' % inputs.a.x)
            print('b is %s' % inputs.b.x)
            print('total is %d' % sum(input.x for input in inputs))
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == '__main__':
        DebugFlow()
If I run

    python debug.py resume

it picks up at the correct step. If I fix the b step and the flow runs successfully, I can also resume, and Metaflow clones all results.
If I resume and specify a step (python debug.py resume b), three things happen:
1. Cloning happens in a seemingly random order. That is, tasks are not cloned in the order specified by the flow.
2. For the b step I specified, I get an error: "No completed attempts of the task was found for task ...".
3. For a, I get the expected message: "Cloning results of a previously run task 1723556849830407/a/2". But if I now try to resume from a instead, I get the "No completed attempts ..." error message.
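To make my expectation concrete, here is a toy model of what I understand the documented behaviour of resuming from a named step to be: the named step and everything downstream of it is rerun, and every other step is cloned from the previous run. This is only a sketch of my mental model, not Metaflow's actual implementation; GRAPH and plan_resume are hypothetical names I made up for illustration, and the graph is hand-copied from the flow above.

```python
from collections import deque

# Step graph of DebugFlow, hand-copied from the flow definition:
# each step maps to the steps it transitions to via self.next(...).
GRAPH = {
    "start": ["a", "b"],
    "a": ["join"],
    "b": ["join"],
    "join": ["end"],
    "end": [],
}


def plan_resume(graph, resume_step):
    """Return (cloned, rerun) step lists for a resume from `resume_step`.

    Assumption: the resume step and all steps reachable from it are
    rerun; every other step is cloned from the earlier run.
    """
    if resume_step not in graph:
        raise ValueError("unknown step: %s" % resume_step)

    # Breadth-first walk to collect the resume step and its descendants.
    rerun = {resume_step}
    queue = deque([resume_step])
    while queue:
        current = queue.popleft()
        for nxt in graph[current]:
            if nxt not in rerun:
                rerun.add(nxt)
                queue.append(nxt)

    # Report both groups in the order the steps appear in the graph.
    cloned = [s for s in graph if s not in rerun]
    rerun_ordered = [s for s in graph if s in rerun]
    return cloned, rerun_ordered


if __name__ == "__main__":
    cloned, rerun = plan_resume(GRAPH, "b")
    print("cloned:", cloned)
    print("rerun:", rerun)
```

Under this model, resuming from b should clone start and a and rerun b, join and end. That matches the "Cloning results ..." message I see for a, but it does not explain the "No completed attempts ..." error for b.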
For reference, the flow above is the example from the resume help page: https://docs.metaflow.org/metaflow/debugging#how-to-use-the-resume-command