Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0
8.26k stars 774 forks

Errors resuming a workflow #1958

Closed robertdj closed 3 months ago

robertdj commented 3 months ago

I am starting to use Metaflow and I'm generally enjoying it. However, I am having a problem using the resume command, which makes my local development really slow.

Unfortunately, I'm having trouble narrowing this down to a minimal example, but I hope my generic explanation is sufficient.

Consider the flow from the resume help page https://docs.metaflow.org/metaflow/debugging#how-to-use-the-resume-command

from metaflow import FlowSpec, step

class DebugFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.a, self.b)

    @step
    def a(self):
        self.x = 1
        self.next(self.join)

    @step
    def b(self):
        self.x = int('2fail')  # deliberately invalid literal: raises ValueError so this step fails
        self.next(self.join)

    @step
    def join(self, inputs):
        print('a is %s' % inputs.a.x)
        print('b is %s' % inputs.b.x)
        print('total is %d' % sum(input.x for input in inputs))
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    DebugFlow()
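
Step b in the flow above fails because int('2fail') raises a ValueError, which is what makes the flow a useful target for resume. The following is a minimal sketch in plain Python, outside Metaflow, of the failure and one possible fix. The helper names parse_x and try_parse are hypothetical, and replacing the literal with '2' is only an assumption about what "fixing b" means here.

```python
def parse_x(raw: str) -> int:
    """Hypothetical helper mirroring the assignment made in step b."""
    return int(raw)


def try_parse(raw: str):
    """Return (value, error_message); exactly one of the two is None."""
    try:
        return parse_x(raw), None
    except ValueError as exc:
        return None, str(exc)


# The original literal from step b cannot be parsed as an integer.
broken_value, broken_error = try_parse('2fail')

# Assumed fix: use a literal that is a valid integer.
fixed_value, fixed_error = try_parse('2')

print('broken:', broken_value, broken_error)
print('fixed:', fixed_value, fixed_error)
```

With a change like this applied to step b, a resumed run has a step that can actually complete.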

If I run

python debug.py resume

it picks up at the correct step. If I fix the b step and the flow runs successfully, I can also resume, and Metaflow clones all results. If I resume and specify a step (python debug.py resume b), three characteristic things happen:

savingoyal commented 3 months ago

We have a bug fix in flight: #1956

robertdj commented 3 months ago

Wow! What a quick answer :-) Great to have a fix -- it seems to work here.
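
For anyone retesting after the fix, here is a minimal sketch of how the resume invocations discussed in this thread could be assembled from a script. The helper build_resume_command is hypothetical and not part of Metaflow. The file name debug.py and the step name b come from the original post. The --origin-run-id option is, as far as I know from the Metaflow debugging docs, the way to resume a specific earlier run, so treat it as an assumption and check it against your installed version. The sketch only builds the argument lists and does not execute anything.

```python
import sys
from typing import List, Optional


def build_resume_command(
    flow_file: str,
    step: Optional[str] = None,
    origin_run_id: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: build the argv for a Metaflow resume call."""
    cmd = [sys.executable, flow_file, "resume"]
    if step is not None:
        # Resume from a specific step, e.g. "python debug.py resume b".
        cmd.append(step)
    if origin_run_id is not None:
        # Assumed option: resume a specific earlier run instead of the latest.
        cmd += ["--origin-run-id", str(origin_run_id)]
    return cmd


plain_resume = build_resume_command("debug.py")
step_resume = build_resume_command("debug.py", step="b")
run_resume = build_resume_command("debug.py", step="b", origin_run_id="3")

for command in (plain_resume, step_resume, run_resume):
    print(" ".join(command))
```

To actually run one of these, the list can be passed to subprocess.run(command, check=True) from the directory that contains debug.py.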