Netflix / metaflow

:rocket: Build and manage real-life ML, AI, and data science projects with ease!
https://metaflow.org
Apache License 2.0
7.8k stars 738 forks source link

Support for flows with intersecting branches #182

Open miloszbednarzak opened 4 years ago

miloszbednarzak commented 4 years ago

I'd like to implement flow in which its branches intersects with each other like: test-graph

Code: ```python from metaflow import FlowSpec, step class TestFlow(FlowSpec): @step def start(self): self.next(self.a, self.b) @step def a(self): self.next(self.a1, self.a2) @step def a1(self): self.next(self.a1b1) @step def a2(self): self.next(self.a2b2) @step def b(self): self.next(self.b1, self.b2) @step def b1(self): self.next(self.a1b1) @step def b2(self,): self.next(self.a2b2) @step def a1b1(self, inputs): self.merge_artifacts(inputs) self.next(self.join) @step def a2b2(self, inputs): self.merge_artifacts(inputs) self.next(self.join) @step def join(self, inputs): self.merge_artifacts(inputs) self.next(self.end) @step def end(self): pass if __name__ == "__main__": TestFlow() ```

It throws error:

Step a1b1 joins steps from unrelated splits. Ensure that there is a matching join for every split.

I know I can reimplement this like: test-graph2

Code: ```python from metaflow import FlowSpec, step class TestFlow(FlowSpec): @step def start(self): self.next(self.a, self.b) @step def a(self): self.next(self.a1, self.a2) @step def a1(self): self.next(self.a12) @step def a2(self): self.next(self.a12) @step def b(self): self.next(self.b1, self.b2) @step def b1(self): self.next(self.b12) @step def b2(self,): self.next(self.b12) @step def a12(self, inputs): self.merge_artifacts(inputs) self.next(self.as_bs) @step def b12(self, inputs): self.merge_artifacts(inputs) self.next(self.as_bs) @step def as_bs(self, inputs): self.merge_artifacts(inputs) self.next(self.end) @step def end(self): pass if __name__ == "__main__": TestFlow() ```

In my case looking from a perspective of visual graph second implementation looks cleaner, but from implementation perspective it brings unnecessary additional steps.

Are there any plans to add support for this kind of Flows? Thanks!

savingoyal commented 4 years ago

@miloszbednarzak I am curious what would be the use case for the first graph? The reason we lean for graphs of later nature is that it's easier to reason about their execution characteristics.

miloszbednarzak commented 4 years ago

@savingoyal Let's say that functions in steps a1 and b1 outputs variables of kind one, and a2,b2 accordingly. I wanted in step a1b1 and a2b2 to group all data coming from input branches by their kinds. So if the first option was possible I could do something like that:

# in a1b1 step
ones_data = [input.one_kind for input in inputs]
# in a2b2 step
twos_data = [input.two_kind for input in inputs]

In second, legitimate variant I need to group them in as_bs step like this:

ones_data = [self.a1_data, self.b1_data]
twos_data = [self.a2_data, self.b2_data]

In this toy example there is not much difference, but my actual graph is much more complex and I'd like to group data by kind not by using their artefact variable name in grouping operation, but by pointing output to the step which will group all inputs automatically.

savingoyal commented 4 years ago

Makes sense. We are discussing a number of enhancements to the flow structure, including introducing the notion of sub workflows, but the timing is TBD.

valayDave commented 4 years ago

@savingoyal : Are you guys planning to add dynamic DAG support ?

savingoyal commented 4 years ago

@valayDave The exact implementation, as well as UX for sub-workflows, is TBD.