DAGWorks-Inc / hamilton

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.
https://hamilton.dagworks.io/en/latest/
BSD 3-Clause Clear License

node that is always re-run #90

Closed -- gravesee closed this issue 2 months ago

gravesee commented 1 year ago

Is your feature request related to a problem? Please describe. For a documentation project I am using Hamilton to generate Word output using the python-docx package. I have nodes representing sections of the Word document. These sections are organized into their own modules. The function code takes a docx.Document instance that should not be shared between other sections. Once the sections are all computed, I use a GraphAdapter to combine them into a single Word document. Currently, I am instantiating a new instance of docx.Document in each section. I would like to put this configuration in a single node that gets re-computed each time it is used as a dependency.
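
For illustration, here's a minimal sketch of that setup (module layout, function names, and the template path are hypothetical):

import docx

TEMPLATE = "template.docx"  # local template file

def intro_section() -> docx.Document:
    doc = docx.Document(TEMPLATE)  # the two lines of setup repeated in every section node
    doc.add_heading("Introduction", level=1)
    return doc

def methods_section() -> docx.Document:
    doc = docx.Document(TEMPLATE)  # repeated again here
    doc.add_heading("Methods", level=1)
    return doc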

Describe the solution you'd like A decorator that instructs the driver to always recompute the node.

Describe alternatives you've considered I have tried creating an infinite generator that returns the instantiated docx.Document but the usage within other nodes is ugly (requiring a call to next)

Additional context

elijahbenizzy commented 1 year ago

Ok, I think I get what's happening. It's a side-effect, so I think we want to model it as such... I think getting it to recompute every time might be tricky (the memoization is a pretty core feature), but IMO we have quite a few options to make the code clean. Before I dig in I want to make sure that I understand it properly -- any chance you could post a little code/pseudocode to validate?

Also happy to jump on a call and walk through -- might make the back and forth faster!

gravesee commented 1 year ago

Here is a simple use case that should pass the tests if the requested feature is added to Hamilton:

def no_memo() -> dict:
    return {}

def func1(no_memo: dict) -> dict:
    no_memo['func1'] = 'val1'
    return no_memo

def func2(no_memo: dict) -> dict:
    no_memo['func2'] = 'val2'
    return no_memo

if __name__ == "__main__":
    import __main__ as main

    from hamilton.driver import Driver
    from hamilton.base import SimplePythonGraphAdapter, DictResult

    dr = Driver({}, main, adapter=SimplePythonGraphAdapter(DictResult))
    res = dr.execute(["func1", "func2"])

    print(res)
    # {'func1': {'func1': 'val1', 'func2': 'val2'},
    #  'func2': {'func1': 'val1', 'func2': 'val2'}}
    assert res['func1'] == {"func1": "val1"}
    assert res['func2'] == {"func2": "val2"}

elijahbenizzy commented 1 year ago

Yep -- got it.

Ok, so I get what you're doing/asking, but I'm not 100% convinced this is the right solution to the original problem. Of two minds:

Con:

Honestly? I'm not sure any of these will make you happy -- happy to walk you through them in more detail/prototype. Looking above, none of them would be quite as simple a way to solidify the use case as supporting it the way you want :) I'm partial to (1), as these objects are supposed to be separate, and I'm not convinced the DAG shouldn't think of them that way. I also like (4), if you don't mind the instantiation of the object being shared with the function.

Pro:

Thoughts?

gravesee commented 1 year ago

My use-case isn't a dataframe problem so I understand if it doesn't fit the Hamilton philosophy. Specifically, I have to instantiate a docx.Document and pass it a template from my local filesystem. It's two lines of code that I end up repeating in a lot of node functions. It's kind of a side-effect, but I think a similar use-case would be a node that returns the current time every time it is passed as a dependency to, say, understand the execution order and clock time of the DAG. Is that a side-effect?

I think option 2 could solve my problem fairly cleanly. The dependencies would be typed correctly so that information would be there for the user to see. And calling .get isn't that bad. But if this is still under consideration, I like the @forget decorator.

elijahbenizzy commented 1 year ago

My use-case isn't a dataframe problem so I understand if it doesn't fit the Hamilton philosophy. Specifically, I have to instantiate a docx.Document and pass it a template from my local filesystem. It's two lines of code that I end up repeating in a lot of node functions. It's kind of a side-effect, but I think a similar use-case would be a node that returns the current time every time it is passed as a dependency to, say, understand the execution order and clock time of the DAG. Is that a side-effect?

I think option 2 could solve my problem fairly cleanly. The dependencies would be typed correctly so that information would be there for the user to see. And calling .get isn't that bad. But if this is still under consideration, I like the @forget decorator.

Yep -- I don't think yours is too far out of the ordinary. Hamilton was built initially for dataframes, but we've seen a wide array of use-cases here. Now that I'm thinking about it more, instead of @forget, I'd propose @resource that takes in some sort of policy. Again, drawing from pytest fixtures :)

@resource(policy=FORGET)
def document() -> docx.Document:
    # how you want it to behave
    ...

@resource(policy=CACHE)
def document() -> docx.Document:
    # how it currently behaves
    ...

Re: side effects, I think yours barely count -- it's more state. In which case, it's just a way to make it easier to produce something of the given state, in the sense that after the dependent node gets executed the state becomes immutable. Overall it's side-effect free, so maybe it's a reasonable way to approach it?

I'd say that the DAG-tracking example should probably be some other sort of hook (e.g. through graph adapters/whatnot) rather than a dependency -- IMO it has the potential to confuse the user when visualizing the DAG... We've prototyped/are working with stuff that does exactly this internally, if you're curious.

@skrawcz thoughts on the resource decorator?

skrawcz commented 1 year ago

Just thinking through all the ways and what the code would look like -- here's the infinite generator way you describe (sorry had to write it out):

from typing import Generator

def no_memo() -> Generator[dict, None, None]:
    def dict_generator():
        yield {}
    return dict_generator

def func1(no_memo: Generator[dict, None, None]) -> dict:
    _no_memo = next(no_memo()) 
    _no_memo['func1'] = 'val1'
    return _no_memo

def func2(no_memo: Generator[dict, None, None]) -> dict:
    _no_memo = next(no_memo())
    _no_memo['func2'] = 'val2'
    return _no_memo

if __name__ == "__main__":
    import __main__ as main

    from hamilton.driver import Driver
    from hamilton.base import SimplePythonGraphAdapter, DictResult

    dr = Driver({}, main, adapter=SimplePythonGraphAdapter(DictResult))
    res = dr.execute(["func1", "func2"])

    print(res)
    assert res['func1'] == {"func1": "val1"}
    assert res['func2'] == {"func2": "val2"}

We could have some smarts around knowing the types: if it's a (basic) generator, call next() and pass that in -- effectively enabling this API:

from typing import Generator

def no_memo() -> Generator[dict, None, None]:
    def dict_generator():
        yield {}
    return dict_generator

def func1(no_memo: dict) -> dict: 
    no_memo['func1'] = 'val1'
    return no_memo

def func2(no_memo: dict) -> dict:
    no_memo['func2'] = 'val2'
    return no_memo

Thoughts? I.e., if the function takes in the generator, don't do anything; otherwise check that the annotation matches the type on the function and, if so, do the next() in the framework. I like the generator approach because I think it makes it clear to anyone reading that it's going to be called multiple times... (though I need to figure out a way to annotate things so my static analyzer doesn't complain).
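
For concreteness, a rough sketch of what that dependency-resolution logic could look like (hypothetical helper, not actual Hamilton internals):

import inspect
from typing import Any

def resolve_dependency(upstream_value: Any, consumer_wants_generator: bool) -> Any:
    if consumer_wants_generator:
        # The consumer's annotation asks for the generator itself -- pass it through untouched.
        return upstream_value
    if inspect.isgeneratorfunction(upstream_value):
        # Factory style (as in no_memo above): build a fresh generator, then take one value.
        return next(upstream_value())
    if inspect.isgenerator(upstream_value):
        # Plain generator object: advance it once to get a fresh value.
        return next(upstream_value)
    # Anything else is an ordinary (memoized) node result.
    return upstream_value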

elijahbenizzy commented 1 year ago

Intriguing, but this feels like a messy thing to put on the framework. I don't really agree that it's clear that something is getting called multiple times (maybe this is because the Generator type parameters are super confusing in Python). We don't have any other instances where the type of a function determines how it should be called -- it's all been in the decorators.

IMO if we're going to start doing more magical meta-programming, we want to limit the space to reduce cognitive burden. Decorators are meant to do that, but if we also change the output type and do something special depending on it, that's just more moving pieces to figure out how a dataflow should work.

skrawcz commented 1 year ago

Intriguing, but this feels like a messy thing to put on the framework.

It's no messier than the alternative ways of having the framework deal with this.

Decorators are meant to do that,

Good idea. This looks simpler:

@generator
def no_memo() -> dict:
    return {} # or perhaps this should be `yield {}` ?

def func1(no_memo: dict) -> dict: 
    no_memo['func1'] = 'val1'
    return no_memo

def func2(no_memo: dict) -> dict:
    no_memo['func2'] = 'val2'
    return no_memo
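
To spell out the intended semantics, here's a plain-Python sketch (the decorator and its marker attribute are hypothetical, not an existing Hamilton API): the decorated node would be re-invoked for every consumer instead of being computed once and memoized.

from typing import Callable

def generator(fn: Callable) -> Callable:
    fn._recompute_per_consumer = True  # hypothetical marker a framework executor could inspect
    return fn

@generator
def no_memo() -> dict:
    return {}

# An executor honoring the marker would effectively call no_memo() once per consumer,
# so each downstream function mutates its own fresh dict:
assert {**no_memo(), 'func1': 'val1'} == {'func1': 'val1'}
assert {**no_memo(), 'func2': 'val2'} == {'func2': 'val2'}
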
jdonaldson commented 1 year ago

FWIW I had this same need in my own framework. I attached a handle to the generator for each method field, and it autowired itself with upstream function arguments the same way you have done. E.g. func2.generator() would directly invoke the created generator, which in turn would pull in output from the no_memo generator to satisfy its named arguments, etc.

In my version, all functions behaved more or less like generators. Standard functions that returned values would just have those values cached and re-used on subsequent calls. Those repeated objects were read only.
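
A rough, framework-agnostic illustration of that idea (hypothetical wrapper, just to illustrate; not the framework's actual code): plain functions become infinite generators that replay a cached value, while "fresh" nodes yield a new object on every pull.

from typing import Any, Callable, Generator

def as_cached_generator(fn: Callable[..., Any]) -> Callable[..., Generator[Any, None, None]]:
    # Wrap a plain function so it behaves like an infinite generator of its computed-once result.
    def gen(*args: Any, **kwargs: Any) -> Generator[Any, None, None]:
        value = fn(*args, **kwargs)  # computed once...
        while True:
            yield value              # ...and replayed on every subsequent pull
    return gen

def fresh_dict() -> Generator[dict, None, None]:
    # A node that should be recomputed instead yields a brand-new object each time.
    while True:
        yield {}

cached = as_cached_generator(lambda: {"a": 1})()
assert next(cached) is next(cached)    # cached: same object on every pull
fresh = fresh_dict()
assert next(fresh) is not next(fresh)  # fresh: new object on every pull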

elijahbenizzy commented 1 year ago

FWIW I had this same need in my own framework. I attached a handle to the generator for each method field, and it autowired itself with upstream function arguments the same way you have done. E.g. func2.generator() would directly invoke the created generator, which in turn would pull in output from the no_memo generator to satisfy its named arguments, etc.

In my version, all functions behaved more or less like generators. Standard functions that returned values would just have those values cached and re-used on subsequent calls. Those repeated objects were read only.

Oh awesome -- curious, is your framework open-source? Would love to compare notes. And I like that idea of all functions behaving like generators -- it has a clean feel to it, and naturally implements memoization. I guess the interesting thing is that these are generators with two modes:

  1. Return a fixed value (e.g. cached -- an infinite generator returning the same thing)
  2. Return a new value each time (uncached -- an infinite generator being re-called)

Whereas the generator abstraction allows for a lot more (different values every time, etc.), so I would argue that implementing them using generators might make sense, but exposing that to the user might be a little confusing? But yeah! Would love to see what you came up with!

skrawcz commented 1 year ago

In my version, all functions behaved more or less like generators. Standard functions that returned values would just have those values cached and re-used on subsequent calls. Those repeated objects were read only.

@jdonaldson Did you end up having a chain of them? If so, how did that work and what kind of application did that enable?

jdonaldson commented 1 year ago

Yeah, it's a chain.

It worked by precomputing the DAG from the field names (as you have done). However, it would also augment each decorated method with its own generator function that resolved the specific path through the DAG, and then yielded data through the chain of functions itself as an iterator.

I'm checking to see if you see a way to integrate this functionality at a high level. If one can stream in this fashion, it's possible to build really sophisticated pipelines for training and serving from a single method. This is totally my jam for doing my modeling work, and I've seen so many people "almost" get it right (dagster). The streaming/yielding capability is key. I'm stubborn on that, and totally flexible on pretty much everything else.

I'm not partial in how one exposes the generator from a decorated DAG node. I just think it makes sense to have a "generator" function field, so it's clear what its basic behavior is. If you have better ways of peeling the banana here, I'm all ears.
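
For a flavor of the streaming idea, here's a toy sketch (illustrative only, not the actual implementation): each node is a generator that pulls minibatches from its upstream node, so a single chain can drive both training-style and serving-style iteration.

from typing import Iterator, List

def batches() -> Iterator[List[int]]:
    # Source node: stream the data as minibatches.
    data = list(range(10))
    for i in range(0, len(data), 4):
        yield data[i:i + 4]

def normalized(batches: Iterator[List[int]]) -> Iterator[List[float]]:
    # Downstream node: transform each minibatch as it arrives.
    for batch in batches:
        scale = max(batch) or 1
        yield [x / scale for x in batch]

for minibatch in normalized(batches()):
    print(minibatch)  # streamed one minibatch at a time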

jdonaldson commented 1 year ago

The application is basically any kind of composable DAG structure for ML pipelines. I've used them to sketch out complex DS transformations, generate notebooks, investigate ML model behaviors, and then deploy training/inference endpoints. It's my one-stop-shop approach for all this. I'd really love to find a home for the technique that is not a totally new project.

jdonaldson commented 1 year ago

Also, I'm sure this is named after Hamiltonian paths, which are node traversals in a graph. However, that's not specifically DAG-related, is it? I might be missing a reference there.

skrawcz commented 1 year ago

It worked by precomputing the DAG from the field names (as you have done). However, it would also augment each decorated method with its own generator function that resolved the specific path through the DAG, and then yielded data through the chain of functions itself as an iterator.

I'm checking to see if you see a way to integrate this functionality at a high level. If one can stream in this fashion, it's possible to build really sophisticated pipelines for training and serving from a single method. This is totally my jam for doing my modeling work, and I've seen so many people "almost" get it right (dagster). The streaming/yielding capability is key. I'm stubborn on that, and totally flexible on pretty much everything else.

I'm not partial in how one exposes the generator from a decorated DAG node. I just think it makes sense to have a "generator" function field, so it's clear what its basic behavior is. If you have better ways of peeling the banana here, I'm all ears.

We have a bunch of flexibility with how Hamilton is set up. We also have some plans to refactor some internals, which we'd need to do if we want to enable chaining generators. So happy to think about how to approach this. Otherwise, our style of development is to design an API we'd want users to use, see how that looks, and work from there. Do you have a toy use case we could try to code up?

Also, I'm sure this is named after Hamilton Paths, which is a node traversal in a graph. However it's not specifically DAG related, is it? I might be missing a reference there.

The origin story of the name is threefold:

jdonaldson commented 1 year ago

Ok, I'm sold on the name if it has anything to do with Alexander Hamilton. Graph connections are a bonus. :)

I'll put the code together in a repo here soon with a notebook showing its use. I'll post a link to it here.

elijahbenizzy commented 1 year ago

Ok, I'm sold on the name if it has anything to do with Alexander Hamilton. Graph connections are a bonus. :)

I'll put the code together in a repo here soon with a notebook showing its use. I'll post a link to it here.

Awesome! Oh, man, I'm going to have to add this story in the documentation, but...

When we were coming up with Hamilton, I prototyped a framework (that looked a lot like dagster/burr) -- e.g. where you defined transformations and wired them together separately, as opposed to Hamilton where they're the same. I called it "burr", and the team we were presenting it to liked "hamilton" (@skrawcz's design) a lot better :)

jdonaldson commented 1 year ago

LOL, I'm sure methods will need to be written a certain way with this approach... Instead of Ruby on Rails, I propose we call the approach Hamilton on Ice.

elijahbenizzy commented 1 year ago

LOL, I'm sure methods will need to be written a certain way with this approach... Instead of Ruby on Rails, I propose we call the approach Hamilton on Ice.

Hahahaha love it!

jdonaldson commented 1 year ago

Ok, here's a simple example showing how the generator might work : https://github.com/jdonaldson/hamilton-on-ice/blob/main/notebooks/Streaming%20Example.ipynb

jdonaldson commented 1 year ago

@artifacts are basically functions with a single return. Those get repeated each iteration. I use those for things like config and model artifacts.

jdonaldson commented 1 year ago

I'll do a proper streaming example next, just wanted to show baby steps. I added in a small utility to render a DAG there at the bottom.

skrawcz commented 1 year ago

@jdonaldson took a look. Looks quite interesting!

It seems like you're using classes like Hamilton uses modules?

Things I'm thinking about:

  1. with Hamilton, we can "inject" decorators as we walk the DAG, so would we need to annotate e.g. @jsonl directly?
  2. what's the "driver" API that we'd want here.
  3. we should set up time to chat more about the API and experience we'd want someone to have -- and importantly gotchas to avoid/ways to help people better debug/understand errors.

Otherwise a more realistic example would help me grok usage better.

jdonaldson commented 1 year ago

Thanks for checking it out! It's definitely more convoluted for a basic training/transform use case, and that definitely raises a legitimate concern. The payoff I'm arguing for here is that streaming interfaces for DAGs can be used for both training and inference pipelines, as long as one can accept a parameterized minibatch technique for training and serving scenarios. The rest of the implementation here tries to serve that goal, and who knows? Maybe you know a better way to serve the underlying need. I'm not really married to my implementation at all. It just works more or less in a useful way for me for now.

I have some thoughts/suggestions here on your questions:

with Hamilton, we can "inject" decorators as we walk the DAG, so would we need to annotate e.g. @jsonl directly?

The choice of decoration there gives one more control over what specifically happens at that step. For instance, one may stream results of one transformation to a series of featherized pandas files on S3, while another transformation may need to go to a database. Then, when the pipeline is deployed, everything needs to be serialized in a different way with credentials located in the ENV. I really don't want a framework to solve everything here for me. I'm happy with a decorator I can sprinkle around on methods, and wire the complex behaviors with custom Python code. I run into so much weird stuff sometimes interacting with different CRUD services, and this is a great way of dealing with it. That said, the class-oriented foundation of the code here permits hierarchical configuration with decorators. One can decorate methods, or simply decorate the class, which would decorate the methods.

what's the "driver" API that we'd want here.

I think I know what you mean, but there are a number of potentially different meanings for "driver" here. Let's chat.

we should set up time to chat more about the API and experience we'd want someone to have -- and importantly gotchas to avoid/ways to help people better debug/understand errors.

Exactly! Happy to do so.

elijahbenizzy commented 1 year ago

OK, I had some thoughts about this, but I wanted to move issues as we've moved a bit beyond this one :) I do think the implementation of this issue could be solved by the one in #19, but that's a slightly more general case so I'm going to pen my thoughts there. It's just one API -- I think there are other approaches as well...

skrawcz commented 2 months ago

I think we got some of this functionality with the Parallelizable / Collect constructs. Please re-open if not.
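
For reference, roughly how that maps onto this issue (a sketch based on the docs; exact imports and driver setup may differ): a Parallelizable node yields one value per branch, each downstream node runs once per yielded value, and Collect gathers the results, so nothing is memoized across branches.

from hamilton.htypes import Collect, Parallelizable

def fresh_dict(n_sections: int) -> Parallelizable[dict]:
    # Yield a brand-new dict per branch -- each downstream run gets its own object.
    for _ in range(n_sections):
        yield {}

def section(fresh_dict: dict) -> dict:
    # Runs once per yielded dict; no shared object across branches.
    fresh_dict['built'] = True
    return fresh_dict

def report(section: Collect[dict]) -> list:
    # Gather the per-branch results back together.
    return list(section)

# Note: dynamic execution has to be enabled on the Driver (via the Builder) for
# Parallelizable/Collect to run -- see the Hamilton docs for the exact setup.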

jdonaldson commented 2 months ago

Yeah, I need to play with the new stuff there. I think you more or less have it with some of your recent features, and I just need to pull it together again at some point.