Closed gravesee closed 2 months ago
Ok, I think I get what's happening. It's a side-effect, so I think we want to model it as such... I think getting it to recompute every time might be tricky (the memoization is a pretty core feature), but IMO we have quite a few options to make the code clean. Before I dig in I want to make sure that I understand it properly -- any chance you could post a little code/pseudocode to validate?
Also happy to jump on a call and walk through -- might make the back and forth faster!
Here is a simple use case that should pass the tests if the requested feature is added to Hamilton:
```python
def no_memo() -> dict:
    return {}

def func1(no_memo: dict) -> dict:
    no_memo['func1'] = 'val1'
    return no_memo

def func2(no_memo: dict) -> dict:
    no_memo['func2'] = 'val2'
    return no_memo

if __name__ == "__main__":
    import __main__ as main
    from hamilton.driver import Driver
    from hamilton.base import SimplePythonGraphAdapter, DictResult

    dr = Driver({}, main, adapter=SimplePythonGraphAdapter(DictResult))
    res = dr.execute(["func1", "func2"])
    print(res)
    # Current (memoized) output:
    # {'func1': {'func1': 'val1', 'func2': 'val2'},
    #  'func2': {'func1': 'val1', 'func2': 'val2'}}
    assert res['func1'] == {"func1": "val1"}
    assert res['func2'] == {"func2": "val2"}
```
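For what it's worth, the clash with memoization can be reproduced without Hamilton at all. A minimal stand-in (the `cache` dict below is illustrative, not Hamilton's internals) shows why both results end up containing both keys: the consumers mutate one shared object.

```python
# Minimal illustration (no Hamilton): memoizing a mutable node value means
# every downstream consumer receives -- and mutates -- the SAME dict.
cache = {}

def no_memo() -> dict:
    # With memoization, the body effectively runs once; the dict is shared.
    if "no_memo" not in cache:
        cache["no_memo"] = {}
    return cache["no_memo"]

def func1() -> dict:
    d = no_memo()
    d["func1"] = "val1"
    return d

def func2() -> dict:
    d = no_memo()
    d["func2"] = "val2"
    return d

res = {"func1": func1(), "func2": func2()}
# Both results alias the same dict, so both contain both keys:
assert res["func1"] == {"func1": "val1", "func2": "val2"}
assert res["func2"] is res["func1"]
```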
Yep -- got it.
Ok, so I get what you're doing/asking, but I'm not 100% convinced this is the right solution to the original problem. Of two minds -- we have a few options:

- `@parameterize` to create a bunch of these, each individually labeled, then refer to those as the upstream; `.get()` calls it. This is like the `.next()` call but slightly cleaner.
- `@subdag` to create a bunch of them + the downstream functions (messy, but this is kind of nice if you're doing similar transformations in each module and want to share/not repeat logic -- a more advanced version of (1)).

Honestly? Not sure if any of these make you happy -- happy to walk you through them in more detail/prototype. Looking above, none of them would be quite as simple a way to solidify the use-case as supporting it the way you want :) I'm partial to (1), as these objects are supposed to be separate, and I'm not convinced the DAG shouldn't think of them as that. I also like (4), if you don't mind the instantiation of the object being shared with the function.

`@forget()` is a nice way to tell the framework what's a side-effect and what's not. Thoughts?
My use-case isn't a dataframe problem so I understand if it doesn't fit the Hamilton philosophy. Specifically, I have to instantiate a `docx.Document` and pass it a template from my local filesystem. It's two lines of code that I end up repeating in a lot of node functions. It's kind of a side-effect, but I think a similar use-case would be a node that returns the current time every time it is passed as a dependency -- to, say, understand the execution order and clock time of the DAG. Is that a side-effect?

I think option 2 could solve my problem fairly cleanly. The dependencies would be typed correctly, so that information would be there for the user to see. And calling `.get` isn't that bad. But if this is still under consideration, I like the `@forget` decorator.
Yep -- I don't think yours is too far out of the ordinary. Hamilton was built initially for dataframes, but we've seen a wide array of use-cases here. Now that I'm thinking about it more, instead of `@forget`, I'd propose `@resource` that takes in some sort of `policy`. Again, drawing from pytest fixtures :)
```python
@resource(policy=FORGET)
def document() -> docx.Document:
    # how you want it to behave
    ...

@resource(policy=CACHE)
def document() -> docx.Document:
    # how it currently behaves
    ...
```
Re: side effects, I think yours barely counts -- it's more *state*. In which case, it's just a way to make it easier to produce something of the given state: after the dependent node gets executed, the state becomes immutable. Overall it's side-effect free, so maybe that's a reasonable way to approach it?
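One way such a decorator could work is as a tag that the executor consults before reusing a memoized value. A minimal sketch, assuming nothing about Hamilton's internals -- `Policy`, `resource`, and `resolve` are hypothetical names, not Hamilton API:

```python
from enum import Enum, auto
from typing import Any, Callable, Dict

class Policy(Enum):
    FORGET = auto()  # recompute on every access
    CACHE = auto()   # compute once and reuse (current behavior)

def resource(policy: Policy) -> Callable:
    """Hypothetical decorator: attach a caching policy to a node function."""
    def wrapper(fn: Callable) -> Callable:
        fn.__policy__ = policy
        return fn
    return wrapper

def resolve(fn: Callable, cache: Dict[str, Any]) -> Any:
    """Toy executor step: honor the node's policy when producing its value."""
    policy = getattr(fn, "__policy__", Policy.CACHE)
    if policy is Policy.CACHE and fn.__name__ in cache:
        return cache[fn.__name__]
    value = fn()
    cache[fn.__name__] = value
    return value

@resource(policy=Policy.FORGET)
def document() -> dict:  # stand-in for docx.Document()
    return {}

cache: Dict[str, Any] = {}
assert resolve(document, cache) is not resolve(document, cache)  # fresh each time
```

A `CACHE`-tagged node would flow through the memoized branch unchanged.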
I'd say that the DAG-tracking example should probably be some other sort of hook (e.g. through graph adapters/whatnot) rather than a dependency -- IMO it has the potential to confuse the user when visualizing the DAG... We've prototyped/are working with stuff that does exactly this internally, if you're curious.

@skrawcz thoughts on the `resource` decorator?
Just thinking through all the ways and what the code would look like -- here's the infinite generator way you describe (sorry had to write it out):
```python
from typing import Generator

def no_memo() -> Generator[dict, None, None]:
    def dict_generator():
        yield {}
    return dict_generator

def func1(no_memo: Generator[dict, None, None]) -> dict:
    _no_memo = next(no_memo())
    _no_memo['func1'] = 'val1'
    return _no_memo

def func2(no_memo: Generator[dict, None, None]) -> dict:
    _no_memo = next(no_memo())
    _no_memo['func2'] = 'val2'
    return _no_memo

if __name__ == "__main__":
    import __main__ as main
    from hamilton.driver import Driver
    from hamilton.base import SimplePythonGraphAdapter, DictResult

    dr = Driver({}, main, adapter=SimplePythonGraphAdapter(DictResult))
    res = dr.execute(["func1", "func2"])
    print(res)
    assert res['func1'] == {"func1": "val1"}
    assert res['func2'] == {"func2": "val2"}
```
We could have some smarts around knowing the types and if it's a (basic) generator, to call next() and pass that in, e.g. effectively enable this API:
```python
from typing import Generator

def no_memo() -> Generator[dict, None, None]:
    def dict_generator():
        yield {}
    return dict_generator

def func1(no_memo: dict) -> dict:
    no_memo['func1'] = 'val1'
    return no_memo

def func2(no_memo: dict) -> dict:
    no_memo['func2'] = 'val2'
    return no_memo
```
Thoughts? I.e. if the function takes in the generator, then don't do anything; else check that the annotation matches the type on the function, and if so, do `next()` in the framework? I like the generator approach because I think it makes it clear to anyone reading that it's going to be called multiple times... (though I need to figure out a way to annotate things so my static analyzer doesn't complain).
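Outside of Hamilton, the proposed framework-side smarts might look roughly like this -- `resolve_argument` is a hypothetical stand-in for the dependency-injection step, comparing the producer's return annotation against what the consumer declares:

```python
import inspect
from collections import abc
from typing import Generator, get_origin

def no_memo() -> Generator[dict, None, None]:
    # As in the sketch above: the node's value is a factory for a generator.
    def dict_generator():
        yield {}
    return dict_generator

def func1(no_memo: dict) -> dict:
    no_memo['func1'] = 'val1'
    return no_memo

def resolve_argument(producer, consumer, name):
    """Toy framework step: if the producer is annotated as a Generator but
    the consumer asks for the yielded type, do the next() call on its behalf."""
    declared = inspect.signature(producer).return_annotation
    wanted = inspect.signature(consumer).parameters[name].annotation
    value = producer()
    if get_origin(declared) is abc.Generator and wanted is not declared:
        return next(value())  # unwrap one freshly-yielded value
    return value

result = func1(resolve_argument(no_memo, func1, 'no_memo'))
assert result == {'func1': 'val1'}
```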
Intriguing, but this feels like a messy thing to put on the framework. I don't really agree that it's clear that something is getting called multiple times (maybe this is because the `Generator` type parameters are super confusing in Python). We don't have any other instances where the type of a function determines how it should be called -- it's all been in the decorators.

IMO if we're going to start doing more magical meta-programming, we want to limit the space to reduce cognitive burden. Decorators are meant to do that, but if we also change the output type and do something special depending on it, that's just more moving pieces to figure out how a dataflow should work.
> Intriguing, but this feels like a messy thing to put on the framework.

It's not any messier than the alternatives for having the framework deal with this.

> Decorators are meant to do that,

Good idea. This looks simpler:
```python
@generator
def no_memo() -> dict:
    return {}  # or perhaps this should be `yield {}`?

def func1(no_memo: dict) -> dict:
    no_memo['func1'] = 'val1'
    return no_memo

def func2(no_memo: dict) -> dict:
    no_memo['func2'] = 'val2'
    return no_memo
```
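Under the hood, a `@generator` decorator like this could simply tag the node so the executor bypasses its cache. A rough sketch -- `get_value` and the `__recompute__` attribute are illustrative, not Hamilton internals:

```python
from typing import Callable

def generator(fn: Callable) -> Callable:
    """Hypothetical decorator: mark a node to be re-invoked for every
    dependent node instead of being memoized."""
    fn.__recompute__ = True
    return fn

@generator
def no_memo() -> dict:
    return {}

def plain() -> dict:
    return {}

def get_value(fn: Callable, cache: dict):
    """Toy executor: skip the cache only for nodes marked @generator."""
    if getattr(fn, "__recompute__", False):
        return fn()
    return cache.setdefault(fn.__name__, fn())

cache: dict = {}
assert get_value(no_memo, cache) is not get_value(no_memo, cache)  # fresh dicts
assert get_value(plain, cache) is get_value(plain, cache)          # memoized
```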
FWIW I had this same need in my own framework. I attached a handle to the generator for each method field, and it autowired itself with upstream function arguments the same way you have done. E.g. `func2.generator()` would directly invoke the created generator, which in turn would pull in output from the `no_memo` generator to satisfy its named arguments, etc.

In my version, all functions behaved more or less like generators. Standard functions that returned values would just have those values cached and re-used on subsequent calls. Those repeated objects were read-only.
Oh awesome -- curious, is your framework open-source? Would love to compare notes. And I like that idea of all functions behaving like generators -- it has a clean feel to it, and it naturally implements memoization. I guess the interesting thing is that these are generators with two modes, whereas the generator abstraction allows for a lot more (different values every time, etc.). So I would argue that implementing them using generators might make sense, but exposing that to the user might be a little confusing? But yeah! Would love to see what you came up with!
> In my version, all functions behaved more or less like generators. Standard functions that returned values would just have those values cached and re-used on subsequent calls. Those repeated objects were read only.

@jdonaldson Did you end up having a chain of them? If so, how did that work, and what kind of application did that enable?
Yeah, it's a chain.

It worked by precomputing the DAG from the field names (as you have done). However, it would also augment each decorated method with its own generator function that resolved the specific path through the DAG, and then yielded data through the chain of functions itself as an iterator.
I'm checking to see if you see a way to integrate this functionality at a high level. If one can stream in this fashion, it's possible to build really sophisticated pipelines for training and serving from a single method. This is totally my jam for doing my modeling work, and I've seen so many people "almost" get it right (dagster). The streaming/yielding capability is key. I'm stubborn on that, and totally flexible on pretty much everything else.
I'm not partial in how one exposes the generator from a decorated DAG node. I just think it makes sense to have a "generator" function field, so it's clear what its basic behavior is. If you have better ways of peeling the banana here, I'm all ears.
The application is basically any kind of composable DAG structure for ML pipelines. I've used them to sketch out complex DS transformations, generate notebooks, investigate ML model behaviors, and then deploy training/inference endpoints. It's my one-stop-shop approach for all this. I'd really love to find a home for the technique that is not a totally new project.
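The chained-generator idea can be sketched in plain Python. The node names, the fake minibatches, and the manual composition below are illustrative, not the actual implementation -- resolving the path through the DAG amounts to composing the generators:

```python
from typing import Iterator

def source() -> Iterator[list]:
    # Stand-in for a real minibatch stream (files, a database cursor, etc.).
    for batch in ([1, 2], [3, 4], [5, 6]):
        yield batch

def scaled(upstream: Iterator[list]) -> Iterator[list]:
    # Transform node: consumes upstream batches, yields transformed ones.
    for batch in upstream:
        yield [x * 10 for x in batch]

def summed(upstream: Iterator[list]) -> Iterator[int]:
    # Aggregation node: one value per batch, still streaming.
    for batch in upstream:
        yield sum(batch)

# The resolved path through the DAG is just generator composition, so the
# same node definitions can serve both training and inference streams.
pipeline = summed(scaled(source()))
assert list(pipeline) == [30, 70, 110]
```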
Also, I'm sure this is named after Hamiltonian paths, i.e. a node traversal in a graph. However, that's not specifically DAG-related, is it? I might be missing a reference there.
We have a bunch of flexibility with how Hamilton is set up. We also have plans to refactor some internals, which we'd need to do if we want to enable chaining generators. So, happy to think about how to approach this. Otherwise, our style of development is to design an API we'd want users to use, see how that looks, and work from there. Do you have a toy use case we could try to code up?
> Also, I'm sure this is named after Hamilton Paths, which is a node traversal in a graph. However it's not specifically DAG related, is it? I might be missing a reference there.
Origin story of the name is threefold:
Ok, I'm sold on the name if it has anything to do with Alexander Hamilton. Graph connections are a bonus. :)
I'll put the code together in a repo here soon with a notebook showing its use. I'll post a link to it here.
Awesome! Oh, man, I'm going to have to add this story in the documentation, but...
When we were coming up with Hamilton, I prototyped a framework (that looked a lot like dagster/burr) -- e.g. one where you defined transformations and wired them together separately, as opposed to Hamilton, where they're the same. I called it "burr", and the team we were presenting it to liked "hamilton" (@skrawcz's design) a lot better :)
LOL, I'm sure methods will need to be written a certain way with this approach... Instead of Ruby on Rails, I propose we call the approach Hamilton on Ice.
Hahahaha love it!
Ok, here's a simple example showing how the generator might work : https://github.com/jdonaldson/hamilton-on-ice/blob/main/notebooks/Streaming%20Example.ipynb
@artifacts are basically functions with a single return. Those get repeated each iteration. I use those for things like config and model artifacts.
I'll do a proper streaming example next -- just wanted to show baby steps. I added a small utility to render a DAG there at the bottom.
@jdonaldson took a look. Looks quite interesting!
It seems like you're using classes like Hamilton uses modules?
Things I'm thinking about:
Otherwise a more realistic example would help me grok usage better.
Thanks for checking it out! It's definitely more convoluted for a basic training/transform use case, and that definitely raises a legitimate concern. The payoff I'm arguing for here is that streaming interfaces for DAGs can be used for both training and inference pipelines, as long as one can accept a parameterized minibatch technique for training and serving scenarios. The rest of the implementation here tries to serve that goal, and who knows? Maybe you know a better way to serve the underlying need. I'm not really married to my implementation at all. It just works more or less in a useful way for me for now.
I have some thoughts/suggestions here on your questions:
> with Hamilton, we can "inject" decorators as we walk the DAG, so would we need to annotate e.g. @jsonl directly?
The choice of decoration there gives one more control over what specifically happens at that step. For instance, one may stream results of one transformation to a series of featherized panda files on s3, while another transformation may need to go to a database. Then, when the pipeline is deployed, everything needs to be serialized in a different way with credentials located in the ENV. I really don't want a framework to solve everything here for me. I'm happy with a decorator I can sprinkle around on methods, and wire the complex behaviors with custom python code. I run into so much weird stuff sometimes interacting with different CRUD services, and this is a great way of dealing with it. That said, the class oriented foundation of the code here permits hierarchical configuration with decorators. One can decorate methods, or simply decorate the class, which would decorate the methods.
> what's the "driver" API that we'd want here.

I think I know what you mean, but there are a number of potentially different meanings for "driver". Let's chat.
> we should set up time to chat more about the API and experience we'd want someone to have -- and importantly gotchas to avoid/ways to help people better debug/understand errors.
Exactly! Happy to do so.
OK, I had some thoughts about this, but I wanted to move issues, as we've gone a bit beyond this one :) I do think the implementation of this issue could be solved by the one in #19, but that's a slightly more general case, so I'm going to pen my thoughts there. It's just one API -- I think there are other approaches as well...
I think we got some of this functionality with the Parallelizable / Collect constructs. Please re-open if not.
Yeah I need to play with the new stuff there, I think you more or less have it with some of your recent features, and I just need to pull it together again at some point.
**Is your feature request related to a problem? Please describe.**
For a documentation project I am using Hamilton to generate Word output using the python-docx package. I have nodes representing sections of the Word document. These sections are organized into their own modules. The function code takes a `docx.Document` instance that should not be shared with other sections. Once the sections are all computed, I use a GraphAdapter to combine them into a single Word document. Currently, I am instantiating a new instance of `docx.Document` in each section. I would like to put this configuration in a single node that gets re-computed each time it is used as a dependency.

**Describe the solution you'd like**
A decorator that instructs the driver to always recompute the node.

**Describe alternatives you've considered**
I have tried creating an infinite generator that returns the instantiated `docx.Document`, but the usage within other nodes is ugly (requiring a call to `next`).

**Additional context**